Amjad Izhar Blog

Category: Artificial Intelligence (AI)

Prompt Engineering Fundamentals
This course material introduces prompt engineering, focusing on practical application rather than rote memorization of prompts. It explains how large language models (LLMs) function, emphasizing the importance of understanding their underlying mechanisms—like tokens and context windows—to craft effective prompts. The course uses examples and exercises to illustrate how prompt design impacts LLM outputs, covering various techniques like using personas and custom instructions. It stresses the iterative nature of prompt engineering and the ongoing evolution of the field. Finally, the material explores the potential of LLMs and the ongoing debate surrounding artificial general intelligence (AGI).

Prompt Engineering Study Guide

Quiz

Instructions: Answer the following questions in 2-3 sentences each.
1. What is the main focus of the course, according to the instructor?
2. Why is prompt engineering a skill, not a career, in the instructor’s opinion?
3. How did the performance of large language models change as they got larger?
4. What is multimodality, and what are four things a leading LLM can do?
5. What is the purpose of the playground mentioned in the course?
6. What are tokens, and how are they used by large language models?
7. What is temperature in the context of language models, and how does it affect outputs?
8. Explain the “reversal curse” phenomenon in large language models.
9. What are the two stages of training for large language models?
10. How does the system message influence the model’s behavior?
Quiz Answer Key
1. The main focus of the course is working with large language models, teaching how to use this new technology effectively in various aspects of work and life. It is not focused on selling pre-made prompts but on understanding the models themselves.
2. The instructor believes that prompt engineering is a skill that enhances any job, not a standalone career. He argues that it’s a crucial skill for efficiency, not a profession in itself.
3. As models increased in size, performance at certain tasks did not increase linearly but instead skyrocketed, with new abilities emerging that weren’t present in smaller models. This was an unexpected and non-linear phenomenon.
4. Multimodality is the ability of LLMs to understand and generate not only text, but also other modalities like images, the internet, and code. LLMs can accept and generate text, accept images, browse the internet, and execute python code.
5. The playground is a tool that allows users to experiment with and test the different settings of large language models. It is a space where one can fine-tune and better understand the model’s outputs.
6. Tokens are the way that LLMs understand and speak; they are smaller pieces of words that the model analyzes. LLMs determine the sequence of tokens most statistically probable to follow your input, based on training data.
7. Temperature is a setting that controls the randomness of the output of large language models. Lower temperature makes the output more predictable and formalistic, while higher temperature introduces randomness and can lead to creativity or gibberish.
8. The reversal curse refers to the phenomenon where an LLM can know a fact but fail to provide it when asked in a slightly reversed way. For example, it may know that Tom Cruise’s mother is Mary Lee Pfeiffer but not that Mary Lee Pfeiffer is Tom Cruise’s mother.
9. The two stages are pre-training and fine-tuning. In pre-training, the model learns patterns from a massive text dataset. In fine-tuning, a base model is adjusted to be an assistant, typically through supervised learning.
10. The system message acts as a “North Star” for the model, it provides a set of instructions or context at the outset that directs how the model should behave and interact with users. It is the model’s guiding light.
Essay Questions

Instructions: Answer the following questions in essay format. There is no single correct answer for any of the questions.
1. Discuss the concept of emergent abilities in large language models. How do these abilities relate to the size of the model, and what implications do they have for the field of AI?
2. Explain the Transformer model, and discuss why it was such a significant breakthrough in natural language processing. How has it influenced the current state of AI technologies?
3. Critically analyze the role of the system message in prompt engineering. In what ways can it be used to both enhance and undermine the functionality of an LLM?
4. Explore the role of context in prompt engineering, discussing both its benefits and potential pitfalls. How can prompt engineers effectively manage context to obtain the most useful outputs?
5. Discuss the various strategies employed throughout the course to trick or “break” an LLM. What do these strategies reveal about the current limitations of AI technology?
Glossary of Key Terms

Artificial Intelligence (AI): A broad field of computer science focused on creating intelligent systems that can perform tasks that typically require human intelligence.

Base Model: The initial output of the pre-training process in large language model development. It is a model that can do language completion, but is not yet conversational.

Context: The information surrounding a prompt, including previous conversation turns, relevant details, and additional instructions that help a model understand the task.

Context Window: The maximum number of tokens that a large language model can consider at any given time in a conversation. Also known as token limit.

Custom Instructions: User-defined instructions in platforms like ChatGPT that affect every conversation with a model.

Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers to analyze data.

Emergent Abilities: Unforeseen abilities that appear in large language models as they scale up in size, which are not explicitly coded but rather learned.

Fine-Tuning: The process of adapting a base model to specific tasks and use cases, usually through supervised learning.

Large Language Model (LLM): A type of AI model trained on vast amounts of text data, used to generate human-like text.

Machine Learning: A subset of AI that enables systems to learn from data without being explicitly programmed.

Mechanistic Interpretability: The field of study dedicated to figuring out what’s happening when tokens pass through all the various layers of the model.

Multimodality: The ability of a language model to process and generate information beyond text, such as images, code, and internet browsing.

Natural Language Processing (NLP): A branch of AI that enables computers to understand, interpret, and generate human language.

Parameters: The internal variables of a large language model that it learns during training, affecting its ability to make predictions.

Persona: The role or identity given to a language model, which influences its tone, style, and the way it responds.

Pre-Training: The initial phase of large language model training, where the model is exposed to massive amounts of text data to learn patterns.

Prompt Engineering: The practice of designing effective prompts that can elicit the desired responses from AI models, particularly large language models.

System Message: The initial instructions or guidelines provided to a large language model by the model creator, which establishes its behavior and role. Also known as meta-prompt or system prompt.

Temperature: A parameter in large language models that controls the randomness of the output. Higher temperature leads to more diverse outputs, while lower temperatures produce more predictable responses.

Tokens: The basic units of text processing for large language models. They are often sub-word units that represent words, parts of words, or spaces.

Transformer Model: A neural network architecture that uses the “attention” mechanism to process sequences of data, such as text, enabling large language models to consider context over long ranges.

Prompt Engineering: Mastering Large Language Models

Okay, here is a detailed briefing document summarizing the key themes and ideas from the provided text, incorporating quotes where appropriate:

Briefing Document: Prompt Engineering Course Review

Introduction:

This document summarizes the main concepts discussed in a course focused on working with Large Language Models (LLMs), often referred to as “prompt engineering.” The course emphasizes practical application and understanding the mechanics of LLMs, rather than rote memorization of specific prompts. It highlights the importance of viewing prompt engineering as a multi-disciplinary skill, rather than a career in itself, for most individuals.

Key Themes and Ideas:
1. Prompt Engineering is More Than Just Prompts:
- The course emphasizes that true “prompt engineering” is not about memorizing or using pre-made prompts. As the instructor Scott states, “it’s not about teaching you 50 promps to boost your productivity…you’re going to learn to work with these large language models.”
- Scott believes that “there are plenty of people out there trying to sell you prompt libraries I think those are useless. They’re single prompts that are not going to produce exactly what you need for your work.” Instead, the course aims to teach how LLMs work “under the hood” so users can create effective prompts for their specific use cases.
1. Prompt Engineering as a Multi-Disciplinary Skill:
- The course defines prompt engineering as “a multi-disciplinary branch of engineering focused on interacting with AI through the integration of fields such as software engineering, machine learning, cognitive science like psychology, business, philosophy, computer science.”
- It stresses that “whatever your area of expertise is…you are going to be able to take that perspective and add it to the field.” This is because the field is new and constantly evolving.
1. Understanding How LLMs Work is Crucial:
- The core idea of the course is that to effectively use LLMs, you need to understand how they function internally. This includes concepts like tokens, parameters, and the Transformer architecture.
- “you need to understand what’s going on behind the scenes so that you can frame your prompt in the right light.”
- The course emphasizes that LLMs are not simply coded programs that have pre-set responses but rather “trained on data and after that training certain abilities emerged.”
- Emergent abilities, new capabilities that appear as models scale in size, demonstrate that these are not simply predictable increases in performance. This “scaling up the model linearly should increase performance linearly, but that’s not what happened.”
1. LLMs are not perfect:
- The course emphasizes that, despite the impressiveness of LLMs, they are still prone to making mistakes due to a few reasons including user error and their design.
- “it’s because we’re not dealing with code or a computer program here in the traditional sense. We’re dealing with a new form of intelligence, something that was trained on a massive data set and that has certain characteristics and limitations.”
- The concept of “hallucinating”, where the LLM produces confident yet false statements, is also important to keep in mind.
1. Multimodality and Capabilities:
- LLMs can handle more than just text. They can process and generate images, browse the internet (to access current information), and execute code, particularly Python code.
- “it can accept and generate text, it can accept images, it can generate images, it can browse the internet…and it can execute python code.”
- The course walks through an example of an LLM creating and refining a simple game by using Python.
1. Tokens are the Foundation:
- LLMs understand and “speak” in tokens, which are sub-word units, not whole words. “one token is equal to about 0.75 words”.
- The model determines the most statistically probable sequence of tokens based on its training data, giving the impression of “guessing” the next word.
- A high temperature setting increases the randomness when picking tokens, leading to more casual and sometimes nonsensical outputs, while a low temperature setting produces more formal output.
1. The Importance of Context and its Limitations:
- Providing sufficient context in prompts improves accuracy.
- However, there is a limitation to the amount of context LLMs can handle at a given time (the token or context window).
- “every time you send a prompt your entire conversation history is bundled up and packed on to the prompt…chat GPT is essentially constantly reminding of your entire conversation.”
- Once the context window fills, older information starts to be forgotten and accuracy can be compromised. This happens without the user necessarily realizing it.
- Information provided at the beginning of a prompt has a larger impact and is remembered better than information provided at the end, in effect creating a “Primacy Effect”. Information in the middle is more readily forgotten. This process mimics how the human brain handles context.
1. The Power of Personas:
- Giving an LLM a specific persona or role (“you are an expert mathematician,” or even a character such as Bilbo Baggins) provides it with crucial context and improves the quality of responses. This allows the user to better interact with and leverage LLMs.
- Personas are often set via the system message or by custom instructions.
1. Custom Instructions
- Users can provide instructions that the LLM uses as its “North Star” much in the same way as a system message.
- These “custom instructions” are used for any new chat, however users may forget about these instructions which may cause problems.
1. LLMs and “Secrets”:
- LLMs are not designed to keep secrets and are susceptible to being tricked into revealing private information given the right prompt.
- The way these LLMs “think” with tokens also enables the spilling of tea by crafting prompts that circumvent normal parameters.
1. The LLM Landscape:
- The course breaks down the LLM landscape into base models, which are trained on data and then further fine-tuned to create chatbot interfaces or domain specific models. The Transformer architecture enables LLMs to pay attention to and incorporate a wider range of context.
- Different companies, such as OpenAI, Anthropic, and Meta, create various models, including open-source ones like Llama 2.
Practical Applications:
- The course focuses on practical applications of prompt engineering. It uses examples such as making a game and generating music using an AI.
- The skills learned in the course can be used to create chatbots, generate code, understand complex documents, and make other helpful outputs to assist in work, study, or just general life.
Conclusion:

This course aims to provide a deep understanding of LLMs and how to effectively interact with them through thoughtful prompt engineering. It prioritizes practical knowledge, emphasizing that it is a “skill” rather than a “career” for most individuals, and that this skill is important for everyone. It is constantly updated with the latest techniques for effective prompting. By understanding the underlying mechanisms and limitations of these models, users can leverage their immense potential in their work and lives.

Prompt Engineering and Large Language Models

Prompt Engineering and Large Language Models: An FAQ
1. What exactly is “prompt engineering” and why is it important?
2. While the term “prompt engineering” is commonly used, it’s essentially about learning how to effectively interact with large language models (LLMs) to utilize their capabilities in various work and life situations. Instead of focusing on memorizing specific prompts, it’s about understanding how LLMs work so you can create effective instructions tailored to your unique needs. It’s a multi-disciplinary skill, drawing from software engineering, machine learning, psychology, business, philosophy, and computer science, and it is crucial for harnessing the full potential of AI for efficiency and productivity. It is considered more of a skill that enhances various roles, rather than a job in and of itself.
3. Why is prompt engineering necessary if LLMs are so advanced?
4. LLMs aren’t just programmed with specific answers; they learn from vast datasets and develop emergent abilities. Prompt engineering is necessary because we’re not dealing with traditional code or programs. We’re working with a form of intelligence that has been trained to predict the most statistically probable sequence of tokens, given the prompt and its training data. By understanding how these models process information, you can learn to frame your prompts in a way that leverages their understanding, yielding more accurate results. Also, prompting techniques can elicit abilities from models that might not be present when prompted in more basic ways.
5. Are prompt libraries or pre-written prompts helpful for prompt engineering?
6. While pre-written prompts can introduce you to what’s possible with LLMs, they are generally not very useful for true prompt engineering. Each user’s needs are unique, so generic prompts are unlikely to provide the results you need for your specific work. You’re better off learning the underlying principles of how to interact with LLMs than memorizing a collection of single-use prompts. It’s about developing an intuitive understanding of how to phrase requests, which enables you to naturally create effective prompts for your situation.
7. What is multimodality in the context of LLMs and how can it be used?
8. Multimodality refers to an LLM’s ability to understand and generate text, images, and even code. This goes beyond simple text inputs and outputs. LLMs can take images as prompts and give text responses to them, browse the internet to access more current data, or even execute code to perform calculations. This means prompts can incorporate diverse inputs and generate diverse outputs, greatly expanding the potential ways that LLMs can be used.
9. What is the “playground” and why might someone use it?
10. The playground is an interface provided by OpenAI (and other companies) that allows you to experiment directly with different LLMs, as well as test advanced settings and features such as temperature (for randomness) and the probability of the next token. It’s an important tool for advanced users to understand how the underlying technology works and to test techniques such as different system messages before implementing them into their products or day-to-day work with AI. It’s relatively inexpensive to use the playground and is a good place to go for more in-depth experimentation with AI tools.
11. What are “tokens” and why are they important?
12. Tokens are the fundamental units that LLMs use to understand and generate language. They’re like words, but LLMs actually break words down into smaller pieces. One token is approximately equivalent to 0.75 words. LLMs do not see words the way humans do; instead they see tokens that have a numerical ID which is part of a complex lookup table. The LLM statistically predicts the most probable sequence of tokens to follow your input, which is why it is often described as a ‘word guessing machine’. A word can consist of multiple tokens. Understanding this helps you see how LLMs are processing information on a basic level. This basic understanding of tokens will help guide your prompts more effectively.
13. What is the significance of “system messages” or “meta prompts” in prompt engineering?
14. A system message is an initial, often hidden, instruction or context that’s provided to the LLM before it interacts with the user. It acts as a “North Star” for the model, guiding its behavior, tone, and style. The system message determines how the model responds to user input and how it will generally interpret all user prompts. Understanding system messages is vital, particularly if you are developing an application that incorporates an LLM. System messages can be modified to tailor the model to various tasks or use cases, but it’s important to be aware that a model will always be pulled back to its original system message. Also, adding specific instructions to the system message will help the model with complex instructions that you want the model to remember for each and every interaction.
15. What is context, and why is it important when prompting, and why does the rule of more context being better not always hold up?
16. Context refers to all the information or details that accompany a prompt, including past conversation history, instructions or details within the prompt itself, and even the system message. More context usually leads to better, more accurate responses. However, LLMs have a limited “token window” (or a context window) which sets a maximum amount of text or context they can manage at any one time. When you exceed this limit, older context tokens are removed. It is imperative that the most important information or context is placed at the beginning of the context window because models have a tendency to pay more attention to the first and last part of a context window, and less to the information in the middle. Additionally, too much context can actually decrease the accuracy of an LLM, because the model will sometimes pay less attention to relevant information, or become bogged down by less relevant information.
Prompt Engineering: A Comprehensive Guide

Prompt engineering is a critical skill that involves developing and optimizing prompts to efficiently use artificial intelligence for specific tasks [1, 2]. It is not typically a standalone career but a skill set needed to use AI effectively [1, 3]. The goal of prompt engineering is to use AI to become more efficient and effective in work and life [2, 3].

Key aspects of prompt engineering include:
- Understanding Large Language Models (LLMs): It is essential to understand how LLMs work under the hood to effectively utilize them when prompting [3]. These models are not simply code; they have emergent abilities that arise as they grow larger [4, 5]. They are sensitive to how prompts are framed, and even slight changes can lead to significantly different responses [2].
- Prompts as Instructions: Prompts are essentially the instructions and context provided to LLMs to accomplish tasks [2]. They are like seeds that grow into useful results [2].
- Elements of a Prompt: A basic prompt has two elements: the input (the instruction) and the output (the model’s response) [6].
- Not Just About Productivity: Prompt engineering is not just about using pre-made prompts to boost productivity. Instead, it is about learning to work with LLMs to utilize them for specific use cases [3, 7, 8].
- Multi-Disciplinary Field: Prompt engineering integrates fields such as software engineering, machine learning, cognitive science, business, philosophy, and computer science [9].
- Importance of Empirical Research: The field is undergoing a lot of research, and prompt engineering should be based on empirical research that shows what works and what doesn’t [10].
- Hands-On Experience: Prompt engineering involves hands-on demos, exercises, and projects, including coding and developing prompts [10]. It requires testing, trying things out, and iterating until the right output is achieved [11, 12].
- Natural Language: Prompt engineering is like programming in natural language. Like programming, specific words and sequences are needed to get the right result [6].
- Beyond Basic Prompts: It’s more than just asking a question; it’s about crafting prompts to meet specific needs, which requires understanding how LLMs work [6, 7, 13].
Applied Prompt Engineering involves using prompt engineering principles in the real world to improve work, career, or studies [13, 14]. It includes using models to complete complex, multi-step tasks [8].

Why Prompt Engineering is Important:
- Maximizing Potential: It is key to using LLMs productively and efficiently to achieve specific goals [8].
- Avoiding Errors and Biases: Proper prompt engineering helps to minimize errors and biases in the model’s output [8].
- Programming in Natural Language: Prompt engineering is an example of programming using natural language [15].
- Future Workplace Skill: Prompt engineering skills will be essential in the workplace, just like Microsoft Word and Excel skills are today [3, 10]. A person with the same skills and knowledge but who also knows how to use AI through prompt engineering will be more effective [16].
Tools for Prompt Engineering:
- Chat GPT: The user interface to interact with LLMs [16, 17].
- OpenAI Playground: An interface for interacting with the OpenAI API that allows for more control over the LLM settings [16, 18].
- Replit: An online integrated development environment (IDE) to run coding applications [19].
Key Concepts in Prompt Engineering:
- Tokens: The way LLMs understand and speak. Words are broken down into smaller pieces called tokens [20].
- Attention Mechanism: This allows the model to pay more attention to more context [21, 22].
- Transformer Architecture: An architecture that allows the model to pay attention to more context, enabling better long-range attention [22, 23].
- Parameters: The “lines” and “dots” that enable the model to recognize patterns. LLMs compress data through parameters and weights [24, 25].
- Base Model: A model resulting from the pre-training phase, which is not a chatbot but rather a model that completes words or tokens [25].
- Fine-Tuning: The process of taking the base model and giving it additional text information so it can generate more helpful and specific output [25, 26].
- System Message: A default prompt provided to the model by its creator that sets the stage for interactions by including instructions or specific context [27]. It is like a North Star, guiding the model’s behavior [27, 28].
- Context: The additional information provided to the LLM that helps it better understand the task and respond accurately [29].
- Token Limits: LLMs have token limits, which are the maximum amount of words they can remember at any given time. This also acts as a context window [30, 31].
- Recency Effect: The effect of information being more impactful when given towards the end [32, 33].
- Personas: Giving the model a persona or role can help it provide better, more accurate responses [34, 35]. Personas work because they provide additional context [35].
This summary should provide a clear overview of what prompt engineering is and its key components.

Large Language Models: An Overview

Large Language Models (LLMs) are a type of machine learning model focused on understanding and generating natural language text [1, 2]. They are characterized by being trained on vast amounts of text data and having numerous parameters [2]. LLMs are a subset of Natural Language Processing (NLP), which is a branch of Artificial Intelligence focused on enabling computers to understand text and spoken words the same way human beings do [1, 3].

Here’s a more detailed breakdown of key aspects of LLMs:
- Size and Training: The term “large” in LLMs refers to the fact that these models are trained on massive datasets, often consisting of text from the internet [2, 4]. These models also have a large number of parameters, which are the “lines” and “dots” that enable the model to recognize patterns [4, 5]. The more tokens and parameters, the more capable a model generally is [6].
- Parameters: Parameters are part of the model’s internal structure that determine how it processes information [5, 7]. They can be thought of as the “neurons” in the model’s neural network [7].
- Emergent Abilities: LLMs exhibit emergent abilities, meaning that as the models become larger, new capabilities arise that weren’t present in smaller models [8, 9]. These abilities aren’t explicitly programmed but emerge from the training process [8].
- Tokens: LLMs understand and process language using tokens, which are smaller pieces of words, rather than the words themselves [10]. Each token has a unique ID, and the model predicts the next token in a sequence [11].
- Training Process: The training of an LLM typically involves two main phases:
- Pre-training: The model is trained on a large corpus of text data to learn patterns and relationships within the text [7]. This results in a base model [12].
- Fine-tuning: The base model is further trained using a more specific dataset, often consisting of ideal questions and answers, to make it better at completing specific tasks or behaving like a helpful assistant [12, 13]. The fine tuning process adjusts the parameters and weights of the model, which also impacts the calculations within the model and creates emergent abilities [13].
- Transformer Architecture: LLMs utilize a transformer architecture, which allows the model to pay attention to a wider range of context, improving its ability to understand the relationships between words and phrases, including those separated by large distances [6, 14]. This architecture helps enable better long-range attention [14].
- Context Window: LLMs have a limited context window, meaning they can only remember a certain number of tokens (or words) at once [15]. The token limit acts as a context window [16]. The context window is constantly shifting, and when a new prompt is given, the older information can be shifted out of the window, meaning that the model may not have all of the prior conversation available at any given time [15, 16]. Performance is best when relevant information is at the beginning or end of the context window [17].
- Word Guessing: At their core, LLMs are essentially “word guessing machines”, determining the most statistically probable sequence of tokens to follow a given prompt, based on their training data [11, 18].
- Relationship to Chatbots: LLMs are often used as the underlying technology for chatbots. For example, the GPT models from OpenAI are used by the ChatGPT chatbot [2, 19]. A chatbot is essentially a user interface or “wrapper” that makes it easy for users to interact with a model [20]. The system message provides a default instruction to the model created by the creator of the model [21]. Custom instructions can also be added to change the model’s behavior [22].
- Task-Specific Models: Some models are fine-tuned for specific tasks. For example, GitHub Copilot uses the GPT model but has been further fine-tuned for code generation [19, 20].
- Limitations: LLMs can sometimes provide incorrect or biased information, and they can also struggle with math [23, 24]. These models can also hallucinate (make things up) [25, 26]. They may also learn that A=B but not that B=A, which is known as the “reversal curse” [27]. Also, the model may only remember information in the context window and can forget information from the beginning of a conversation [16].
In summary, LLMs are sophisticated models that process and generate language using statistical probabilities, trained on extensive datasets and incorporating architectures that allow for better context awareness, but are also limited by context windows, and other factors, and may produce errors or biased results..

AI Tools and Prompt Engineering

AI tools, particularly those powered by Large Language Models (LLMs), are becoming increasingly prevalent in various aspects of work and life [1-4]. These tools can be broadly categorized based on their underlying model and specific functions [5, 6].

Here’s a breakdown of key aspects regarding AI tools, drawing from the sources:
- LLMs as the Foundation: Many AI tools are built upon LLMs like GPT from OpenAI, Gemini from Google, Claude from Anthropic, and Llama from Meta [5-8]. These models provide the core ability to understand and generate natural language [5, 6].
- Chatbots as Interfaces:
- Chatbots like ChatGPT, Bing Chat, and Bard use LLMs as their base [5, 6]. They act as a user interface (a “wrapper”) that allows users to interact with the underlying LLM through natural language [5, 6].
- The user interface makes it easier to input prompts and receive outputs [6]. Without it, interaction with an LLM would require code [6].
- Chatbots also have a system message, which is a default prompt that is provided by the chatbot’s creator to set the stage for interactions and guides the model [9, 10].
- Custom instructions can also be added to chatbots to further change the model’s behavior [11].
- Task-Specific AI Tools:
- These tools are designed for specific applications, such as coding, writing, or other domain-specific tasks [6, 7].
- Examples include GitHub Copilot, Amazon CodeWhisperer (for coding), and Jasper AI and Copy AI (for writing) [6, 7].
- They often use a base model that has been fine-tuned for their specific purposes [6, 7]. For example, GitHub Copilot uses a modified version of OpenAI’s GPT model fine-tuned for code generation [7].
- Task-specific tools may also modify the system message or system prompt to further customize the model’s behavior [6, 12].
- Custom AI Tools: AI tools can also be customized to learn a specific subject, improve mental health, or complete a specific task [13].
- Multimodality: Some advanced AI tools, like ChatGPT, can handle multiple types of input and output [14]:
- Text : They can generate and understand text [14].
- Images: They can accept images and generate images [14-16].
- Internet: They can browse the internet to gather more current information [17].
- Code: They can execute code, specifically Python code [17].
- Prompt Engineering for AI Tools:
- Prompt engineering is the key to using AI tools effectively [13].
- It helps maximize the potential of AI tools, avoid errors and biases, and ensure the tools are used efficiently [13].
- The skill of prompt engineering involves crafting prompts that provide clear instructions to the AI tool, guiding it to produce the desired output [4, 13].
- It requires an understanding of how LLMs work, including concepts like tokens, context windows, and attention mechanisms [2, 12, 18, 19].
- Effective prompts involve more than simply asking a question; they involve understanding the task, the capabilities of the AI tool, and the science of prompt engineering [4].
- Using personas and a unique tone, style and voice with AI tools can make them more intuitive for humans to use, improve their accuracy, and help them to be on brand [20, 21].
- By setting up a tool with custom instructions, it’s possible to effectively give the tool a new “North Star” or behavior profile [11, 22].
- Importance of Training Data: The effectiveness of an AI tool depends on the data it has been trained on [23]. The training process involves both pre-training on a vast amount of text data and then fine-tuning on a specific dataset to enhance its capabilities [24, 25].
In summary, AI tools are diverse and powerful, with LLMs acting as their core technology. These tools range from general-purpose chatbots to task-specific applications. Prompt engineering is a critical skill for maximizing the effectiveness of these tools, allowing users to tailor their behavior and output through carefully crafted prompts [13]. Understanding how LLMs function, and having clear and specific instructions are key for success in using AI tools [4, 12].

Prompt Engineering: Principles and Best Practices

Prompt engineering involves the development and optimization of prompts to effectively use AI for specific tasks [1]. It is a skill that can be used by anyone and everyone, regardless of their job or technical background [2]. The goal of prompt engineering is to use AI to become more efficient and effective in work by understanding how Large Language Models (LLMs) function [2]. It is a multi-disciplinary branch of engineering focused on interacting with AI through the integration of fields such as software engineering, machine learning, cognitive science, business, philosophy, and computer science [3, 4].

Key principles of prompt engineering include:
- Understanding LLMs: It’s important to understand how LLMs work under the hood, including concepts like tokens, the transformer architecture, and the context window [2]. LLMs process language using tokens, which are smaller pieces of words [5]. They also use a transformer architecture, allowing them to pay attention to more context [6].
- Prompts as Instructions: A prompt is essentially the instructions and context given to LLMs to accomplish a task [1]. It’s like a seed that you plant in the LLM’s mind that grows into a result [1]. Prompts are like coding in natural language, requiring specific words and sequences to get the right result [3].
- Prompt Elements: A basic prompt consists of two elements, an input (the question or instruction) and an output (the LLM’s response) [3].
- Iterative Process: Prompt engineering is an iterative process of testing, trying things out, evaluating, and adjusting until the desired output is achieved [7].
- Standard Prompts: The most basic type of prompt is the standard prompt, which consists only of a question or instruction [8]. These are important because they are often the starting place for more complex prompts, and can be useful for gathering information from LLMs [9].
- Importance of Context: Providing the LLM with more information or context generally leads to a better and more accurate result [10]. Context includes instructions, background information, and any other relevant details. It helps the LLM understand the task and generate a more helpful response. More context means more words and tokens for the model to analyze, causing the attention mechanism to focus on relevant information and reducing the likelihood of errors [11]. However, providing too much context can also be detrimental, as LLMs have token limits [12, 13].
- Context Window: LLMs have a limited context window (also known as a token limit), which is the number of tokens (or words) the model can remember at once [12, 13]. Once that limit is reached, the model will forget information from the beginning of the conversation. Therefore, it is important to manage the context window to maintain the accuracy and coherence of the model’s output [12].
- Primacy and Recency Effects: Information placed at the beginning or end of a context window is more likely to be accurately recalled by the model, while information in the middle can get lost [14-16]. For this reason, place the most important context at the beginning of a prompt [16].
- Personas: Giving an LLM a persona or role can provide additional context to help it understand the task and provide a better response [17-19]. Personas help to prime the model to think in a certain way. Personas can be functional and fun [20, 21].
- Tone, Style, and Voice: A persona can also include a specific tone, style, and voice that are unique to the task, which can help produce more appropriate and nuanced outputs [21].
- Custom Instructions: Custom instructions are a way to give the model more specific information about what you want it to know or how you want it to respond [21]. This is similar to giving the model a sub system message.
In summary, prompt engineering is about understanding how LLMs work and applying that understanding to craft effective prompts that guide the model toward accurate, relevant, and helpful outputs. By paying attention to detail and incorporating best practices, users can achieve much more with LLMs and tailor them to meet their specific needs and preferences [22].

Mastering Prompt Engineering with LLMs

This course provides an in-depth look at prompt engineering and how to work with large language models (LLMs) [1]. The course emphasizes gaining practical, real-world skills to put you at the forefront of the AI world [1]. It aims to teach you how to use AI to become more efficient and effective in your work [2]. The course is taught by Scott Kerr, an AI enthusiast and practitioner [1].

Here’s an overview of the key components of the course:
- Focus on Practical Skills: The course focuses on teaching how to work with LLMs for specific use cases, rather than providing a library of pre-made prompts [2]. It emphasizes learning by doing, with numerous exercises and projects, including guided and unguided projects [1]. The projects include coding games and using autonomous agents, among other tasks [3].
- Understanding LLMs: A key part of the course involves diving deep into the mechanics of LLMs, understanding how they work under the hood, and using that knowledge when prompting them [2].
- This includes understanding how LLMs use tokens [4], how they use the transformer architecture [5], and the concept of a context window [6].
- The course also covers the training process of LLMs and the difference between base models and assistant models [7].
- Prompt Engineering Principles: The course teaches prompt engineering as a multi-disciplinary branch of engineering that requires integrating fields such as software engineering, machine learning, cognitive science, business, philosophy, and computer science [8]. The course provides a framework for creating complex prompts [9]
- Standard Prompts: The course starts with the most basic prompts, standard prompts, which are a single question or instruction [10].
- Importance of Context: The course teaches the importance of providing the LLM with more information or context, which includes providing relevant instructions and background information to get more accurate results [11].
- The course emphasizes placing key information at the beginning or end of the prompt for best results [12].
- Managing the Context Window: The course emphasizes the importance of managing the limited context window of the LLMs, to maintain accuracy and coherence [6].
- System Messages: The course discusses the importance of the system message, which acts as the “North Star” for the model, and it teaches users how to create their own system message for specific purposes [13].
- Personas: The course teaches the use of personas to give LLMs a specific role, tone, style and voice, to make them more useful for humans to use [14, 15].
- Applied Prompt Engineering: The course emphasizes using prompt engineering principles in real-world scenarios to make a difference in your work [16]. The course shows the difference in responses between a base model and an assistant model, using LM Studio, to emphasize the importance of applied prompt engineering [7].
- Multimodality: The course introduces the concept of multimodality and how models like Chat-GPT can understand and produce images as well as text, browse the internet, and execute python code [17-19].
- Tools and Set-Up: The course introduces different LLMs, including the GPT models by Open AI, which can be used through chat-GPT [20]. It also teaches how to use the Open AI playground to interact with the models [20, 21]. The course also emphasizes the importance of using the chat-GPT app to use on a daily basis [22].
- Emphasis on Empirical Research: The course is grounded in empirical research and peer-reviewed studies conducted by AI researchers [3].
- Up-to-Date Information: The course is designed to provide the most up-to-date information in a constantly changing field and is dedicated to continually evolving [23].
- Projects and Exercises: The course includes hands-on demos, exercises, and guided and unguided projects to develop practical skills [3]. These include coding games and using autonomous agents [1].
- Evaluation: The course introduces the concept of evaluating and testing prompts, because in order to be scientific, the accuracy and success of prompts needs to be measurable [24].
In summary, the course is structured to provide a blend of theoretical knowledge and practical application, aiming to equip you with the skills to effectively utilize LLMs in various contexts [1]. It emphasizes a deep understanding of how these models work and the best practices for prompt engineering, so that you can use them to your advantage.

Learn Prompt Engineering: Full Beginner Crash Course (5 HOURS!)

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 24, 2025
Harvard CS50’s Artificial Intelligence with Python – Full University Course
This source explains how AI can be used for problem-solving, moving from explicit instructions to learning from data. It introduces supervised learning, where AI learns to map inputs to outputs using labeled datasets, covering classification tasks and nearest neighbor algorithms. The source also discusses linear regression, support vector machines, and techniques like perceptron learning. It transitions to reinforcement learning, where AI learns through rewards and punishments in an environment, and touches on unsupervised learning with clustering techniques like k-means. Finally, the document explores neural networks, detailing their structure, training via gradient descent and backpropagation, and their applications in various AI problems.

Propositional Logic, Model Checking, and Beyond: A Comprehensive Study Guide

I. Review of Key Concepts
- Propositional Logic: A system for representing logical statements and reasoning about their truth values.
- Propositional Symbols: Variables representing simple statements that can be either true or false (e.g., P, Q, R).
- Logical Connectives: Symbols used to combine propositional symbols into more complex statements:
- and (∧): Both statements must be true for the combined statement to be true.
- or (∨): At least one statement must be true for the combined statement to be true.
- not (¬): Reverses the truth value of a statement.
- implies (→): If the first statement is true, then the second statement must also be true.
- biconditional (↔): Both statements have the same truth value (both true or both false).
- Knowledge Base (KB): A set of sentences representing facts known about the world.
- Query (α): A question about the world that we want to answer using the KB.
- Entailment (KB ⊨ α): The relationship between the KB and a query, meaning that the KB logically implies the query; whenever the KB is true, the query must also be true.
- Model: An assignment of truth values (true or false) to all propositional symbols in the language. Represents a possible world or state.
- Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
- Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
- Inference Rules: Logical equivalences used to manipulate and simplify logical expressions (e.g., implication elimination, De Morgan’s laws, distributive law).
- Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
- Completeness: An inference algorithm is complete if it can derive all conclusions that are entailed by the KB.
- Conjunctive Normal Form (CNF): A logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals.
- Clause: A disjunction of literals (e.g., P or not Q or R).
- Literal: A propositional symbol or its negation (e.g., P, not Q).
- Resolution: An inference rule that combines two clauses containing complementary literals to produce a new clause.
- Factoring: Removing duplicate literals within a clause.
- Empty Clause: The result of resolving two contradictory clauses, representing a contradiction (always false).
- Inference by Resolution: An algorithm for proving entailment by converting the KB and the negation of the query to CNF, and then repeatedly applying the resolution rule until the empty clause is derived.
- Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
- Inclusion-Exclusion Formula: A formula for calculating the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B).
- Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ P(A and B).
- Conditioning: Expressing the probability of A in terms of the conditional probability of A given B and the probability of B: P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B).
- Conditional Probability: The probability of event A occurring given that event B has already occurred, denoted P(A|B).
- Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
- Heuristic Function: An estimate of the “goodness” of a state (e.g., the distance to the goal).
- Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
- Hill Climbing: A local search algorithm that repeatedly moves to the neighbor with the highest value.
- Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
- Stochastic Hill Climbing: Chooses a neighbor randomly from the neighbors that are better than the current state.
- First Choice Hill Climbing: Chooses the first neighbor with a higher value and moves there.
- Random Restart Hill Climbing: Runs hill climbing multiple times with different initial states and returns the best result.
- Local Beam Search: Keeps track of k best states and expands all of them in each iteration.
- Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
- Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
- Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
- Delta E (ΔE): The difference in value (or cost) between the current state and a neighboring state.
- Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
- NP-Complete Problems: A class of problems for which no known polynomial-time algorithm exists.
- Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
- Objective Function: A mathematical expression to be minimized or maximized in linear programming.
- Constraints: Restrictions or limitations on the values of variables in linear programming.
- Constraint Satisfaction Problem (CSP): A problem where the goal is to find values for a set of variables that satisfy a set of constraints.
- Variables (in CSP): Entities with associated domains of possible values.
- Domains (in CSP): The set of possible values that can be assigned to a variable.
- Constraints (in CSP): Restrictions on the values that variables can take, specifying allowable combinations of values.
- Unary Constraint: A constraint involving only one variable.
- Binary Constraint: A constraint involving two variables.
- Node Consistency: Ensuring that all values in a variable’s domain satisfy the variable’s unary constraints.
- Arc Consistency: Ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables.
- AC3: A common algorithm for enforcing arc consistency.
- Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated.
- Minimum Remaining Values (MRV) Heuristic: A variable selection strategy that chooses the variable with the fewest remaining legal values.
- Degree Heuristic: A variable selection strategy that chooses the variable involved in the largest number of constraints on other unassigned variables.
- Least Constraining Value Heuristic: A value selection strategy that chooses the value that rules out the fewest choices for neighboring variables in the constraint graph.
- Supervised Machine Learning: A type of machine learning where an algorithm learns from labeled data to make predictions or classifications.
- Inputs (x): The features or attributes used by a machine learning model to make predictions.
- Outputs (y): The target variables or labels that a machine learning model is trained to predict.
- Hypothesis Function (h): A mathematical function that maps inputs to outputs.
- Weights (w): Parameters in a machine learning model that determine the importance of each input feature.
- Learning Rate (α): A parameter that controls the step size during training.
- Threshold Function: A function that outputs one value if the input is above a threshold and another value if the input is below the threshold.
- Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
- Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
- Dot Product: A mathematical operation that multiplies corresponding elements of two vectors and sums the results.
- Gradient Descent: An iterative optimization algorithm for finding the minimum of a function.
- Stochastic Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a single randomly chosen data point.
- Mini-Batch Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a small batch of data points.
- Neural Networks: A type of machine learning model inspired by the structure of the human brain, consisting of interconnected nodes (neurons) organized in layers.
- Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity.
- Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
- Natural Language Processing (NLP): The branch of AI that deals with the interaction between computers and human language.
- Syntax: The set of rules that govern the structure of sentences in a language.
- Semantics: The meaning of words, phrases, and sentences in a language.
- Formal Grammar: A set of rules for generating sentences in a language.
- Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side.
- Terminal Symbol: A symbol that represents a word in a language.
- Non-Terminal Symbol: A symbol that represents a phrase or category of words in a language.
- Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
- Noun Phrase: A phrase that functions as a noun.
- Verb Phrase: A phrase that functions as a verb.
- Natural Language Toolkit (NLTK): A Python library for NLP.
- Parsing: The process of analyzing a sentence according to the rules of a grammar.
- Syntax Tree: A hierarchical representation of the structure of a sentence.
- Statistical NLP: An approach to NLP that uses statistical models learned from data.
- n-gram: A contiguous sequence of n items from a sample of text.
- Markov Chain: A sequence of events where the probability of each event depends only on the previous event.
- Tokenization: The process of splitting a sequence of characters into pieces (tokens).
- Text Classification: The task of assigning a category label to a text.
- Sentiment Analysis: Determining the emotional tone or attitude expressed in a piece of text.
- Bag-of-Words Model: A text representation that represents a document as the counts of its words, disregarding grammar and word order.
- Term Frequency (TF): The number of times a term appears in a document.
- Inverse Document Frequency (IDF): A measure of how rare a term is across a collection of documents.
- TF-IDF: A weight used in information retrieval and text mining that reflects how important a word is to a document in a corpus.
- Stop Words: Common words that are often removed from text before processing.
- Word Embeddings: Vector representations of words that capture semantic relationships.
- One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
- Distributed Representation: A vector representation where the meaning of a word is distributed across multiple values.
- Word2Vec: A model for learning word embeddings.
II. Short Answer Quiz
1. Explain the difference between soundness and completeness in the context of inference algorithms. Soundness means that any conclusion drawn by the algorithm is actually entailed by the knowledge base. Completeness means that the algorithm is capable of deriving every conclusion that is entailed by the knowledge base.
2. Describe the process of converting a logical sentence into Conjunctive Normal Form (CNF). The process involves eliminating bi-conditionals and implications, moving negations inward using De Morgan’s laws, and using the distributive law to get a conjunction of clauses where each clause is a disjunction of literals.
3. What is the purpose of using the resolution inference rule in propositional logic? The resolution rule is used to derive new clauses from existing ones, aiming to ultimately derive the empty clause, which indicates a contradiction and proves entailment.
4. Explain the marginalization rule and provide a simple example. Marginalization calculates the probability of a variable by summing over all possible values of other variables. For example, if you want to know the probability that someone likes ice cream, you would take the probability of them liking ice cream and liking chocolate times the probability that they like chocolate.
5. What is the key idea behind local search algorithms? Local search algorithms start with an initial state and iteratively improve it by moving to neighboring states, based on some evaluation function, without necessarily keeping track of the path taken to reach the solution.
6. Describe how simulated annealing helps to avoid local optima. Simulated annealing accepts worse neighbors with a probability that decreases over time, allowing the algorithm to escape local optima early in the search and converge towards a global optimum later.
7. In linear programming, what are the roles of the objective function and constraints? The objective function is what we want to minimize or maximize, while constraints are limitations on the values of variables that must be satisfied.
8. What is the purpose of enforcing arc consistency in a constraint satisfaction problem (CSP)? Enforcing arc consistency reduces the domains of variables by removing values that cannot be part of any solution due to binary constraints, making the search for a solution more efficient.
9. Explain the difference between a one-hot representation and a distributed representation in NLP. A one-hot representation represents a word as a vector with a 1 in the corresponding index and 0s elsewhere, while a distributed representation distributes the meaning of a word across multiple values in a vector.
10. How do word embedding models like Word2Vec capture semantic relationships between words? Word2Vec captures semantic relationships by training a model to predict the context words surrounding a given word in a large corpus, resulting in vector representations where similar words are located close to each other in vector space.
III. Essay Questions
1. Compare and contrast model checking and inference by resolution as methods for determining entailment in propositional logic. Discuss the advantages and disadvantages of each approach.
2. Explain how local search algorithms can be applied to solve optimization problems. Discuss the challenges of local optima and describe techniques, such as simulated annealing, for overcoming these challenges.
3. Describe the general framework of a constraint satisfaction problem (CSP). Discuss the role of variable and value selection heuristics in improving the efficiency of backtracking search for solving CSPs.
4. Explain the process of training a machine learning model for sentiment analysis. Discuss the different text representation techniques, such as bag-of-words and TF-IDF, and the role of word embeddings.
5. Describe the key concepts in Natural Language Processing (NLP), including syntax and semantics. Discuss how NLP techniques are used to understand and generate natural language.
IV. Glossary of Key Terms
- Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity, enabling the network to learn complex patterns.
- Arc Consistency: A constraint satisfaction technique ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables based on the problem constraints.
- Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated, allowing the algorithm to systematically search the solution space.
- Bag-of-Words Model: A text representation in NLP that represents a document as the counts of its words, disregarding grammar and word order, which helps quantify the content of texts for analysis.
- Clause: In logic, it is the statement that combines different literals with “or” relationship.
- Complete: An inference algorithm that can derive all conclusions entailed by the KB.
- Conditioning: A probability rule that expresses the probability of one event in terms of its conditional probability, and this rule is used to find the probabilities that are unknown with the information given.
- Conjunctive Normal Form (CNF): A standardized logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals, simplifying logical deductions.
- Constraints: Limitation to the conditions of the variables in linear programing or constraint satisfaction problems.
- Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side, used to define the syntax of programming languages.
- Delta E (ΔE): The difference in value between the current state and its neighboring states.
- Distributed Representation: It describes the meaning of the representation of a word distributing over multiple values in vector which is the idea behind the word embedding technique.
- Domain: The set of possible values that can be assigned to a variable.
- Entailment (KB ⊨ α): KB logically implies that α; whenever KB is true, so does α, which is the relationship that is important when the machine needs to find if the conclusion is correct or not.
- Formal Grammar: A set of rules for generating sentences in a language, and those rules are applied in order to find what it is that is trying to be said in language analysis.
- Heuristic Function: It estimates the ‘goodness’ of a state (e.g., the distance to the goal), which will let machine learning models take efficient and near perfect results.
- Hill Climbing: This iterative optimization algorithm is characterized by continuously searching to find better solution while moving to a better neighbor and also have the highest value.
- Hypothesis Function (h): This function maps inputs to outputs and can be used to learn and predict.
- Inclusion-Exclusion Formula: Used to find the P(A or B), in which it finds the P(A), P(B), P(A and B), and finds P(A)+P(B)-P(A and B) in result.
- Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
- Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
- Knowledge Base (KB): A set of sentences representing facts known about the world.
- Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
- Learning Rate (α): It controls the step size during the machine learning algorithm.
- Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
- Literal: A propositional symbol or its negation (e.g., P, not Q) that describes the condition of a statement.
- Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
- Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
- Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
- Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ P(A and B).
- Markov Chain: A sequence of events where the probability of each event depends only on the previous event, allowing modeling of sequences over time.
- Model: An assignment of truth values (true or false) to all propositional symbols in the language that represents the state.
- Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
- n-gram: A contiguous sequence of n items from a sample of text that helps in analyzing languages and predicting text.
- Natural Language Processing (NLP): The field of AI that is related to the understanding of human language.
- Noun Phrase: A phrase that functions as a noun to use for language parsing.
- NP-Complete Problems: A class of problems for which no known polynomial-time algorithm exists.
- Objective Function: An mathematical function to be minimized or maximized in linear programming.
- One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
- Parsing: This process of taking a sentence and analyzing it according to grammar rules in NLP.
- Propositional Logic: A system for representing logical statements and reasoning about their truth values.
- Query (α): The question that we want to answer using the KB.
- Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
- Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
- Semantics: the meaning of words, phrases, and sentences in a language, which helps with extracting the insights and understanding of language.
- Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
- Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
- Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
- Statistical NLP: An approach to NLP that uses statistical models learned from data.
- Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
- Stop Words: Common words that are often removed from text before processing.
- Syntax: The set of rules that govern the structure of sentences in a language.
- Syntax Tree: A hierarchical representation of the structure of a sentence, used to know how a structure looks with a graphical approach.
- Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
- Tokenization: The process of splitting a sequence of characters into pieces (tokens), which allows for language parsing and to read for machines.
- Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
- Unary Constraint: A constraint involving only one variable.
- Verb Phrase: A phrase that functions as a verb to be analyzed in parsing.
- Weights (w): Parameters in a machine learning model that determine the importance of each input feature, letting it know the emphasis on each feature.
- Word Embeddings: Vector representations of words that capture semantic relationships.
- Word2Vec: A model for learning word embeddings by knowing what words mean, learning and classifying similar words.
AI: Reasoning, Search, NLP, and Learning Techniques

Here’s a briefing document summarizing the main themes and ideas from the provided sources.

Briefing Document: Artificial Intelligence – Reasoning, Search, and Natural Language Processing

Overview:

The sources cover several fundamental concepts in Artificial Intelligence (AI), including logical reasoning, search algorithms, probabilistic reasoning, and natural language processing (NLP). They explore techniques for representing knowledge, drawing inferences, solving problems through search, handling uncertainty, and enabling computers to understand and generate human language.

I. Logical Reasoning and Inference:
- Entailment and Inference Algorithms: The core idea is that AI systems should be able to determine if a knowledge base (KB) entails a query (alpha). This means: “Given some query about the world…the question we want to ask…is does KB, our knowledge base, entail alpha? In other words, using only the information we know inside of our knowledge base…can we conclude that this sentence alpha is true?”
- Model Checking: This is a basic inference algorithm. It involves enumerating all possible models (assignments of truth values to variables) and checking if, in every model where the knowledge base is true, the query (alpha) is also true. “If we wanted to determine if our knowledge base entails some query alpha, then we are going to enumerate all possible models…And if in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha.”
- Inference Rules: These are logical transformations used to derive new knowledge from existing knowledge. Examples include:
- Implication Elimination: alpha implies beta can be transformed into not alpha or beta. “This is a way to translate if-then statements into or statements… if I have the implication, alpha implies beta, that I can draw the conclusion that either not alpha or beta”
- Biconditional Elimination: a if and only if b becomes a implies b and b implies a.
- De Morgan’s Laws: These laws relate ANDs and ORs through negation. not (alpha and beta) is equivalent to not alpha or not beta. And not (alpha or beta) is equivalent to not alpha and not beta. “If it is not true that alpha and beta, well, then either not alpha or not beta… if you have a negation in front of an and expression, you move the negation inwards, so to speak…and then flip the and into an or.”
- Distributive Law: alpha and (beta or gamma) is equivalent to (alpha and beta) or (alpha and gamma).
- Conjunctive Normal Form (CNF): A standard form for logical sentences where it is represented as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals (propositional symbols or their negations). “A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses…a conjunction of clauses means it is an and of individual clauses, each of which has ors in it.”
- Resolution: An inference rule that applies to clauses in CNF. If you have P or Q and not P or R, you can resolve them to get Q or R. This involves dealing with factoring (removing duplicate literals) and the empty clause (representing a contradiction). “…if I have two clauses where there’s something that conflicts or something complementary between those two clauses, I can resolve them to get a new clause, to draw a new conclusion.”
- Inference by Resolution: To prove that a knowledge base entails a query (alpha), we assume not alpha and try to derive a contradiction (the empty clause) using resolution. “We want to prove that our knowledge base entails some query alpha…we’re going to try to prove that if we know the knowledge and not alpha, that that would be a contradiction…To determine if our knowledge base entails some query alpha, we’re going to convert knowledge base and not alpha to conjunctive normal form”
II. Search Algorithms:
- Search Problems: Defined by an initial state, actions, a transition model, a goal test, and a path cost function.
- Local Search: Algorithms that operate on a single current state and move to neighbors. They don’t care about the path to the solution.
- Hill Climbing: A simple local search algorithm that repeatedly moves to the neighbor with the highest value (or lowest cost). It suffers from problems with local maxima/minima. “Generally, what hill climbing is going to do is it’s going to consider the neighbors of that state…and pick the highest one I can…continually looking at all of my neighbors and picking the highest neighbor…until I get to a point…where I consider both of my neighbors and both of my neighbors have a lower value than I do.”
- Variations: Steepest ascent, stochastic, first choice, random restart, local beam search.
- Simulated Annealing: A local search algorithm that sometimes accepts worse moves to escape local optima. The probability of accepting a worse move depends on the “temperature” and the difference in cost (delta E). “whereas before, we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that is actually going to make our situation worse…And so how do we do that? How do we decide to sometimes accept some state that might actually be worse? Well, we’re going to accept a worse state with some probability.”
- Linear Programming: A family of problems where the goal is to minimize a cost function subject to linear constraints. “the goal of linear programming is to minimize a cost function…subject to particular constraints, subjects to equations that are of the form like this of some sequence of variables is less than a bound or is equal to some particular value”
III. Constraint Satisfaction Problems (CSPs):
- Definition: Problems defined by variables, domains (possible values for each variable), and constraints.
- Node Consistency: Ensuring that all values in a variable’s domain satisfy the unary constraints (constraints involving only that variable). “…we can pick any of these values in the domain. And there won’t be a unary constraint that is violated as a result of it.”
- Arc Consistency: Ensuring that all values in a variable’s domain satisfy the binary constraints (constraints involving two variables). “In order to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x’s domain to make sure that every choice for x, every choice in x’s domain, has a possible choice for y.”
- AC3: An algorithm for enforcing arc consistency across an entire CSP. It maintains a queue of arcs and revises domains to ensure consistency. “AC3 takes a constraint satisfaction problem. And it enforces our consistency across the entire problem…It’s going to basically maintain a queue or basically just a line of all of the arcs that it needs to make consistent.”
- Backtracking Search: A depth-first search algorithm for solving CSPs. It assigns values to variables one at a time, backtracking when a constraint is violated.
- Minimum Remaining Values (MRV): A heuristic for variable selection that chooses the variable with the fewest remaining legal values in its domain. “Select the variable that has the fewest legal values remaining in its domain…In the example of the classes and the exam slots, you would prefer to choose the class that can only meet on one possible day.”
- Degree Heuristic: A heuristic used to select what the best variable will be. “The general approach is that in cases of ties, where two or more of the classes each can only have one possible day of the exam left, we want to choose the one that is involved in the most constraints, the one that we expect to potentially have the bigger impact on the overall problem”
- Least Constraining Value: A heuristic for value selection that chooses the value that rules out the fewest choices for neighboring variables. “Loop over the values in the domain that we haven’t yet tried and pick the value that rules out the fewest values from the neighboring variables.”
IV. Probabilistic Reasoning:
- Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
- Inclusion-Exclusion Principle: Used to calculate the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B). Deals with the problem of overcounting when calculating probabilities.
- Marginalization: A rule used to calculate the probability of a variable by summing over all possible values of other variables. “I need to sum up not just over B and not B, but for all of the possible values that the other random variable could take on…I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj.”
- Conditioning: Similar to marginalization, but uses conditional probabilities instead of joint probabilities.
V. Supervised Learning:
- Hypothesis Function: A function that maps inputs to outputs. In supervised learning the input consists of a set of labeled data points, each with multiple features and one associated value, or ‘label’. The job of supervised learning is to ‘learn’ a model that correctly maps an input consisting of a data point with multiple features to a corresponding output.
- Weights: Parameters of the hypothesis function that determine the importance of different input features. “We’ll generally call that number a weight for how important should these variables be in trying to determine the answer.”
- Threshold Function: A function that outputs one category if the weighted sum of inputs is above a threshold and another category otherwise. “If we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say, no rain.”
- Logistic Regression: Uses a logistic function (sigmoid) instead of a hard threshold, allowing for probabilistic outputs between 0 and 1. “Instead of using this hard threshold type of function, we can use instead a logistic function…And as a result, the possible output values are no longer just 0 and 1…But you can actually get any real numbered value between 0 and 1.”
- Gradient Descent: An iterative optimization algorithm used to find the optimal weights for a model by repeatedly updating the weights in the direction of the negative gradient of the cost function. “And we can use gradient descent to train a neural network, that gradient descent is going to tell us how to adjust the weights to try and lower that overall cost on all the data points.”
- Stochastic Gradient Descent: Updates the weights based on a single randomly chosen data point at each iteration.
- Mini-Batch Gradient Descent: Updates the weights based on a small batch of data points at each iteration.
- Neural Networks: A network of interconnected nodes (neurons) organized in layers. Each connection has a weight. Neural networks take an input and ‘learn’ to modify the weight of each connection to accurately map an input to an output. A simple neural network consists of an input layer and an output layer, while more complex neural networks consist of several hidden layers between input and output. “we create a network of nodes…and if we want, we can connect all of these nodes together such that every node in the first layer is connected to every node in the second layer…And each of these edges has a weight associated with it.”
- Activation Function: A function applied to the output of each node in a neural network to introduce non-linearity. “You take the inputs, you multiply them by the weights, and then you typically are going to transform that value a little bit using what’s called an activation function.”
- Multi-Class Classification: A classification problem with more than two categories. Can be handled using neural networks with multiple output nodes, each representing the probability of belonging to a particular class.
VI. Natural Language Processing (NLP):
- Syntax: The structure of language.
- Semantics: The meaning of language. “While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means.”
- Formal Grammar: A system of rules for generating sentences in a language.
- Context-Free Grammar (CFG): A type of formal grammar that defines rules for rewriting non-terminal symbols into terminal symbols (words) or other non-terminal symbols. “a context-free grammar is some system of rules for generating sentences in a language…We’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language.”
- NLTK (Natural Language Toolkit): A Python library for NLP tasks.
- N-grams: Contiguous sequences of n items (characters or words) from a sample of text.
- Tokenization: The process of splitting a sequence of characters into pieces, such as words.
- Markov Chain: A sequence of values where one value can be predicted based on the preceding values. Can be used for language generation. “Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it…we can use that to predict what word might come next in a sequence of words.”
- Text Classification: The problem of assigning a category or label to a piece of text.
- Sentiment Analysis: A specific text classification task that involves determining the sentiment (positive, negative, neutral) of a piece of text.
- Bag of Words: A representation of text as a collection of words, disregarding grammar and word order, but keeping track of word frequencies. “With the bag of words representation, I’m just going to keep track of the count of every single word, which I’m going to call features.”
- TF-IDF (Term Frequency-Inverse Document Frequency): A weighting scheme that assigns higher weights to words that are frequent in a document but rare in the overall corpus.
- One-Hot Representation: A vector representation of a word where one element is 1 and all other elements are 0. “Each of these words now has a distinct vector representation. And this is what we often call a one-hot representation, a representation of the meaning of a word as a vector with a single 1 and all of the rest of the values are 0.”
- Distributed Representation: A vector representation of a word where the meaning is distributed across multiple values, ideally in such a way that similar words have similar vector representations.
- Word Embeddings: Distributed representations of words that capture semantic relationships.
- Word2Vec: A model for generating word embeddings based on the context in which words appear. “we’re going to define the meaning of a word based on the words that appear around it, the context words around it…we’re going to say is because the words breakfast and lunch and dinner appear in a similar context, that they must have a similar meaning.”
This briefing document provides a high-level overview of the concepts covered in the sources. It highlights key definitions, algorithms, and techniques used in AI.

NLP, ML, and Problem Solving: FAQ

Natural Language Processing, Machine Learning and Problem Solving: FAQ

1. What is the core concept of “entailment” in the context of knowledge bases and inference algorithms, and how does model checking help determine entailment?

Entailment refers to whether a knowledge base (KB) logically implies a query (alpha). In other words, can you conclude that alpha is true solely based on the information within the KB? Model checking is an algorithm that answers this by enumerating all possible models (assignments of true/false to propositional symbols). If, in every model where the KB is true, alpha is also true, then the KB entails alpha. Essentially, it exhaustively checks if alpha must be true whenever the KB is true.

2. Explain the model checking algorithm, including how it enumerates models and determines if a knowledge base entails a query.

The model checking algorithm involves the following steps:
1. Enumerate all possible models: List every possible combination of truth values (true or false) for all propositional symbols in the knowledge base and query.
2. Evaluate the knowledge base in each model: Determine if the knowledge base (KB) is true or false in each of the enumerated models.
3. Check the query in models where the KB is true: For every model where the KB is true, check if the query (alpha) is also true.
- Determine entailment:If alpha is true in every model where the KB is true, then the KB entails alpha.
- If there exists at least one model where the KB is true but alpha is false, then the KB does not entail alpha.
3. What are inference rules in propositional logic, and give examples of implication elimination, biconditional elimination, and De Morgan’s laws?

Inference rules are logical equivalences that allow you to transform logical sentences into different, but logically equivalent, forms. This is useful for drawing new conclusions from existing knowledge. Here are some examples:
- Implication Elimination: alpha implies beta is equivalent to not alpha or beta. This replaces an implication with an OR statement.
- Biconditional Elimination: alpha if and only if beta is equivalent to (alpha implies beta) and (beta implies alpha). This breaks down a biconditional into two implications.
- De Morgan’s Laws:not (alpha and beta) is equivalent to not alpha or not beta. The negation of a conjunction is the disjunction of the negations.
- not (alpha or beta) is equivalent to not alpha and not beta. The negation of a disjunction is the conjunction of the negations.
4. Describe the conjunctive normal form (CNF) and explain the steps to convert a logical formula into CNF.

Conjunctive Normal Form (CNF) is a standard logical format where a sentence is represented as a conjunction (AND) of clauses, and each clause is a disjunction (OR) of literals. A literal is either a propositional symbol or its negation. The steps to convert a formula to CNF are:
1. Eliminate Biconditionals: Replace all alpha <-> beta with (alpha -> beta) ^ (beta -> alpha).
2. Eliminate Implications: Replace all alpha -> beta with ~alpha v beta.
3. Move Negations Inwards: Use De Morgan’s laws to move negations inward, so they apply only to literals (e.g., ~ (alpha ^ beta) becomes ~alpha v ~beta).
4. Distribute ORs over ANDs: Use the distributive law to transform the expression into a conjunction of clauses (e.g., alpha v (beta ^ gamma) becomes (alpha v beta) ^ (alpha v gamma)).
5. Explain the resolution inference rule and the resolution algorithm for proving entailment. What is “inference by resolution,” and how does the empty clause relate to contradiction?

The resolution inference rule states that if you have two clauses, alpha OR beta and ~alpha OR gamma, you can infer beta OR gamma. It essentially eliminates a complementary pair of literals (alpha and ~alpha) and combines the remaining literals into a new clause. “Inference by resolution” uses this rule repeatedly to derive new clauses.

The resolution algorithm for proving entailment involves:
1. Negate the query: To prove KB entails alpha, assume ~alpha.
2. Convert to CNF: Convert KB AND ~alpha into CNF.
3. Resolution Loop: Repeatedly apply the resolution rule to pairs of clauses in the CNF formula. Add any new clauses generated back into the set of clauses. If factoring is needed, remove any duplicate literals in resulting clause.
4. Check for Empty Clause: If, at any point, you derive the “empty clause” (a clause with no literals, representing “false”), this means you’ve found a contradiction.
5. Determine Entailment: If you derive the empty clause, then KB entails alpha (because KB AND ~alpha leads to a contradiction, so it must be the case that if KB is true, then alpha must be true). If you can no longer derive new clauses and haven’t found the empty clause, then KB does not entail alpha.
The empty clause signifies a contradiction because it represents a situation where both P and NOT P are true, which is impossible. Finding the empty clause through resolution proves that the initial assumption (the negated query) was inconsistent with the knowledge base.

6. Explain the inclusion-exclusion principle and the marginalization rule in probability theory, providing examples of their application.
- Inclusion-Exclusion Principle: This principle calculates the probability of A OR B. The formula is: P(A or B) = P(A) + P(B) – P(A and B). It is used to correct for over counting when calculating P(A or B).
- Example: The probability of rolling a 6 on a red die (A) OR a 6 on a blue die (B). If you just add P(A) + P(B), you’re double-counting the case where both dice show 6. Subtracting P(A and B) (the probability of both dice showing 6) corrects for this.
- Marginalization Rule: This rule allows you to calculate the probability of one variable (A) by summing over all possible values of another variable (B). The formula is: P(A) = Σ P(A and B).
- Example: Probability of it being cloudy (A), given the joint distribution of cloudiness and raininess (B). We calculate P(cloudy) by summing P(cloudy and rainy) + P(cloudy and not rainy). We consider all possible cases that take place, and then look at the probability that the probability of A happens in each of the cases. This is useful for finding an individual (unconditional) probability from a joint probability distribution.
7. Describe the hill climbing algorithm, including its pseudocode, and discuss its limitations (local optima). Also explain variations like stochastic hill climbing and random restart hill climbing.

The hill climbing algorithm is a local search technique used to find a maximum (or minimum) of a function. Its pseudocode is as follows:
1. Start with a current state (often random).
2. Loop: a. Find the neighbor of the current state with the highest (or lowest) value. b. If the neighbor is better than the current state, move to the neighbor ( current = neighbor). c. If the neighbor is not better, terminate and return the current state.
A major limitation of hill climbing is that it can get stuck in local optima: points that are better than their immediate neighbors but not the best overall solution.

Variations:
- Stochastic Hill Climbing: Randomly choose a neighbor with a better value, rather than always picking the best neighbor. This can help escape plateaus (areas of the search space with relatively equal value), but not always a local optimum.
- Random Restart Hill Climbing: Run the hill climbing algorithm multiple times from different random starting states. Keep track of the best solution found across all runs. This increases the chance of finding the global optimum by exploring different regions of the search space.
8. Explain the simulated annealing algorithm and how it can potentially escape local optima compared to simple hill climbing.

Simulated Annealing is a metaheuristic optimization algorithm that can be used for finding the global minimum of a function that may possess several local minima. Simulated Annealing works by first randomly picking a state. Then the algorithm calculates the cost of the state and then makes a neighbor of the state to calculate that cost as well. If the neighbor cost is better, than the new current state becomes the new neighbor. However, simulated annealing adds a twist. Even if the neighbor cost is not better than the current state, you still have a probability of setting the current state to the new worse neighbor to try and dislodge yourself.

This probability is based on a temperature. At the beginning, the temperature is high so there is a better probability to dislodge yourself and explore the search space even if it may lead to worse results at first. As the algorithm iterates, the temperature starts to go down, so it slowly starts to look for better neighbors instead of just exploring and dislodging.

Simulated Annealing is thus better than simple hill climbing because simple hill climbing never goes to a state that may lead to worse results, so as a result gets stuck in local optima as described in the hill climbing algorithm, which SA doesn’t suffer from.

Supervised Learning: Classification, Regression, and Evaluation

Supervised learning is a type of machine learning where a computer is given access to a dataset of input-output pairs and learns a function that maps inputs to outputs. The computer uses the data to train its model and understand the relationships between inputs and outputs. The goal is for the AI to learn to predict outputs based on new input data.

Key aspects of supervised learning:
- Input-output pairs: The computer is provided with a dataset where each data point consists of an input and a corresponding desired output.
- Function mapping: The goal is to find a function that accurately maps inputs to outputs, allowing the computer to make predictions on new, unseen data.
- Training: The computer uses the provided data to train its model, adjusting its internal parameters to minimize the difference between its predictions and the actual outputs.
Classification and regression are two common tasks within supervised learning.
- Classification: Aims to map inputs into discrete categories. An example would be classifying a banknote as authentic or counterfeit based on its features.
- Regression: Aims to predict continuous output values. For example, predicting sales based on advertising spending.
Implementation and evaluation
- Libraries such as Scikit-learn in Python provide tools to implement supervised learning algorithms.
- The data is typically split into training and testing sets. The model is trained on the training set and evaluated on the testing set to assess its ability to generalize to new data.
- Holdout cross-validation splits the data into training and testing sets. The training set trains the machine learning model. The testing set tests how well the machine learning model performs.
- K-fold cross-validation divides data into k different sets and runs k different experiments.
Machine Learning: Algorithms, Techniques, and Applications

Machine learning involves enabling computers to learn from data and experiences without explicit instructions. Instead of programming a computer with explicit rules, machine learning allows the computer to learn patterns from data and improve its performance on a specific task.

Key aspects of machine learning:
- Learning from Data: Machine learning algorithms use data to identify patterns, make predictions, and improve decision-making.
- Algorithms and Techniques: Machine learning encompasses a wide range of algorithms and techniques that enable computers to learn from data.
- Pattern Recognition: Machine learning algorithms identify underlying patterns and relationships within data.
Machine learning comes in different forms, including supervised learning, reinforcement learning and unsupervised learning.
- Supervised learning involves training a model on a labeled dataset consisting of input-output pairs, enabling the model to learn a function that maps inputs to outputs.
- Reinforcement learning involves training an agent to make decisions in an environment to maximize a reward signal.
- Unsupervised learning involves discovering patterns and relationships in unlabeled data without explicit guidance. Clustering is a task preformed in unsupervised learning that involves organizing a set of objects into distinct clusters or groups of similar objects.
Neural networks are a popular tool in machine learning inspired by the structure of the human brain and can be very effective at certain tasks. A neural network is a mathematical model for learning inspired by biological neural networks. Artificial neural networks can model mathematical functions and learn network parameters.

TensorFlow is a library that can be used for creating neural networks, modeling them, and running them on sample data.

Machine learning has a wide variety of applications including: recognizing faces in photos, playing games, understanding human language, spam detection, search and optimization problems, and more.

Neural Networks: Models, Training, and Applications

Neural networks are a popular tool in modern machine learning that draw inspiration from the way human brains learn and reason. They are a type of model that is effective at learning from some set of input data to figure out how to calculate some function from inputs to outputs.

Key aspects of neural networks:
- Mathematical Model: A neural network is a mathematical model for learning inspired by biological neural networks.
- Units: Instead of biological neurons, neural networks use units inside of the network. The units can be represented like nodes in a graph.
- Layers: Neural networks are composed of multiple layers of interconnected nodes or units, including an input layer, one or more hidden layers, and an output layer.
- Weights: Connections between units are defined by weights. The weights determine how signals are passed between connected nodes.
- Activation Functions: Activation functions introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data.
- Backpropagation: Backpropagation is a key algorithm that makes training multi-layered neural networks possible. The backpropagation algorithm is used to adjust the weights in the network during training to minimize the difference between predicted and actual outputs.
- Versatility: Neural networks are versatile tools applicable to a number of domains.
There are different types of neural networks, each designed for specific tasks:
- Feed-forward neural networks have connections that only move in one direction. The inputs pass through hidden layers and ultimately produce an output.
- Convolutional neural networks (CNNs) are designed for processing grid-like data, such as images. CNNs apply convolutional layers and pooling layers to extract features from images.
- Recurrent neural networks (RNNs) are designed for processing sequential data, such as text or time series. RNNs have connections that loop back into themselves, allowing them to maintain a hidden state that captures information about the sequence. Long short-term memory (LSTM) neural network is a popular type of RNN.
Training Neural Networks:
- Gradient descent is a technique used to train neural networks by minimizing a loss function. Gradient descent involves iteratively adjusting the weights of the network based on the gradient of the loss function with respect to the weights.
- Stochastic gradient descent randomly chooses one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points.
- Mini-batch gradient descent divides the data set up into small batches, groups of data points, to calculate the gradient based on.
- Overfitting occurs when a neural network is too complex and fits the training data too closely, resulting in poor generalization to new data.
- Dropout is a technique used to combat overfitting by randomly removing units from the neural network during training.
TensorFlow is a library that can be used for creating neural networks, modeling them, and running them on sample data.

Understanding Gradient Descent in Neural Networks

Gradient descent is an algorithm inspired by calculus for minimizing loss when training a neural network. In the context of neural networks, “loss” refers to how poorly a hypothesis function models data.

Key aspects of gradient descent:
- Loss Function: Gradient descent aims to minimize a loss function, which quantifies how poorly the neural network performs.
- Gradient Calculation: The algorithm calculates the gradient of the loss function with respect to the network’s weights. The gradient indicates the direction in which the weights should be adjusted to reduce the loss.
- Weight Update: The weights are updated by taking a small step in the direction opposite to the gradient. The size of this step can vary and is chosen when training the neural network.
- Iterative Process: This process is repeated iteratively, adjusting the weights little by little based on the data points, with the aim of converging towards a good solution.
There are variations to the standard gradient descent algorithm:
- Stochastic Gradient Descent: Instead of looking at all data points at once, stochastic gradient descent randomly chooses one data point at a time to calculate the gradient. This provides a less accurate gradient estimate but is faster to compute.
- Mini-Batch Gradient Descent: This approach is a middle ground between standard and stochastic gradient descent, where the data set is divided into small batches and the gradient is calculated based on these batches.
Understanding Neural Network Hidden Layers

Hidden layers are intermediate layers of artificial neurons or units within a neural network between the input layer and the output layer.

Here’s more about hidden layers and how they contribute to neural network functionality:
- Structure and Function In a neural network, the input layer receives the initial data, and the output layer produces the final result. The hidden layers lie in between, performing complex transformations on the input data to help the network learn non-linear relationships.
- Nodes and Connections Each hidden layer contains a certain number of nodes or units, each connected to the nodes in the preceding and following layers. The connections between nodes have weights, which are adjusted during training to optimize the network’s performance.
- Activation Each unit calculates its output value based on a linear combination of all the inputs. The advantage of layering like this gives an ability to model more complex functions.
Backpropagation: One of the challenges of neural networks is training neural networks that have hidden layers inside of them. The input data provides values for all of the inputs, and what the value of the output should be. However, the input data does not provide what the values for all of the nodes in the hidden layer should be. The key algorithm that makes training the hidden layers of neural networks possible is called backpropagation.

Deep Neural Networks: Neural networks that contain multiple hidden layers are called deep neural networks. The presence of multiple hidden layers allows the network to model more complex functions. Each layer can learn different features of the input, and these features can be combined to produce the desired output. However, complex networks are at greater risk of overfitting.

Dropout: Dropout is a technique that can combat overfitting in neural networks. It involves temporarily removing units from the network during training to prevent over-reliance on any single node.

Harvard CS50’s Artificial Intelligence with Python – Full University Course

The Original Text

This course from Harvard University explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. You’ll gain exposure to the theory behind graph search algorithms, classification, optimization, reinforcement learning, and other topics in artificial intelligence and machine learning. Brian Yu teaches this course. Hello, world. This is CS50, and this is an introduction to artificial intelligence with Python with CS50’s own Brian Yu. This course picks up where CS50 itself leaves off and explores the concepts and algorithms at the foundation of modern AI. We’ll start with a look at how AI can search for solutions to problems, whether those problems are learning how to play a game or trying to find driving directions to a destination. We’ll then look at how AI can represent information, both knowledge that our AI is certain about, but also information and events about which our AI might be uncertain, learning how to represent that information, but more importantly, how to use that information to draw inferences and new conclusions as well. We’ll explore how AI can solve various types of optimization problems, trying to maximize profits or minimize costs or satisfy some other constraints before turning our attention to the fast-growing field of machine learning, where we won’t tell our AI exactly how to solve a problem, but instead, give our AI access to data and experiences so that our AI can learn on its own how to perform these tasks. In particular, we’ll look at neural networks, one of the most popular tools in modern machine learning, inspired by the way that human brains learn and reason as well before finally taking a look at the world of natural language processing so that it’s not just us humans learning to learn how artificial intelligence is able to speak, but also AI learning how to understand and interpret human language as well. We’ll explore these ideas and algorithms, and along the way, give you the opportunity to build your own AI programs to implement all of this and more. This is CS50. All right. Welcome, everyone, to an introduction to artificial intelligence with Python. My name is Brian Yu, and in this class, we’ll explore some of the ideas and techniques and algorithms that are at the foundation of artificial intelligence. Now, artificial intelligence covers a wide variety of types of techniques. Anytime you see a computer do something that appears to be intelligent or rational in some way, like recognizing someone’s face in a photo, or being able to play a game better than people can, or being able to understand human language when we talk to our phones and they understand what we mean and are able to respond back to us, these are all examples of AI, or artificial intelligence. And in this class, we’ll explore some of the ideas that make that AI possible. So we’ll begin our conversations with search, the problem of we have an AI, and we would like the AI to be able to search for solutions to some kind of problem, no matter what that problem might be. Whether it’s trying to get driving directions from point A to point B, or trying to figure out how to play a game, given a tic-tac-toe game, for example, figuring out what move it ought to make. After that, we’ll take a look at knowledge. Ideally, we want our AI to be able to know information, to be able to represent that information, and more importantly, to be able to draw inferences from that information, to be able to use the information it knows and draw additional conclusions. So we’ll talk about how AI can be programmed in order to do just that. Then we’ll explore the topic of uncertainty, talking about ideas of what happens if a computer isn’t sure about a fact, but maybe is only sure with a certain probability. So we’ll talk about some of the ideas behind probability, and how computers can begin to deal with uncertain events in order to be a little bit more intelligent in that sense as well. After that, we’ll turn our attention to optimization, problems of when the computer is trying to optimize for some sort of goal, especially in a situation where there might be multiple ways that a computer might solve a problem, but we’re looking for a better way, or potentially the best way, if that’s at all possible. Then we’ll take a look at machine learning, or learning more generally, and looking at how, when we have access to data, our computers can be programmed to be quite intelligent by learning from data and learning from experience, being able to perform a task better and better based on greater access to data. So your email, for example, where your email inbox somehow knows which of your emails are good emails and which of your emails are spam. These are all examples of computers being able to learn from past experiences and past data. We’ll take a look, too, at how computers are able to draw inspiration from human intelligence, looking at the structure of the human brain, and how neural networks can be a computer analog to that sort of idea, and how, by taking advantage of a certain type of structure of a computer program, we can write neural networks that are able to perform tasks very, very effectively. And then finally, we’ll turn our attention to language, not programming languages, but human languages that we speak every day. And taking a look at the challenges that come about as a computer tries to understand natural language, and how it is some of the natural language processing that occurs in modern artificial intelligence can actually work. But today, we’ll begin our conversation with search, this problem of trying to figure out what to do when we have some sort of situation that the computer is in, some sort of environment that an agent is in, so to speak, and we would like for that agent to be able to somehow look for a solution to that problem. Now, these problems can come in any number of different types of formats. One example, for instance, might be something like this classic 15 puzzle with the sliding tiles that you might have seen. Where you’re trying to slide the tiles in order to make sure that all the numbers line up in order. This is an example of what you might call a search problem. The 15 puzzle begins in an initially mixed up state, and we need some way of finding moves to make in order to return the puzzle to its solved state. But there are similar problems that you can frame in other ways. Trying to find your way through a maze, for example, is another example of a search problem. You begin in one place, you have some goal of where you’re trying to get to, and you need to figure out the correct sequence of actions that will take you from that initial state to the goal. And while this is a little bit abstract, any time we talk about maze solving in this class, you can translate it to something a little more real world. Something like driving directions. If you ever wonder how Google Maps is able to figure out what is the best way for you to get from point A to point B, and what turns to make at what time, depending on traffic, for example, it’s often some sort of search algorithm. You have an AI that is trying to get from an initial position to some sort of goal by taking some sequence of actions. So we’ll start our conversations today by thinking about these types of search problems and what goes in to solving a search problem like this in order for an AI to be able to find a good solution. In order to do so, though, we’re going to need to introduce a little bit of terminology, some of which I’ve already used. But the first term we’ll need to think about is an agent. An agent is just some entity that perceives its environment. It somehow is able to perceive the things around it and act on that environment in some way. So in the case of the driving directions, your agent might be some representation of a car that is trying to figure out what actions to take in order to arrive at a destination. In the case of the 15 puzzle with the sliding tiles, the agent might be the AI or the person that is trying to solve that puzzle to try and figure out what tiles to move in order to get to that solution. Next, we introduce the idea of a state. A state is just some configuration of the agent in its environment. So in the 15 puzzle, for example, any state might be any one of these three, for example. A state is just some configuration of the tiles. And each of these states is different and is going to require a slightly different solution. A different sequence of actions will be needed in each one of these in order to get from this initial state to the goal, which is where we’re trying to get. So the initial state, then, what is that? The initial state is just the state where the agent begins. It is one such state where we’re going to start from. And this is going to be the starting point for our search algorithm, so to speak. We’re going to begin with this initial state and then start to reason about it, to think about what actions might we apply to that initial state in order to figure out how to get from the beginning to the end, from the initial position to whatever our goal happens to be. And how do we make our way from that initial position to the goal? Well, ultimately, it’s via taking actions. Actions are just choices that we can make in any given state. And in AI, we’re always going to try to formalize these ideas a little bit more precisely, such that we could program them a little bit more mathematically, so to speak. So this will be a recurring theme. And we can more precisely define actions as a function. We’re going to effectively define a function called actions that takes an input, s, where s is going to be some state that exists inside of our environment. And actions of s is going to take the state as input and return as output the set of all actions that can be executed in that state. And so it’s possible that some actions are only valid in certain states and not in other states. And we’ll see examples of that soon, too. So in the case of the 15 puzzle, for example, there are generally going to be four possible actions that we can do most of the time. We can slide a tile to the right, slide a tile to the left, slide a tile up, or slide a tile down, for example. And those are going to be the actions that are available to us. So somehow our AI, our program, needs some encoding of the state, which is often going to be in some numerical format, and some encoding of these actions. But it also needs some encoding of the relationship between these things. How do the states and actions relate to one another? And in order to do that, we’ll introduce to our AI a transition model, which will be a description of what state we get after we perform some available action in some other state. And again, we can be a little bit more precise about this, define this transition model a little bit more formally, again, as a function. The function is going to be a function called result that this time takes two inputs. Input number one is s, some state. And input number two is a, some action. And the output of this function result is it is going to give us the state that we get after we perform action a in state s. So let’s take a look at an example to see more precisely what this actually means. Here is an example of a state, of the 15 puzzle, for example. And here is an example of an action, sliding a tile to the right. What happens if we pass these as inputs to the result function? Again, the result function takes this board, this state, as its first input. And it takes an action as a second input. And of course, here, I’m describing things visually so that you can see visually what the state is and what the action is. In a computer, you might represent one of these actions as just some number that represents the action. Or if you’re familiar with enums that allow you to enumerate multiple possibilities, it might be something like that. And this state might just be represented as an array or two-dimensional array of all of these numbers that exist. But here, we’re going to show it visually just so you can see it. But when we take this state and this action, pass it into the result function, the output is a new state. The state we get after we take a tile and slide it to the right, and this is the state we get as a result. If we had a different action and a different state, for example, and pass that into the result function, we’d get a different answer altogether. So the result function needs to take care of figuring out how to take a state and take an action and get what results. And this is going to be our transition model that describes how it is that states and actions are related to each other. If we take this transition model and think about it more generally and across the entire problem, we can form what we might call a state space. The set of all of the states we can get from the initial state via any sequence of actions, by taking 0 or 1 or 2 or more actions in addition to that, so we could draw a diagram that looks something like this, where every state is represented here by a game board, and there are arrows that connect every state to every other state we can get to from that state. And the state space is much larger than what you see just here. This is just a sample of what the state space might actually look like. And in general, across many search problems, whether they’re this particular 15 puzzle or driving directions or something else, the state space is going to look something like this. We have individual states and arrows that are connecting them. And oftentimes, just for simplicity, we’ll simplify our representation of this entire thing as a graph, some sequence of nodes and edges that connect nodes. But you can think of this more abstract representation as the exact same idea. Each of these little circles or nodes is going to represent one of the states inside of our problem. And the arrows here represent the actions that we can take in any particular state, taking us from one particular state to another state, for example. All right. So now we have this idea of nodes that are representing these states, actions that can take us from one state to another, and a transition model that defines what happens after we take a particular action. So the next step we need to figure out is how we know when the AI is done solving the problem. The AI needs some way to know when it gets to the goal that it’s found the goal. So the next thing we’ll need to encode into our artificial intelligence is a goal test, some way to determine whether a given state is a goal state. In the case of something like driving directions, it might be pretty easy. If you’re in a state that corresponds to whatever the user typed in as their intended destination, well, then you know you’re in a goal state. In the 15 puzzle, it might be checking the numbers to make sure they’re all in ascending order. But the AI needs some way to encode whether or not any state they happen to be in is a goal. And some problems might have one goal, like a maze where you have one initial position and one ending position, and that’s the goal. In other more complex problems, you might imagine that there are multiple possible goals. That there are multiple ways to solve a problem, and we might not care which one the computer finds, as long as it does find a particular goal. However, sometimes the computer doesn’t just care about finding a goal, but finding a goal well, or one with a low cost. And it’s for that reason that the last piece of terminology that we’ll use to define these search problems is something called a path cost. You might imagine that in the case of driving directions, it would be pretty annoying if I said I wanted directions from point A to point B, and the route that Google Maps gave me was a long route with lots of detours that were unnecessary that took longer than it should have for me to get to that destination. And it’s for that reason that when we’re formulating search problems, we’ll often give every path some sort of numerical cost, some number telling us how expensive it is to take this particular option, and then tell our AI that instead of just finding a solution, some way of getting from the initial state to the goal, we’d really like to find one that minimizes this path cost. That is, less expensive, or takes less time, or minimizes some other numerical value. We can represent this graphically if we take a look at this graph again, and imagine that each of these arrows, each of these actions that we can take from one state to another state, has some sort of number associated with it. That number being the path cost of this particular action, where some of the costs for any particular action might be more expensive than the cost for some other action, for example. Although this will only happen in some sorts of problems. In other problems, we can simplify the diagram and just assume that the cost of any particular action is the same. And this is probably the case in something like the 15 puzzle, for example, where it doesn’t really make a difference whether I’m moving right or moving left. The only thing that matters is the total number of steps that I have to take to get from point A to point B. And each of those steps is of equal cost. We can just assume it’s of some constant cost like one. And so this now forms the basis for what we might consider to be a search problem. A search problem has some sort of initial state, some place where we begin, some sort of action that we can take or multiple actions that we can take in any given state. And it has a transition model. Some way of defining what happens when we go from one state and take one action, what state do we end up with as a result. In addition to that, we need some goal test to know whether or not we’ve reached a goal. And then we need a path cost function that tells us for any particular path, by following some sequence of actions, how expensive is that path. What does its cost in terms of money or time or some other resource that we are trying to minimize our usage of. And the goal ultimately is to find a solution. Where a solution in this case is just some sequence of actions that will take us from the initial state to the goal state. And ideally, we’d like to find not just any solution but the optimal solution, which is a solution that has the lowest path cost among all of the possible solutions. And in some cases, there might be multiple optimal solutions. But an optimal solution just means that there is no way that we could have done better in terms of finding that solution. So now we’ve defined the problem. And now we need to begin to figure out how it is that we’re going to solve this kind of search problem. And in order to do so, you’ll probably imagine that our computer is going to need to represent a whole bunch of data about this particular problem. We need to represent data about where we are in the problem. And we might need to be considering multiple different options at once. And oftentimes, when we’re trying to package a whole bunch of data related to a state together, we’ll do so using a data structure that we’re going to call a node. A node is a data structure that is just going to keep track of a variety of different values. And specifically, in the case of a search problem, it’s going to keep track of these four values in particular. Every node is going to keep track of a state, the state we’re currently on. And every node is also going to keep track of a parent. A parent being the state before us or the node that we used in order to get to this current state. And this is going to be relevant because eventually, once we reach the goal node, once we get to the end, we want to know what sequence of actions we use in order to get to that goal. And the way we’ll know that is by looking at these parents to keep track of what led us to the goal and what led us to that state and what led us to the state before that, so on and so forth, backtracking our way to the beginning so that we know the entire sequence of actions we needed in order to get from the beginning to the end. The node is also going to keep track of what action we took in order to get from the parent to the current state. And the node is also going to keep track of a path cost. In other words, it’s going to keep track of the number that represents how long it took to get from the initial state to the state that we currently happen to be at. And we’ll see why this is relevant as we start to talk about some of the optimizations that we can make in terms of these search problems more generally. So this is the data structure that we’re going to use in order to solve the problem. And now let’s talk about the approach. How might we actually begin to solve the problem? Well, as you might imagine, what we’re going to do is we’re going to start at one particular state, and we’re just going to explore from there. The intuition is that from a given state, we have multiple options that we could take, and we’re going to explore those options. And once we explore those options, we’ll find that more options than that are going to make themselves available. And we’re going to consider all of the available options to be stored inside of a single data structure that we’ll call the frontier. The frontier is going to represent all of the things that we could explore next that we haven’t yet explored or visited. So in our approach, we’re going to begin the search algorithm by starting with a frontier that just contains one state. The frontier is going to contain the initial state, because at the beginning, that’s the only state we know about. That is the only state that exists. And then our search algorithm is effectively going to follow a loop. We’re going to repeat some process again and again and again. The first thing we’re going to do is if the frontier is empty, then there’s no solution. And we can report that there is no way to get to the goal. And that’s certainly possible. There are certain types of problems that an AI might try to explore and realize that there is no way to solve that problem. And that’s useful information for humans to know as well. So if ever the frontier is empty, that means there’s nothing left to explore. And we haven’t yet found a solution, so there is no solution. There’s nothing left to explore. Otherwise, what we’ll do is we’ll remove a node from the frontier. So right now at the beginning, the frontier just contains one node representing the initial state. But over time, the frontier might grow. It might contain multiple states. And so here, we’re just going to remove a single node from that frontier. If that node happens to be a goal, then we found a solution. So we remove a node from the frontier and ask ourselves, is this the goal? And we do that by applying the goal test that we talked about earlier, asking if we’re at the destination. Or asking if all the numbers of the 15 puzzle happen to be in order. So if the node contains the goal, we found a solution. Great. We’re done. And otherwise, what we’ll need to do is we’ll need to expand the node. And this is a term of art in artificial intelligence. To expand the node just means to look at all of the neighbors of that node. In other words, consider all of the possible actions that I could take from the state that this node is representing and what nodes could I get to from there. We’re going to take all of those nodes, the next nodes that I can get to from this current one I’m looking at, and add those to the frontier. And then we’ll repeat this process. So at a very high level, the idea is we start with a frontier that contains the initial state. And we’re constantly removing a node from the frontier, looking at where we can get to next and adding those nodes to the frontier, repeating this process over and over until either we remove a node from the frontier and it contains a goal, meaning we’ve solved the problem, or we run into a situation where the frontier is empty, at which point we’re left with no solution. So let’s actually try and take the pseudocode, put it into practice by taking a look at an example of a sample search problem. So right here, I have a sample graph. A is connected to B via this action. B is connected to nodes C and D. C is connected to E. D is connected to F. And what I’d like to do is have my AI find a path from A to E. We want to get from this initial state to this goal state. So how are we going to do that? Well, we’re going to start with a frontier that contains the initial state. This is going to represent our frontier. So our frontier initially will just contain A, that initial state where we’re going to begin. And now we’ll repeat this process. If the frontier is empty, no solution. That’s not a problem, because the frontier is not empty. So we’ll remove a node from the frontier as the one to consider next. There’s only one node in the frontier. So we’ll go ahead and remove it from the frontier. But now A, this initial node, this is the node we’re currently considering. We follow the next step. We ask ourselves, is this node the goal? No, it’s not. A is not the goal. E is the goal. So we don’t return the solution. So instead, we go to this last step, expand the node, and add the resulting nodes to the frontier. What does that mean? Well, it means take this state A and consider where we could get to next. And after A, what we could get to next is only B. So that’s what we get when we expand A. We find B. And we add B to the frontier. And now B is in the frontier. And we repeat the process again. We say, all right, the frontier is not empty. So let’s remove B from the frontier. B is now the node that we’re considering. We ask ourselves, is B the goal? No, it’s not. So we go ahead and expand B and add its resulting nodes to the frontier. What happens when we expand B? In other words, what nodes can we get to from B? Well, we can get to C and D. So we’ll go ahead and add C and D from the frontier. And now we have two nodes in the frontier, C and D. And we repeat the process again. We remove a node from the frontier. For now, I’ll do so arbitrarily just by picking C. We’ll see why later, how choosing which node you remove from the frontier is actually quite an important part of the algorithm. But for now, I’ll arbitrarily remove C, say it’s not the goal. So we’ll add E, the next one, to the frontier. Then let’s say I remove E from the frontier. And now I check I’m currently looking at state E. Is it a goal state? It is, because I’m trying to find a path from A to E. So I would return the goal. And that now would be the solution, that I’m now able to return the solution. And I have found a path from A to E. So this is the general idea, the general approach of this search algorithm, to follow these steps, constantly removing nodes from the frontier, until we’re able to find a solution. So the next question you might reasonably ask is, what could go wrong here? What are the potential problems with an approach like this? And here’s one example of a problem that could arise from this sort of approach. Imagine this same graph, same as before, with one change. The change being now, instead of just an arrow from A to B, we also have an arrow from B to A, meaning we can go in both directions. And this is true in something like the 15 puzzle, where when I slide a tile to the right, I could then slide a tile to the left to get back to the original position. I could go back and forth between A and B. And that’s what these double arrows symbolize, the idea that from one state, I can get to another, and then I can get back. And that’s true in many search problems. What’s going to happen if I try to apply the same approach now? Well, I’ll begin with A, same as before. And I’ll remove A from the frontier. And then I’ll consider where I can get to from A. And after A, the only place I can get to is B. So B goes into the frontier. Then I’ll say, all right, let’s take a look at B. That’s the only thing left in the frontier. Where can I get to from B? Before, it was just C and D. But now, because of that reverse arrow, I can get to A or C or D. So all three, A, C, and D, all of those now go into the frontier. They are places I can get to from B. And now I remove one from the frontier. And maybe I’m unlucky, and maybe I pick A. And now I’m looking at A again. And I consider, where can I get to from A? And from A, well, I can get to B. And now we start to see the problem. But if I’m not careful, I go from A to B, and then back to A, and then to B again. And I could be going in this infinite loop, where I never make any progress, because I’m constantly just going back and forth between two states that I’ve already seen. So what is the solution to this? We need some way to deal with this problem. And the way that we can deal with this problem is by somehow keeping track of what we’ve already explored. And the logic is going to be, well, if we’ve already explored the state, there’s no reason to go back to it. Once we’ve explored a state, don’t go back to it. Don’t bother adding it to the frontier. There’s no need to. So here’s going to be our revised approach, a better way to approach this sort of search problem. And it’s going to look very similar, just with a couple of modifications. We’ll start with a frontier that contains the initial state, same as before. But now we’ll start with another data structure, which will just be a set of nodes that we’ve already explored. So what are the states we’ve explored? Initially, it’s empty. We have an empty explored set. And now we repeat. If the frontier is empty, no solution, same as before. We remove a node from the frontier. We check to see if it’s a goal state, return the solution. None of this is any different so far. But now what we’re going to do is we’re going to add the node to the explored state. So if it happens to be the case that we remove a node from the frontier and it’s not the goal, we’ll add it to the explored set so that we know we’ve already explored it. We don’t need to go back to it again if it happens to come up later. And then the final step, we expand the node and we add the resulting nodes to the frontier. But before, we just always added the resulting nodes to the frontier. We’re going to be a little clever about it this time. We’re only going to add the nodes to the frontier if they aren’t already in the frontier and if they aren’t already in the explored set. So we’ll check both the frontier and the explored set, make sure that the node isn’t already in one of those two. And so long as it isn’t, then we’ll go ahead and add it to the frontier, but not otherwise. And so that revised approach is ultimately what’s going to help make sure that we don’t go back and forth between two nodes. Now, the one point that I’ve kind of glossed over here so far is this step here, removing a node from the frontier. Before, I just chose arbitrarily. Like, let’s just remove a node and that’s it. But it turns out it’s actually quite important how we decide to structure our frontier, how we add and how we remove our nodes. The frontier is a data structure and we need to make a choice about in what order are we going to be removing elements. And one of the simplest data structures for adding and removing elements is something called a stack. And a stack is a data structure that is a last in, first out data type, which means the last thing that I add to the frontier is going to be the first thing that I remove from the frontier. So the most recent thing to go into the stack or the frontier in this case is going to be the node that I explore. So let’s see what happens if I apply this stack-based approach to something like this problem, finding a path from A to E. What’s going to happen? Well, again, we’ll start with A and we’ll say, all right, let’s go ahead and look at A first. And then notice this time, we’ve added A to the explored set. A is something we’ve now explored. We have this data structure that’s keeping track. We then say from A, we can get to B. And all right, from B, what can we do? Well, from B, we can explore B and get to both C and D. So we added C and then D. So now, when we explore a node, we’re going to treat the frontier as a stack, last in, first out. D was the last one to come in. So we’ll go ahead and explore that next and say, all right, where can we get to from D? Well, we can get to F. And so all right, we’ll put F into the frontier. And now, because the frontier is a stack, F is the most recent thing that’s gone in the stack. So F is what we’ll explore next. We’ll explore F and say, all right, where can we get to from F? Well, we can’t get anywhere, so nothing gets added to the frontier. So now, what was the new most recent thing added to the frontier? Well, it’s now C, the only thing left in the frontier. We’ll explore that from which we can see, all right, from C, we can get to E. So E goes into the frontier. And then we say, all right, let’s look at E. And E is now the solution. And now, we’ve solved the problem. So when we treat the frontier like a stack, a last in, first out data structure, that’s the result we get. We go from A to B to D to F. And then we sort of backed up and went down to C and then E. And it’s important to get a visual sense for how this algorithm is working. We went very deep in this search tree, so to speak, all the way until the bottom where we hit a dead end. And then we effectively backed up and explored this other route that we didn’t try before. And it’s this going very deep in the search tree idea, this way the algorithm ends up working when we use a stack that we call this version of the algorithm depth first search. Depth first search is the search algorithm where we always explore the deepest node in the frontier. We keep going deeper and deeper through our search tree. And then if we hit a dead end, we back up and we try something else instead. But depth first search is just one of the possible search options that we could use. It turns out that there’s another algorithm called breadth first search, which behaves very similarly to depth first search with one difference. Instead of always exploring the deepest node in the search tree, the way the depth first search does, breadth first search is always going to explore the shallowest node in the frontier. So what does that mean? Well, it means that instead of using a stack which depth first search or DFS used, where the most recent item added to the frontier is the one we’ll explore next, in breadth first search or BFS, we’ll instead use a queue, where a queue is a first in first out data type, where the very first thing we add to the frontier is the first one we’ll explore and they effectively form a line or a queue, where the earlier you arrive in the frontier, the earlier you get explored. So what would that mean for the same exact problem, finding a path from A to E? Well, we start with A, same as before, then we’ll go ahead and have explored A and say, where can we get to from A? Well, from A, we can get to B, same as before. From B, same as before, we can get to C and D. So C and D get added to the frontier. This time, though, we added C to the frontier before D. So we’ll explore C first. So C gets explored. And from C, where can we get to? Well, we can get to E. So E gets added to the frontier. But because D was explored before E, we’ll look at D next. So we’ll explore D and say, where can we get to from D? We can get to F. And only then will we say, all right, now we can get to E. And so what breadth first search or BFS did is we started here, we looked at both C and D, and then we looked at E. Effectively, we’re looking at things one away from the initial state, then two away from the initial state, and only then, things that are three away from the initial state, unlike depth first search, which just went as deep as possible into the search tree until it hit a dead end and then ultimately had to back up. So these now are two different search algorithms that we could apply in order to try and solve a problem. And let’s take a look at how these would actually work in practice with something like maze solving, for example. So here’s an example of a maze. These empty cells represent places where our agent can move. These darkened gray cells represent walls that the agent can’t pass through. And ultimately, our agent, our AI, is going to try to find a way to get from position A to position B via some sequence of actions, where those actions are left, right, up, and down. What will depth first search do in this case? Well, depth first search will just follow one path. If it reaches a fork in the road where it has multiple different options, depth first search is just, in this case, going to choose one. That doesn’t a real preference. But it’s going to keep following one until it hits a dead end. And when it hits a dead end, depth first search effectively goes back to the last decision point and tries the other path, fully exhausting this entire path. And when it realizes that, OK, the goal is not here, then it turns its attention to this path. It goes as deep as possible. When it hits a dead end, it backs up and then tries this other path, keeps going as deep as possible down one particular path. And when it realizes that that’s a dead end, then it’ll back up, and then ultimately find its way to the goal. And maybe you got lucky, and maybe you made a different choice earlier on. But ultimately, this is how depth first search is going to work. It’s going to keep following until it hits a dead end. And when it hits a dead end, it backs up and looks for a different solution. And so one thing you might reasonably ask is, is this algorithm always going to work? Will it always actually find a way to get from the initial state? To the goal. And it turns out that as long as our maze is finite, as long as there are only finitely many spaces where we can travel, then, yes, depth first search is going to find a solution. Because eventually, it’ll just explore everything. If the maze happens to be infinite and there’s an infinite state space, which does exist in certain types of problems, then it’s a slightly different story. But as long as our maze has finitely many squares, we’re going to find a solution. The next question, though, that we want to ask is, is it going to be a good solution? Is it the optimal solution that we can find? And the answer there is not necessarily. And let’s take a look at an example of that. In this maze, for example, we’re again trying to find our way from A to B. And you notice here there are multiple possible solutions. We could go this way or we could go up in order to make our way from A to B. Now, if we’re lucky, depth first search will choose this way and get to B. But there’s no reason necessarily why depth first search would choose between going up or going to the right. It’s sort of an arbitrary decision point because both are going to be added to the frontier. And ultimately, if we get unlucky, depth first search might choose to explore this path first because it’s just a random choice at this point. It’ll explore, explore, explore. And it’ll eventually find the goal, this particular path, when in actuality there was a better path. There was a more optimal solution that used fewer steps, assuming we’re measuring the cost of a solution based on the number of steps that we need to take. So depth first search, if we’re unlucky, might end up not finding the best solution when a better solution is available. So that’s DFS, depth first search. How does BFS, or breadth first search, compare? How would it work in this particular situation? Well, the algorithm is going to look very different visually in terms of how BFS explores. Because BFS looks at shallower nodes first, the idea is going to be, BFS will first look at all of the nodes that are one away from the initial state. Look here and look here, for example, just at the two nodes that are immediately next to this initial state. Then it’ll explore nodes that are two away, looking at this state and that state, for example. Then it’ll explore nodes that are three away, this state and that state. Whereas depth first search just picked one path and kept following it, breadth first search, on the other hand, is taking the option of exploring all of the possible paths as kind of at the same time bouncing back between them, looking deeper and deeper at each one, but making sure to explore the shallower ones or the ones that are closer to the initial state earlier. So we’ll keep following this pattern, looking at things that are four away, looking at things that are five away, looking at things that are six away, until eventually we make our way to the goal. And in this case, it’s true we had to explore some states that ultimately didn’t lead us anywhere, but the path that we found to the goal was the optimal path. This is the shortest way that we could get to the goal. And so what might happen then in a larger maze? Well, let’s take a look at something like this and how breadth first search is going to behave. Well, breadth first search, again, we’ll just keep following the states until it receives a decision point. It could go either left or right. And while DFS just picked one and kept following that until it hit a dead end, BFS, on the other hand, will explore both. It’ll say look at this node, then this node, and it’ll look at this node, then that node. So on and so forth. And when it hits a decision point here, rather than pick one left or two right and explore that path, it’ll again explore both, alternating between them, going deeper and deeper. We’ll explore here, and then maybe here and here, and then keep going. Explore here and slowly make our way, you can visually see, further and further out. Once we get to this decision point, we’ll explore both up and down until ultimately we make our way to the goal. And what you’ll notice is, yes, breadth first search did find our way from A to B by following this particular path, but it needed to explore a lot of states in order to do so. And so we see some trade offs here between DFS and BFS, that in DFS, there may be some cases where there is some memory savings as compared to a breadth first approach, where breadth first search in this case had to explore a lot of states. But maybe that won’t always be the case. So now let’s actually turn our attention to some code and look at the code that we could actually write in order to implement something like depth first search or breadth first search in the context of solving a maze, for example. So I’ll go ahead and go into my terminal. And what I have here inside of maze.py is an implementation of this same idea of maze solving. I’ve defined a class called node that in this case is keeping track of the state, the parent, in other words, the state before the state, and the action. In this case, we’re not keeping track of the path cost because we can calculate the cost of the path at the end after we found our way from the initial state to the goal. In addition to this, I’ve defined a class called a stack frontier. And if unfamiliar with a class, a class is a way for me to define a way to generate objects in Python. It refers to an idea of object oriented programming, where the idea here is that I would like to create an object that is able to store all of my frontier data. And I would like to have functions, otherwise known as methods, on that object that I can use to manipulate the object. And so what’s going on here, if unfamiliar with the syntax, is I have a function that initially creates a frontier that I’m going to represent using a list. And initially, my frontier is represented by the empty list. There’s nothing in my frontier to begin with. I have an add function that adds something to the frontier as by appending it to the end of the list. I have a function that checks if the frontier contains a particular state. I have an empty function that checks if the frontier is empty. If the frontier is empty, that just means the length of the frontier is 0. And then I have a function for removing something from the frontier. I can’t remove something from the frontier if the frontier is empty, so I check for that first. But otherwise, if the frontier isn’t empty, recall that I’m implementing this frontier as a stack, a last in first out data structure, which means the last thing I add to the frontier, in other words, the last thing in the list, is the item that I should remove from this frontier. So what you’ll see here is I have removed the last item of a list. And if you index into a Python list with negative 1, that gets you the last item in the list. Since 0 is the first item, negative 1 kind of wraps around and gets you to the last item in the list. So we give that the node. We call that node. We update the frontier here on line 28 to say, go ahead and remove that node that you just removed from the frontier. And then we return the node as a result. So this class here effectively implements the idea of a frontier. It gives me a way to add something to a frontier and a way to remove something from the frontier as a stack. I’ve also, just for good measure, implemented an alternative version of the same thing called a queue frontier, which in parentheses you’ll see here, it inherits from a stack frontier, meaning it’s going to do all the same things that the stack frontier did, except the way we remove a node from the frontier is going to be slightly different. Instead of removing from the end of the list the way we would in a stack, we’re instead going to remove from the beginning of the list. Self.frontier 0 will get me the first node in the frontier, the first one that was added, and that is going to be the one that we return in the case of a queue. Then under here, I have a definition of a class called maze. This is going to handle the process of taking a sequence, a maze-like text file, and figuring out how to solve it. So it will take as input a text file that looks something like this, for example, where we see hash marks that are here representing walls, and I have the character A representing the starting position and the character B representing the ending position. And you can take a look at the code for parsing this text file right now. That’s the less interesting part. The more interesting part is this solve function here, the solve function is going to figure out how to actually get from point A to point B. And here we see an implementation of the exact same idea we saw from a moment ago. We’re going to keep track of how many states we’ve explored, just so we can report that data later. But I start with a node that represents just the start state. And I start with a frontier that, in this case, is a stack frontier. And given that I’m treating my frontier as a stack, you might imagine that the algorithm I’m using here is now depth-first search, because depth-first search, or DFS, uses a stack as its data structure. And initially, this frontier is just going to contain the start state. We initialize an explored set that initially is empty. There’s nothing we’ve explored so far. And now here’s our loop, that notion of repeating something again and again. First, we check if the frontier is empty by calling that empty function that we saw the implementation of a moment ago. And if the frontier is indeed empty, we’ll go ahead and raise an exception, or a Python error, to say, sorry, there is no solution to this problem. Otherwise, we’ll go ahead and remove a node from the frontier as by calling frontier.remove and update the number of states we’ve explored, because now we’ve explored one additional state. So we say self.numexplored plus equals 1, adding 1 to the number of states we’ve explored. Once we remove a node from the frontier, recall that the next step is to see whether or not it’s the goal, the goal test. And in the case of the maze, the goal is pretty easy. I check to see whether the state of the node is equal to the goal. Initially, when I set up the maze, I set up this value called goal, which is a property of the maze, so I can just check to see if the node is actually the goal. And if it is the goal, then what I want to do is backtrack my way towards figuring out what actions I took in order to get to this goal. And how do I do that? We’ll recall that every node stores its parent, the node that came before it that we used to get to this node, and also the action used in order to get there. So I can create this loop where I’m constantly just looking at the parent of every node and keeping track for all of the parents what action I took to get from the parent to this current node. So this loop is going to keep repeating this process of looking through all of the parent nodes until we get back to the initial state, which has no parent, where node.parent is going to be equal to none. As I do so, I’m going to be building up the list of all of the actions that I’m following and the list of all the cells that are part of the solution. But I’ll reverse them because when I build it up, going from the goal back to the initial state and building the sequence of actions from the goal to the initial state, but I want to reverse them in order to get the sequence of actions from the initial state to the goal. And that is ultimately going to be the solution. So all of that happens if the current state is equal to the goal. And otherwise, if it’s not the goal, well, then I’ll go ahead and add this state to the explored set to say, I’ve explored this state now. No need to go back to it if I come across it in the future. And then this logic here implements the idea of adding neighbors to the frontier. I’m saying, look at all of my neighbors, and I implemented a function called neighbors that you can take a look at. And for each of those neighbors, I’m going to check, is the state already in the frontier? Is the state already in the explored set? And if it’s not in either of those, then I’ll go ahead and add this new child node, this new node, to the frontier. So there’s a fair amount of syntax here, but the key here is not to understand all the nuances of the syntax. So feel free to take a closer look at this file on your own to get a sense for how it is working. But the key is to see how this is an implementation of the same pseudocode, the same idea that we were describing a moment ago on the screen when we were looking at the steps that we might follow in order to solve this kind of search problem. So now let’s actually see this in action. I’ll go ahead and run maze.py on maze1.txt, for example. And what we’ll see is here, we have a printout of what the maze initially looked like. And then here down below is after we’ve solved it. We had to explore 11 states in order to do it, and we found a path from A to B. And in this program, I just happened to generate a graphical representation of this as well. So I can open up maze.png, which is generated by this program, that shows you where in the darker color here are the walls, red is the initial state, green is the goal, and yellow is the path that was followed. We found a path from the initial state to the goal. But now let’s take a look at a more sophisticated maze to see what might happen instead. Let’s look now at maze2.txt. We’re now here. We have a much larger maze. Again, we’re trying to find our way from point A to point B. But now you’ll imagine that depth-first search might not be so lucky. It might not get the goal on the first try. It might have to follow one path, then backtrack and explore something else a little bit later. So let’s try this. We’ll run python maze.py of maze2.txt, this time trying on this other maze. And now, depth-first search is able to find a solution. Here, as indicated by the stars, is a way to get from A to B. And we can represent this visually by opening up this maze. Here’s what that maze looks like, and highlighted in yellow is the path that was found from the initial state to the goal. But how many states did we have to explore before we found that path? Well, recall that in my program, I was keeping track of the number of states that we’ve explored so far. And so I can go back to the terminal and see that, all right, in order to solve this problem, we had to explore 399 different states. And in fact, if I make one small modification of the program and tell the program at the end when we output this image, I added an argument called show explored. And if I set show explored equal to true and rerun this program, python maze.py, running it on maze2, and then I open the maze, what you’ll see here is highlighted in red are all of the states that had to be explored to get from the initial state to the goal. Depth-first search, or DFS, didn’t find its way to the goal right away. It made a choice to first explore this direction. And when it explored this direction, it had to follow every conceivable path all the way to the very end, even this long and winding one, in order to realize that, you know what? That’s a dead end. And instead, the program needed to backtrack. After going this direction, it must have gone this direction. It got lucky here by just not choosing this path, but it got unlucky here, exploring this direction, exploring a bunch of states it didn’t need to, and then likewise exploring all of this top part of the graph when it probably didn’t need to do that either. So all in all, depth-first search here really not performing optimally, or probably exploring more states than it needs to. It finds an optimal solution, the best path to the goal, but the number of states needed to explore in order to do so, the number of steps I had to take, that was much higher. So let’s compare. How would breadth-first search, or BFS, do on this exact same maze instead? And in order to do so, it’s a very easy change. The algorithm for DFS and BFS is identical with the exception of what data structure we use to represent the frontier, that in DFS, I used a stack frontier, last in, first out, whereas in BFS, I’m going to use a queue frontier, first in, first out, where the first thing I add to the frontier is the first thing that I remove. So I’ll go back to the terminal, rerun this program on the same maze, and now you’ll see that the number of states we had to explore was only 77 as compared to almost 400 when we used depth-first search. And we can see exactly why. We can see what happened if we open up maze.png now and take a look. Again, yellow highlight is the solution that breadth-first search found, which incidentally is the same solution that depth-first search found. They’re both finding the best solution. But notice all the white unexplored cells. There was much fewer states that needed to be explored in order to make our way to the goal because breadth-first search operates a little more shallowly. It’s exploring things that are close to the initial state without exploring things that are further away. So if the goal is not too far away, then breadth-first search can actually behave quite effectively on a maze that looks a little something like this. Now, in this case, both BFS and DFS ended up finding the same solution, but that won’t always be the case. And in fact, let’s take a look at one more example. For instance, maze3.txt. In maze3.txt, notice that here there are multiple ways that you could get from A to B. It’s a relatively small maze, but let’s look at what happens. If I use, and I’ll go ahead and turn off show explored so we just see the solution. If I use BFS, breadth-first search, to solve maze3.txt, well, then we find a solution, and if I open up the maze, here is the solution that we found. It is the optimal one. With just four steps, we can get from the initial state to what the goal happens to be. But what happens if we tried to use depth-first search or DFS instead? Well, again, I’ll go back up to my Q frontier, where Q frontier means that we’re using breadth-first search, and I’ll change it to a stack frontier, which means that now we’ll be using depth-first search. I’ll rerun pythonmaze.py, and now you’ll see that we find the solution, but it is not the optimal solution. This instead is what our algorithm finds, and maybe depth-first search would have found the solution. It’s possible, but it’s not guaranteed that if we just happen to be unlucky, if we choose this state instead of that state, then depth-first search might find a longer route to get from the initial state to the goal. So we do see some trade-offs here, where depth-first search might not find the optimal solution. So at that point, it seems like breadth-first search is pretty good. Is that the best we can do, where it’s going to find us the optimal solution, and we don’t have to worry about situations where we might end up finding a longer path to the solution than what actually exists? Where the goal is far away from the initial state, and we might have to take lots of steps in order to get from the initial state to the goal, what ended up happening is that this algorithm, BFS, ended up exploring basically the entire graph, having to go through the entire maze in order to find its way from the initial state to the goal state. What we’d ultimately like is for our algorithm to be a little bit more intelligent. And now what would it mean for our algorithm to be a little bit more intelligent in this case? Well, let’s look back to where breadth-first search might have been able to make a different decision and consider human intuition in this process as well. What might a human do when solving this maze that is different than what BFS ultimately chose to do? Well, the very first decision point that BFS made was right here, when it made five steps and ended up in a position where it had a fork in the row. It could either go left or it could go right. In these initial couple steps, there was no choice. There was only one action that could be taken from each of those states. And so the search algorithm did the only thing that any search algorithm could do, which is keep following that state after the next state. But this decision point is where things get a little bit interesting. Depth-first search, that very first search algorithm we looked at, chose to say, let’s pick one path and exhaust that path. See if anything that way has the goal. And if not, then let’s try the other way. Depth-first search took the alternative approach of saying, you know what, let’s explore things that are shallow, close to us first. Look left and right, then back left and back right, so on and so forth, alternating between our options in the hopes of finding something nearby. But ultimately, what might a human do if confronted with a situation like this of go left or go right? Well, a human might visually see that, all right, I’m trying to get to state b, which is way up there, and going right just feels like it’s closer to the goal. It feels like going right should be better than going left because I’m making progress towards getting to that goal. Now, of course, there are a couple of assumptions that I’m making here. I’m making the assumption that we can represent this grid as like a two-dimensional grid where I know the coordinates of everything. I know that a is in coordinate 0, 0, and b is in some other coordinate pair, and I know what coordinate I’m at now. So I can calculate that, yeah, going this way, that is closer to the goal. And that might be a reasonable assumption for some types of search problems, but maybe not in others. But for now, we’ll go ahead and assume that, that I know what my current coordinate pair is, and I know the coordinate, x, y, of the goal that I’m trying to get to. And in this situation, I’d like an algorithm that is a little bit more intelligent, that somehow knows that I should be making progress towards the goal, and this is probably the way to do that because in a maze, moving in the coordinate direction of the goal is usually, though not always, a good thing. And so here we draw a distinction between two different types of search algorithms, uninformed search and informed search. Uninformed search algorithms are algorithms like DFS and BFS, the two algorithms that we just looked at, which are search strategies that don’t use any problem-specific knowledge to be able to solve the problem. DFS and BFS didn’t really care about the structure of the maze or anything about the way that a maze is in order to solve the problem. They just look at the actions available and choose from those actions, and it doesn’t matter whether it’s a maze or some other problem, the solution or the way that it tries to solve the problem is really fundamentally going to be the same. What we’re going to take a look at now is an improvement upon uninformed search. We’re going to take a look at informed search. Informed search are going to be search strategies that use knowledge specific to the problem to be able to better find a solution. And in the case of a maze, this problem-specific knowledge is something like if I’m in a square that is geographically closer to the goal, that is better than being in a square that is geographically further away. And this is something we can only know by thinking about this problem and reasoning about what knowledge might be helpful for our AI agent to know a little something about. There are a number of different types of informed search. Specifically, first, we’re going to look at a particular type of search algorithm called greedy best-first search. Greedy best-first search, often abbreviated G-BFS, is a search algorithm that instead of expanding the deepest node like DFS or the shallowest node like BFS, this algorithm is always going to expand the node that it thinks is closest to the goal. Now, the search algorithm isn’t going to know for sure whether it is the closest thing to the goal. Because if we knew what was closest to the goal all the time, then we would already have a solution. The knowledge of what is close to the goal, we could just follow those steps in order to get from the initial position to the solution. But if we don’t know the solution, meaning we don’t know exactly what’s closest to the goal, instead we can use an estimate of what’s closest to the goal, otherwise known as a heuristic, just some way of estimating whether or not we’re close to the goal. And we’ll do so using a heuristic function conventionally called h of n that takes a status input and returns our estimate of how close we are to the goal. So what might this heuristic function actually look like in the case of a maze solving algorithm? Where we’re trying to solve a maze, what does the heuristic look like? Well, the heuristic needs to answer a question between these two cells, C and D, which one is better? Which one would I rather be in if I’m trying to find my way to the goal? Well, any human could probably look at this and tell you, you know what, D looks like it’s better. Even if the maze is convoluted and you haven’t thought about all the walls, D is probably better. And why is D better? Well, because if you ignore the wall, so let’s just pretend the walls don’t exist for a moment and relax the problem, so to speak, D, just in terms of coordinate pairs, is closer to this goal. It’s fewer steps that I wouldn’t take to get to the goal as compared to C, even if you ignore the walls. If you just know the xy-coordinate of C and the xy-coordinate of the goal, and likewise you know the xy-coordinate of D, you can calculate the D just geographically. Ignoring the walls looks like it’s better. And so this is the heuristic function that we’re going to use. And it’s something called the Manhattan distance, one specific type of heuristic, where the heuristic is how many squares vertically and horizontally and then left to right, so not allowing myself to go diagonally, just either up or right or left or down. How many steps do I need to take to get from each of these cells to the goal? Well, as it turns out, D is much closer. There are fewer steps. It only needs to take six steps in order to get to that goal. Again, here, ignoring the walls. We’ve relaxed the problem a little bit. We’re just concerned with if you do the math to subtract the x values from each other and the y values from each other, what is our estimate of how far we are away? We can estimate the D is closer to the goal than C is. And so now we have an approach. We have a way of picking which node to remove from the frontier. And at each stage in our algorithm, we’re going to remove a node from the frontier. We’re going to explore the node if it has the smallest value for this heuristic function, if it has the smallest Manhattan distance to the goal. And so what would this actually look like? Well, let me first label this graph, label this maze, with a number representing the value of this heuristic function, the value of the Manhattan distance from any of these cells. So from this cell, for example, we’re one away from the goal. From this cell, we’re two away from the goal, three away, four away. Here, we’re five away because we have to go one to the right and then four up. From somewhere like here, the Manhattan distance is two. We’re only two squares away from the goal geographically, even though in practice, we’re going to have to take a longer path. But we don’t know that yet. The heuristic is just some easy way to estimate how far we are away from the goal. And maybe our heuristic is overly optimistic. It thinks that, yeah, we’re only two steps away. When in practice, when you consider the walls, it might be more steps. So the important thing here is that the heuristic isn’t a guarantee of how many steps it’s going to take. It is estimating. It’s an attempt at trying to approximate. And it does seem generally the case that the squares that look closer to the goal have smaller values for the heuristic function than squares that are further away. So now, using greedy best-first search, what might this algorithm actually do? Well, again, for these first five steps, there’s not much of a choice. We start at this initial state a, and we say, all right, we have to explore these five states. But now we have a decision point. Now we have a choice between going left and going right. And before, when DFS and BFS would just pick arbitrarily, because it just depends on the order you throw these two nodes into the frontier, and we didn’t specify what order you put them into the frontier, only the order you take them out, here we can look at 13 and 11 and say that, all right, this square is a distance of 11 away from the goal according to our heuristic, according to our estimate. And this one, we estimate to be 13 away from the goal. So between those two options, between these two choices, I’d rather have the 11. I’d rather be 11 steps away from the goal, so I’ll go to the right. We’re able to make an informed decision, because we know a little something more about this problem. So then we keep following, 10, 9, 8. Between the two 7s, we don’t really have much of a way to know between those. So then we do just have to make an arbitrary choice. And you know what, maybe we choose wrong. But that’s OK, because now we can still say, all right, let’s try this 7. We say 7, 6, we have to make this choice, even though it increases the value of the heuristic function. But now we have another decision point, between 6 and 8, and between those two. And really, we’re also considering this 13, but that’s much higher. Between 6, 8, and 13, well, the 6 is the smallest value, so we’d rather take the 6. We’re able to make an informed decision that going this way to the right is probably better than going down. So we turn this way, we go to 5. And now we find a decision point where we’ll actually make a decision that we might not want to make, but there’s unfortunately not too much of a way around this. We see 4 and 6. 4 looks closer to the goal, right? It’s going up, and the goal is further up. So we end up taking that route, which ultimately leads us to a dead end. But that’s OK, because we can still say, all right, now let’s try the 6. And now follow this route that will ultimately lead us to the goal. And so this now is how greedy best-for-search might try to approach this problem by saying, whenever we have a decision between multiple nodes that we could explore, let’s explore the node that has the smallest value of h of n, this heuristic function that is estimating how far I have to go. And it just so happens that in this case, we end up doing better in terms of the number of states we needed to explore than BFS needed to. BFS explored all of this section and all of that section, but we were able to eliminate that by taking advantage of this heuristic, this knowledge about how close we are to the goal or some estimate of that idea. So this seems much better. So wouldn’t we always prefer an algorithm like this over an algorithm like breadth-first search? Well, maybe one thing to take into consideration is that we need to come up with a good heuristic, how good the heuristic is, is going to affect how good this algorithm is. And coming up with a good heuristic can oftentimes be challenging. But the other thing to consider is to ask the question, just as we did with the prior two algorithms, is this algorithm optimal? Will it always find the shortest path from the initial state to the goal? And to answer that question, let’s take a look at this example for a moment. Take a look at this example. Again, we’re trying to get from A to B. And again, I’ve labeled each of the cells with their Manhattan distance from the goal. The number of squares up and to the right, you would need to travel in order to get from that square to the goal. And let’s think about, would greedy best-first search that always picks the smallest number end up finding the optimal solution? What is the shortest solution? And would this algorithm find it? And the important thing to realize is that right here is the decision point. We’re estimated to be 12 away from the goal. And we have two choices. We can go to the left, which we estimate to be 13 away from the goal. Or we can go up, where we estimate it to be 11 away from the goal. And between those two, greedy best-first search is going to say the 11 looks better than the 13. And in doing so, greedy best-first search will end up finding this path to the goal. But it turns out this path is not optimal. There is a way to get to the goal using fewer steps. And it’s actually this way, this way that ultimately involved fewer steps, even though it meant at this moment choosing the worst option between the two or what we estimated to be the worst option based on the heuristics. And so this is what we mean by this is a greedy algorithm. It’s making the best decision locally. At this decision point, it looks like it’s better to go here than it is to go to the 13. But in the big picture, it’s not necessarily optimal. That it might find a solution when in actuality, there was a better solution available. So we would like some way to solve this problem. We like the idea of this heuristic, of being able to estimate the path, the distance between us and the goal. And that helps us to be able to make better decisions and to eliminate having to search through entire parts of this state space. But we would like to modify the algorithm so that we can achieve optimality, so that it can be optimal. And what is the way to do this? What is the intuition here? Well, let’s take a look at this problem. In this initial problem, greedy best research found us this solution here, this long path. And the reason why it wasn’t great is because, yes, the heuristic numbers went down pretty low. But later on, they started to build back up. They built back 8, 9, 10, 11, all the way up to 12 in this case. And so how might we go about trying to improve this algorithm? Well, one thing that we might realize is that if we go all the way through this algorithm, through this path, and we end up going to the 12, and we’ve had to take this many steps, who knows how many steps that is, just to get to this 12, we could have also, as an alternative, taken much fewer steps, just six steps, and ended up at this 13 here. And yes, 13 is more than 12, so it looks like it’s not as good. But it required far fewer steps. It only took six steps to get to this 13 versus many more steps to get to this 12. And while greedy best research says, oh, well, 12 is better than 13, so pick the 12, we might more intelligently say, I’d rather be somewhere that heuristically looks like it takes slightly longer if I can get there much more quickly. And we’re going to encode that idea, this general idea, into a more formal algorithm known as A star search. A star search is going to solve this problem by instead of just considering the heuristic, also considering how long it took us to get to any particular state. So the distinction is greedy best for search. If I am in a state right now, the only thing I care about is, what is the estimated distance, the heuristic value, between me and the goal? Whereas A star search will take into consideration two pieces of information. It’ll take into consideration, how far do I estimate I am from the goal? But also, how far did I have to travel in order to get here? Because that is relevant, too. So we’ll search algorithms by expanding the node with the lowest value of g of n plus h of n. h of n is that same heuristic that we were talking about a moment ago that’s going to vary based on the problem. But g of n is going to be the cost to reach the node, how many steps I had to take, in this case, to get to my current position. So what does that search algorithm look like in practice? Well, let’s take a look. Again, we’ve got the same maze. And again, I’ve labeled them with their Manhattan distance. This value is the h of n value, the heuristic estimate of how far each of these squares is away from the goal. But now, as we begin to explore states, we care not just about this heuristic value, but also about g of n, the number of steps I had to take in order to get there. And I care about summing those two numbers together. So what does that look like? On this very first step, I have taken one step. And now I am estimated to be 16 steps away from the goal. So the total value here is 17. Then I take one more step. I’ve now taken two steps. And I estimate myself to be 15 away from the goal, again, a total value of 17. Now I’ve taken three steps. And I’m estimated to be 14 away from the goal, so on and so forth. Four steps, an estimate of 13. Five steps, estimate of 12. And now here’s a decision point. I could either be six steps away from the goal with a heuristic of 13 for a total of 19, or I could be six steps away from the goal with a heuristic of 11 with an estimate of 17 for the total. So between 19 and 17, I’d rather take the 17, the 6 plus 11. So so far, no different than what we saw before. We’re still taking this option because it appears to be better. And I keep taking this option because it appears to be better. But it’s right about here that things get a little bit different. Now I could be 15 steps away from the goal with an estimated distance of 6. So 15 plus 6, total value of 21. Alternatively, I could be six steps away from the goal, because this is five steps away, so this is six steps away, with a total value of 13 as my estimate. So 6 plus 13, that’s 19. So here, we would evaluate g of n plus h of n to be 19, 6 plus 13. Whereas here, we would be 15 plus 6, or 21. And so the intuition is 19 less than 21, pick here. But the idea is ultimately I’d rather be having taken fewer steps, get to a 13, than having taken 15 steps and be at a 6, because it means I’ve had to take more steps in order to get there. Maybe there’s a better path this way. So instead, we’ll explore this route. Now if we go one more, this is seven steps plus 14 is 21. So between those two, it’s sort of a toss-up. We might end up exploring that one anyways. But after that, as these numbers start to get bigger in the heuristic values, and these heuristic values start to get smaller, you’ll find that we’ll actually keep exploring down this path. And you can do the math to see that at every decision point, A star search is going to make a choice based on the sum of how many steps it took me to get to my current position, and then how far I estimate I am from the goal. So while we did have to explore some of these states, the ultimate solution we found was, in fact, an optimal solution. It did find us the quickest possible way to get from the initial state to the goal. And it turns out that A star is an optimal search algorithm under certain conditions. So the conditions are H of n, my heuristic, needs to be admissible. What does it mean for a heuristic to be admissible? Well, a heuristic is admissible if it never overestimates the true cost. H of n always needs to either get it exactly right in terms of how far away I am, or it needs to underestimate. So we saw an example from before where the heuristic value was much smaller than the actual cost it would take. That’s totally fine, but the heuristic value should never overestimate. It should never think that I’m further away from the goal than I actually am. And meanwhile, to make a stronger statement, H of n also needs to be consistent. And what does it mean for it to be consistent? Mathematically, it means that for every node, which we’ll call n, and successor, the node after me, that I’ll call n prime, where it takes a cost of C to make that step, the heuristic value of n needs to be less than or equal to the heuristic value of n prime plus the cost. So it’s a lot of math, but in words what that ultimately means is that if I am here at this state right now, the heuristic value from me to the goal shouldn’t be more than the heuristic value of my successor, the next place I could go to, plus however much it would cost me to just make that step from one step to the next step. And so this is just making sure that my heuristic is consistent between all of these steps that I might take. So as long as this is true, then A star search is going to find me an optimal solution. And this is where much of the challenge of solving these search problems can sometimes come in, that A star search is an algorithm that is known and you could write the code fairly easily, but it’s choosing the heuristic. It can be the interesting challenge. The better the heuristic is, the better I’ll be able to solve the problem in the fewer states that I’ll have to explore. And I need to make sure that the heuristic satisfies these particular constraints. So all in all, these are some of the examples of search algorithms that might work, and certainly there are many more than just this. A star, for example, does have a tendency to use quite a bit of memory. So there are alternative approaches to A star that ultimately use less memory than this version of A star happens to use, and there are other search algorithms that are optimized for other cases as well. But now so far, we’ve only been looking at search algorithms where there is one agent. I am trying to find a solution to a problem. I am trying to navigate my way through a maze. I am trying to solve a 15 puzzle. I am trying to find driving directions from point A to point B. Sometimes in search situations, though, we’ll enter an adversarial situation, where I am an agent trying to make intelligent decisions. And there’s someone else who is fighting against me, so to speak, that has opposite objectives, someone where I am trying to succeed, someone else that wants me to fail. And this is most popular in something like a game, a game like Tic Tac Toe, where we’ve got this 3 by 3 grid, and x and o take turns, either writing an x or an o in any one of these squares. And the goal is to get three x’s in a row if you’re the x player, or three o’s in a row if you’re the o player. And computers have gotten quite good at playing games, Tic Tac Toe very easily, but even more complex games. And so you might imagine, what does an intelligent decision in a game look like? So maybe x makes an initial move in the middle, and o plays up here. What does an intelligent move for x now become? Where should you move if you were x? And it turns out there are a couple of possibilities. But if an AI is playing this game optimally, then the AI might play somewhere like the upper right, where in this situation, o has the opposite objective of x. x is trying to win the game to get three in a row diagonally here. And o is trying to stop that objective, opposite of the objective. And so o is going to place here to try to block. But now, x has a pretty clever move. x can make a move like this, where now x has two possible ways that x can win the game. x could win the game by getting three in a row across here. Or x could win the game by getting three in a row vertically this way. So it doesn’t matter where o makes their next move. o could play here, for example, blocking the three in a row horizontally. But then x is going to win the game by getting a three in a row vertically. And so there’s a fair amount of reasoning that’s going on here in order for the computer to be able to solve a problem. And it’s similar in spirit to the problems we’ve looked at so far. There are actions. There’s some sort of state of the board and some transition from one action to the next. But it’s different in the sense that this is now not just a classical search problem, but an adversarial search problem. That I am at the x player trying to find the best moves to make, but I know that there is some adversary that is trying to stop me. So we need some sort of algorithm to deal with these adversarial type of search situations. And the algorithm we’re going to take a look at is an algorithm called Minimax, which works very well for these deterministic games where there are two players. It can work for other types of games as well. But we’ll look right now at games where I make a move, then my opponent makes a move. And I am trying to win, and my opponent is trying to win also. Or in other words, my opponent is trying to get me to lose. And so what do we need in order to make this algorithm work? Well, any time we try and translate this human concept of playing a game, winning and losing to a computer, we want to translate it in terms that the computer can understand. And ultimately, the computer really just understands the numbers. And so we want some way of translating a game of x’s and o’s on a grid to something numerical, something the computer can understand. The computer doesn’t normally understand notions of win or lose. But it does understand the concept of bigger and smaller. And so what we might do is we might take each of the possible ways that a tic-tac-toe game can unfold and assign a value or a utility to each one of those possible ways. And in a tic-tac-toe game, and in many types of games, there are three possible outcomes. The outcomes are o wins, x wins, or nobody wins. So player one wins, player two wins, or nobody wins. And for now, let’s go ahead and assign each of these possible outcomes a different value. We’ll say o winning, that’ll have a value of negative 1. Nobody winning, that’ll have a value of 0. And x winning, that will have a value of 1. So we’ve just assigned numbers to each of these three possible outcomes. And now we have two players, we have the x player and the o player. And we’re going to go ahead and call the x player the max player. And we’ll call the o player the min player. And the reason why is because in the min and max algorithm, the max player, which in this case is x, is aiming to maximize the score. These are the possible options for the score, negative 1, 0, and 1. x wants to maximize the score, meaning if at all possible, x would like this situation, where x wins the game, and we give it a score of 1. But if this isn’t possible, if x needs to choose between these two options, negative 1, meaning o winning, or 0, meaning nobody winning, x would rather that nobody wins, score of 0, than a score of negative 1, o winning. So this notion of winning and losing and tying has been reduced mathematically to just this idea of try and maximize the score. The x player always wants the score to be bigger. And on the flip side, the min player, in this case o, is aiming to minimize the score. The o player wants the score to be as small as possible. So now we’ve taken this game of x’s and o’s and winning and losing and turned it into something mathematical, something where x is trying to maximize the score, o is trying to minimize the score. Let’s now look at all of the parts of the game that we need in order to encode it in an AI so that an AI can play a game like tic-tac-toe. So the game is going to need a couple of things. We’ll need some sort of initial state that will, in this case, call s0, which is how the game begins, like an empty tic-tac-toe board, for example. We’ll also need a function called player, where the player function is going to take as input a state here represented by s. And the output of the player function is going to be which player’s turn is it. We need to be able to give a tic-tac-toe board to the computer, run it through a function, and that function tells us whose turn it is. We’ll need some notion of actions that we can take. We’ll see examples of that in just a moment. We need some notion of a transition model, same as before. If I have a state and I take an action, I need to know what results as a consequence of it. I need some way of knowing when the game is over. So this is equivalent to kind of like a goal test, but I need some terminal test, some way to check to see if a state is a terminal state, where a terminal state means the game is over. In a classic game of tic-tac-toe, a terminal state means either someone has gotten three in a row or all of the squares of the tic-tac-toe board are filled. Either of those conditions make it a terminal state. In a game of chess, it might be something like when there is checkmate or if checkmate is no longer possible, that that becomes a terminal state. And then finally, we’ll need a utility function, a function that takes a state and gives us a numerical value for that terminal state, some way of saying if x wins the game, that has a value of 1. If o is won the game, that has a value of negative 1. If nobody has won the game, that has a value of 0. So let’s take a look at each of these in turn. The initial state, we can just represent in tic-tac-toe as the empty game board. This is where we begin. It’s the place from which we begin this search. And again, I’ll be representing these things visually, but you can imagine this really just being like an array or a two-dimensional array of all of these possible squares. Then we need the player function that, again, takes a state and tells us whose turn it is. Assuming x makes the first move, if I have an empty game board, then my player function is going to return x. And if I have a game board where x has made a move, then my player function is going to return o. The player function takes a tic-tac-toe game board and tells us whose turn it is. Next up, we’ll consider the actions function. The actions function, much like it did in classical search, takes a state and gives us the set of all of the possible actions we can take in that state. So let’s imagine it’s o is turned to move in a game board that looks like this. What happens when we pass it into the actions function? So the actions function takes this state of the game as input, and the output is a set of possible actions. It’s a set of I could move in the upper left or I could move in the bottom middle. So those are the two possible action choices that I have when I begin in this particular state. Now, just as before, when we had states and actions, we need some sort of transition model to tell us when we take this action in the state, what is the new state that we get. And here, we define that using the result function that takes a state as input as well as an action. And when we apply the result function to this state, saying that let’s let o move in this upper left corner, the new state we get is this resulting state where o is in the upper left corner. And now, this seems obvious to someone who knows how to play tic-tac-toe. Of course, you play in the upper left corner. That’s the board you get. But all of this information needs to be encoded into the AI. The AI doesn’t know how to play tic-tac-toe until you tell the AI how the rules of tic-tac-toe work. And this function, defining this function here, allows us to tell the AI how this game actually works and how actions actually affect the outcome of the game. So the AI needs to know how the game works. The AI also needs to know when the game is over, as by defining a function called terminal that takes as input a state s, such that if we take a game that is not yet over, pass it into the terminal function, the output is false. The game is not over. But if we take a game that is over because x has gotten three in a row along that diagonal, pass that into the terminal function, then the output is going to be true because the game now is, in fact, over. And finally, we’ve told the AI how the game works in terms of what moves can be made and what happens when you make those moves. We’ve told the AI when the game is over. Now we need to tell the AI what the value of each of those states is. And we do that by defining this utility function that takes a state s and tells us the score or the utility of that state. So again, we said that if x wins the game, that utility is a value of 1, whereas if o wins the game, then the utility of that is negative 1. And the AI needs to know, for each of these terminal states where the game is over, what is the utility of that state? So if I give you a game board like this where the game is, in fact, over, and I ask the AI to tell me what the value of that state is, it could do so. The value of the state is 1. Where things get interesting, though, is if the game is not yet over. Let’s imagine a game board like this, where in the middle of the game, it’s o’s turn to make a move. So how do we know it’s o’s turn to make a move? We can calculate that using the player function. We can say player of s, pass in the state, o is the answer. So we know it’s o’s turn to move. And now, what is the value of this board and what action should o take? Well, that’s going to depend. We have to do some calculation here. And this is where the minimax algorithm really comes in. Recall that x is trying to maximize the score, which means that o is trying to minimize the score. So o would like to minimize the total value that we get at the end of the game. And because this game isn’t over yet, we don’t really know just yet what the value of this game board is. We have to do some calculation in order to figure that out. And so how do we do that kind of calculation? Well, in order to do so, we’re going to consider, just as we might in a classical search situation, what actions could happen next and what states will that take us to. And it turns out that in this position, there are only two open squares, which means there are only two open places where o can make a move. o could either make a move in the upper left or o can make a move in the bottom middle. And minimax doesn’t know right out of the box which of those moves is going to be better. So it’s going to consider both. But now, we sort of run into the same situation. Now, I have two more game boards, neither of which is over. What happens next? And now, it’s in this sense that minimax is what we’ll call a recursive algorithm. It’s going to now repeat the exact same process, although now considering it from the opposite perspective. It’s as if I am now going to put myself, if I am the o player, I’m going to put myself in my opponent’s shoes, my opponent as the x player, and consider what would my opponent do if they were in this position? What would my opponent do, the x player, if they were in that position? And what would then happen? Well, the other player, my opponent, the x player, is trying to maximize the score, whereas I am trying to minimize the score as the o player. So x is trying to find the maximum possible value that they can get. And so what’s going to happen? Well, from this board position, x only has one choice. x is going to play here, and they’re going to get three in a row. And we know that that board, x winning, that has a value of 1. If x wins the game, the value of that game board is 1. And so from this position, if this state can only ever lead to this state, it’s the only possible option, and this state has a value of 1, then the maximum possible value that the x player can get from this game board is also 1. From here, the only place we can get is to a game with a value of 1, so this game board also has a value of 1. Now we consider this one over here. What’s going to happen now? Well, x needs to make a move. The only move x can make is in the upper left, so x will go there. And in this game, no one wins the game. Nobody has three in a row. And so the value of that game board is 0. Nobody is 1. And so again, by the same logic, if from this board position the only place we can get to is a board where the value is 0, then this state must also have a value of 0. And now here comes the choice part, the idea of trying to minimize. I, as the o player, now know that if I make this choice moving in the upper left, that is going to result in a game with a value of 1, assuming everyone plays optimally. And if I instead play in the lower middle, choose this fork in the road, that is going to result in a game board with a value of 0. I have two options. I have a 1 and a 0 to choose from, and I need to pick. And as the min player, I would rather choose the option with the minimum value. So whenever a player has multiple choices, the min player will choose the option with the smallest value. The max player will choose the option with the largest value. Between the 1 and the 0, the 0 is smaller, meaning I’d rather tie the game than lose the game. And so this game board will say also has a value of 0, because if I am playing optimally, I will pick this fork in the road. I’ll place my o here to block x’s 3 in a row, x will move in the upper left, and the game will be over, and no one will have won the game. So this is now the logic of minimax, to consider all of the possible options that I can take, all of the actions that I can take, and then to put myself in my opponent’s shoes. I decide what move I’m going to make now by considering what move my opponent will make on the next turn. And to do that, I consider what move I would make on the turn after that, so on and so forth, until I get all the way down to the end of the game, to one of these so-called terminal states. In fact, this very decision point, where I am trying to decide as the o player what to make a decision about, might have just been a part of the logic that the x player, my opponent, was using, the move before me. This might be part of some larger tree, where x is trying to make a move in this situation, and needs to pick between three different options in order to make a decision about what to happen. And the further and further away we are from the end of the game, the deeper this tree has to go. Because every level in this tree is going to correspond to one move, one move or action that I take, one move or action that my opponent takes, in order to decide what happens. And in fact, it turns out that if I am the x player in this position, and I recursively do the logic, and see I have a choice, three choices, in fact, one of which leads to a value of 0. If I play here, and if everyone plays optimally, the game will be a tie. If I play here, then o is going to win, and I’ll lose playing optimally. Or here, where I, the x player, can win, well between a score of 0, and negative 1, and 1, I’d rather pick the board with a value of 1, because that’s the maximum value I can get. And so this board would also have a maximum value of 1. And so this tree can get very, very deep, especially as the game starts to have more and more moves. And this logic works not just for tic-tac-toe, but any of these sorts of games, where I make a move, my opponent makes a move, and ultimately, we have these adversarial objectives. And we can simplify the diagram into a diagram that looks like this. This is a more abstract version of the minimax tree, where these are each states, but I’m no longer representing them as exactly like tic-tac-toe boards. This is just representing some generic game that might be tic-tac-toe, might be some other game altogether. Any of these green arrows that are pointing up, that represents a maximizing state. I would like the score to be as big as possible. And any of these red arrows pointing down, those are minimizing states, where the player is the min player, and they are trying to make the score as small as possible. So if you imagine in this situation, I am the maximizing player, this player here, and I have three choices. One choice gives me a score of 5, one choice gives me a score of 3, and one choice gives me a score of 9. Well, then between those three choices, my best option is to choose this 9 over here, the score that maximizes my options out of all the three options. And so I can give this state a value of 9, because among my three options, that is the best choice that I have available to me. So that’s my decision now. You imagine it’s like one move away from the end of the game. But then you could also ask a reasonable question, what might my opponent do two moves away from the end of the game? My opponent is the minimizing player. They are trying to make the score as small as possible. Imagine what would have happened if they had to pick which choice to make. One choice leads us to this state, where I, the maximizing player, am going to opt for 9, the biggest score that I can get. And 1 leads to this state, where I, the maximizing player, would choose 8, which is then the largest score that I can get. Now the minimizing player, forced to choose between a 9 or an 8, is going to choose the smallest possible score, which in this case is an 8. And that is then how this process would unfold, that the minimizing player in this case considers both of their options, and then all of the options that would happen as a result of that. So this now is a general picture of what the minimax algorithm looks like. Let’s now try to formalize it using a little bit of pseudocode. So what exactly is happening in the minimax algorithm? Well, given a state s, we need to decide what to happen. The max player, if it’s max’s player’s turn, then max is going to pick an action a in actions of s. Recall that actions is a function that takes a state and gives me back all of the possible actions that I can take. It tells me all of the moves that are possible. The max player is going to specifically pick an action a in this set of actions that gives me the highest value of min value of result of s and a. So what does that mean? Well, it means that I want to make the option that gives me the highest score of all of the actions a. But what score is that going to have? To calculate that, I need to know what my opponent, the min player, is going to do if they try to minimize the value of the state that results. So we say, what state results after I take this action? And what happens when the min player tries to minimize the value of that state? I consider that for all of my possible options. And after I’ve considered that for all of my possible options, I pick the action a that has the highest value. Likewise, the min player is going to do the same thing but backwards. They’re also going to consider what are all of the possible actions they can take if it’s their turn. And they’re going to pick the action a that has the smallest possible value of all the options. And the way they know what the smallest possible value of all the options is is by considering what the max player is going to do by saying, what’s the result of applying this action to the current state? And then what would the max player try to do? What value would the max player calculate for that particular state? So everyone makes their decision based on trying to estimate what the other person would do. And now we need to turn our attention to these two functions, max value and min value. How do you actually calculate the value of a state if you’re trying to maximize its value? And how do you calculate the value of a state if you’re trying to minimize the value? If you can do that, then we have an entire implementation of this min and max algorithm. So let’s try it. Let’s try and implement this max value function that takes a state and returns as output the value of that state if I’m trying to maximize the value of the state. Well, the first thing I can check for is to see if the game is over. Because if the game is over, in other words, if the state is a terminal state, then this is easy. I already have this utility function that tells me what the value of the board is. If the game is over, I just check, did x win, did o win, is it a tie? And this utility function just knows what the value of the state is. What’s trickier is if the game isn’t over. Because then I need to do this recursive reasoning about thinking, what is my opponent going to do on the next move? And I want to calculate the value of this state. And I want the value of the state to be as high as possible. And I’ll keep track of that value in a variable called v. And if I want the value to be as high as possible, I need to give v an initial value. And initially, I’ll just go ahead and set it to be as low as possible. Because I don’t know what options are available to me yet. So initially, I’ll set v equal to negative infinity, which seems a little bit strange. But the idea here is I want the value initially to be as low as possible. Because as I consider my actions, I’m always going to try and do better than v. And if I set v to negative infinity, I know I can always do better than that. So now I consider my actions. And this is going to be some kind of loop where for every action in actions of state, recall actions as a function that takes my state and gives me all the possible actions that I can use in that state. So for each one of those actions, I want to compare it to v and say, all right, v is going to be equal to the maximum of v and this expression. So what is this expression? Well, first it is get the result of taking the action in the state and then get the min value of that. In other words, let’s say I want to find out from that state what is the best that the min player can do because they’re going to try and minimize the score. So whatever the resulting score is of the min value of that state, compare it to my current best value and just pick the maximum of those two because I am trying to maximize the value. In short, what these three lines of code are doing are going through all of my possible actions and asking the question, how do I maximize the score given what my opponent is going to try to do? After this entire loop, I can just return v and that is now the value of that particular state. And for the min player, it’s the exact opposite of this, the same logic just backwards. To calculate the minimum value of a state, first we check if it’s a terminal state. If it is, we return its utility. Otherwise, we’re going to now try to minimize the value of the state given all of my possible actions. So I need an initial value for v, the value of the state. And initially, I’ll set it to infinity because I know I can always get something less than infinity. So by starting with v equals infinity, I make sure that the very first action I find, that will be less than this value of v. And then I do the same thing, loop over all of my possible actions. And for each of the results that we could get when the max player makes their decision, let’s take the minimum of that and the current value of v. So after all is said and done, I get the smallest possible value of v that I then return back to the user. So that, in effect, is the pseudocode for Minimax. That is how we take a gain and figure out what the best move to make is by recursively using these max value and min value functions, where max value calls min value, min value calls max value back and forth, all the way until we reach a terminal state, at which point our algorithm can simply return the utility of that particular state. So what you might imagine is that this is going to start to be a long process, especially as games start to get more complex, as we start to add more moves and more possible options and games that might last quite a bit longer. So the next question to ask is, what sort of optimizations can we make here? How can we do better in order to use less space or take less time to be able to solve this kind of problem? And we’ll take a look at a couple of possible optimizations. But for one, we’ll take a look at this example. Again, returning to these up arrows and down arrows, let’s imagine that I now am the max player, this green arrow. I am trying to make this score as high as possible. And this is an easy game where there are just two moves. I make a move, one of these three options. And then my opponent makes a move, one of these three options, based on what move I make. And as a result, we get some value. Let’s look at the order in which I do these calculations and figure out if there are any optimizations I might be able to make to this calculation process. I’m going to have to look at these states one at a time. So let’s say I start here on the left and say, all right, now I’m going to consider, what will the min player, my opponent, try to do here? Well, the min player is going to look at all three of their possible actions and look at their value, because these are terminal states. They’re the end of the game. And so they’ll see, all right, this node is a value of four, value of eight, value of five. And the min player is going to say, well, all right, between these three options, four, eight, and five, I’ll take the smallest one. I’ll take the four. So this state now has a value of four. Then I, as the max player, say, all right, if I take this action, it will have a value of four. That’s the best that I can do, because min player is going to try and minimize my score. So now what if I take this option? We’ll explore this next. And now explore what the min player would do if I choose this action. And the min player is going to say, all right, what are the three options? The min player has options between nine, three, and seven. And so three is the smallest among nine, three, and seven. So we’ll go ahead and say this state has a value of three. So now I, as the max player, I have now explored two of my three options. I know that one of my options will guarantee me a score of four, at least. And one of my options will guarantee me a score of three. And now I consider my third option and say, all right, what happens here? Same exact logic. The min player is going to look at these three states, two, four, and six. I’ll say the minimum possible option is two. So the min player wants the two. Now I, as the max player, have calculated all of the information by looking two layers deep, by looking at all of these nodes. And I can now say, between the four, the three, and the two, you know what? I’d rather take the four. Because if I choose this option, if my opponent plays optimally, they will try and get me to the four. But that’s the best I can do. I can’t guarantee a higher score. Because if I pick either of these two options, I might get a three or I might get a two. And it’s true that down here is a nine. And that’s the highest score out of any of the scores. So I might be tempted to say, you know what? Maybe I should take this option because I might get the nine. But if the min player is playing intelligently, if they’re making the best moves at each possible option they have when they get to make a choice, I’ll be left with a three. Whereas I could better, playing optimally, have guaranteed that I would get the four. So that is, in effect, the logic that I would use as a min and max player trying to maximize my score from that node there. But it turns out they took quite a bit of computation for me to figure that out. I had to reason through all of these nodes in order to draw this conclusion. And this is for a pretty simple game where I have three choices, my opponent has three choices, and then the game’s over. So what I’d like to do is come up with some way to optimize this. Maybe I don’t need to do all of this calculation to still reach the conclusion that, you know what, this action to the left, that’s the best that I could do. Let’s go ahead and try again and try to be a little more intelligent about how I go about doing this. So first, I start the exact same way. I don’t know what to do initially, so I just have to consider one of the options and consider what the min player might do. Min has three options, four, eight, and five. And between those three options, min says four is the best they can do because they want to try to minimize the score. Now I, the max player, will consider my second option, making this move here, and considering what my opponent would do in response. What will the min player do? Well, the min player is going to, from that state, look at their options. And I would say, all right, nine is an option, three is an option. And if I am doing the math from this initial state, doing all this calculation, when I see a three, that should immediately be a red flag for me. Because when I see a three down here at this state, I know that the value of this state is going to be at most three. It’s going to be three or something less than three, even though I haven’t yet looked at this last action or even further actions if there were more actions that could be taken here. How do I know that? Well, I know that the min player is going to try to minimize my score. And if they see a three, the only way this could be something other than a three is if this remaining thing that I haven’t yet looked at is less than three, which means there is no way for this value to be anything more than three because the min player can already guarantee a three and they are trying to minimize my score. So what does that tell me? Well, it tells me that if I choose this action, my score is going to be three or maybe even less than three if I’m unlucky. But I already know that this action will guarantee me a four. And so given that I know that this action guarantees me a score of four and this action means I can’t do better than three, if I’m trying to maximize my options, there is no need for me to consider this triangle here. There is no value, no number that could go here that would change my mind between these two options. I’m always going to opt for this path that gets me a four as opposed to this path where the best I can do is a three if my opponent plays optimally. And this is going to be true for all the future states that I look at too. That if I look over here at what min player might do over here, if I see that this state is a two, I know that this state is at most a two because the only way this value could be something other than two is if one of these remaining states is less than a two and so the min player would opt for that instead. So even without looking at these remaining states, I as the maximizing player can know that choosing this path to the left is going to be better than choosing either of those two paths to the right because this one can’t be better than three. This one can’t be better than two. And so four in this case is the best that I can do. So in order to do this cut, and I can say now that this state has a value of four. So in order to do this type of calculation, I was doing a little bit more bookkeeping, keeping track of things, keeping track all the time of what is the best that I can do, what is the worst that I can do, and for each of these states saying, all right, well, if I already know that I can get a four, then if the best I can do at this state is a three, no reason for me to consider it, I can effectively prune this leaf and anything below it from the tree. And it’s for that reason this approach, this optimization to minimax, is called alpha, beta pruning. Alpha and beta stand for these two values that you’ll have to keep track of of the best you can do so far and the worst you can do so far. And pruning is the idea of if I have a big, long, deep search tree, I might be able to search it more efficiently if I don’t need to search through everything, if I can remove some of the nodes to try and optimize the way that I look through this entire search space. So alpha, beta pruning can definitely save us a lot of time as we go about the search process by making our searches more efficient. But even then, it’s still not great as games get more complex. Tic-tac-toe, fortunately, is a relatively simple game. And we might reasonably ask a question like, how many total possible tic-tac-toe games are there? You can think about it. You can try and estimate how many moves are there at any given point, how many moves long can the game last. It turns out there are about 255,000 possible tic-tac-toe games that can be played. But compare that to a more complex game, something like a game of chess, for example. Far more pieces, far more moves, games that last much longer. How many total possible chess games could there be? It turns out that after just four moves each, four moves by the white player, four moves by the black player, that there are 288 billion possible chess games that can result from that situation, after just four moves each. And going even further, if you look at entire chess games and how many possible chess games there could be as a result there, there are more than 10 to the 29,000 possible chess games, far more chess games than could ever be considered. And this is a pretty big problem for the Minimax algorithm, because the Minimax algorithm starts with an initial state, considers all the possible actions, and all the possible actions after that, all the way until we get to the end of the game. And that’s going to be a problem if the computer is going to need to look through this many states, which is far more than any computer could ever do in any reasonable amount of time. So what do we do in order to solve this problem? Instead of looking through all these states which is totally intractable for a computer, we need some better approach. And it turns out that better approach generally takes the form of something called depth-limited Minimax, where normally Minimax is depth-unlimited. We just keep going layer after layer, move after move, until we get to the end of the game. Depth-limited Minimax is instead going to say, you know what, after a certain number of moves, maybe I’ll look 10 moves ahead, maybe I’ll look 12 moves ahead, but after that point, I’m going to stop and not consider additional moves that might come after that, just because it would be computationally intractable to consider all of those possible options. But what do we do after we get 10 or 12 moves deep when we arrive at a situation where the game’s not over? Minimax still needs a way to assign a score to that game board or game state to figure out what its current value is, which is easy to do if the game is over, but not so easy to do if the game is not yet over. So in order to do that, we need to add one additional feature to depth-limited Minimax called an evaluation function, which is just some function that is going to estimate the expected utility of a game from a given state. So in a game like chess, if you imagine that a game value of 1 means white wins, negative 1 means black wins, 0 means it’s a draw, then you might imagine that a score of 0.8 means white is very likely to win, though certainly not guaranteed. And you would have an evaluation function that estimates how good the game state happens to be. And depending on how good that evaluation function is, that is ultimately what’s going to constrain how good the AI is. The better the AI is at estimating how good or how bad any particular game state is, the better the AI is going to be able to play that game. If the evaluation function is worse and not as good as it estimating what the expected utility is, then it’s going to be a whole lot harder. And you can imagine trying to come up with these evaluation functions. In chess, for example, you might write an evaluation function based on how many pieces you have as compared to how many pieces your opponent has, because each one has a value. And your evaluation function probably needs to be a little bit more complicated than that to consider other possible situations that might arise as well. And there are many other variants on Minimax that add additional features in order to help it perform better under these larger, more computationally untractable situations where we couldn’t possibly explore all of the possible moves. So we need to figure out how to use evaluation functions and other techniques to be able to play these games ultimately better. But this now was a look at this kind of adversarial search, these search problems where we have situations where I am trying to play against some sort of opponent. And these search problems show up all over the place throughout artificial intelligence. We’ve been talking a lot today about more classical search problems, like trying to find directions from one location to another. But any time an AI is faced with trying to make a decision, like what do I do now in order to do something that is rational, or do something that is intelligent, or trying to play a game, like figuring out what move to make, these sort of algorithms can really come in handy. It turns out that for tic-tac-toe, the solution is pretty simple because it’s a small game. XKCD has famously put together a web comic where he will tell you exactly what move to make as the optimal move to make no matter what your opponent happens to do. This type of thing is not quite as possible for a much larger game like Checkers or Chess, for example, where chess is totally computationally untractable for most computers to be able to explore all the possible states. So we really need our AI to be far more intelligent about how they go about trying to deal with these problems and how they go about taking this environment that they find themselves in and ultimately searching for one of these solutions. So this, then, was a look at search in artificial intelligence. Next time, we’ll take a look at knowledge, thinking about how it is that our AIs are able to know information, reason about that information, and draw conclusions, all in our look at AI and the principles behind it. We’ll see you next time. [“AIMS INTRO MUSIC”] All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Last time, we took a look at search problems, in particular, where we have AI agents that are trying to solve some sort of problem by taking actions in some sort of environment, whether that environment is trying to take actions by playing moves in a game or whether those actions are something like trying to figure out where to make turns in order to get driving directions from point A to point B. This time, we’re going to turn our attention more generally to just this idea of knowledge, the idea that a lot of intelligence is based on knowledge, especially if we think about human intelligence. People know information. We know facts about the world. And using that information that we know, we’re able to draw conclusions, reason about the information that we know in order to figure out how to do something or figure out some other piece of information that we conclude based on the information we already have available to us. What we’d like to focus on now is the ability to take this idea of knowledge and being able to reason based on knowledge and apply those ideas to artificial intelligence. In particular, we’re going to be building what are known as knowledge-based agents, agents that are able to reason and act by representing knowledge internally. Somehow inside of our AI, they have some understanding of what it means to know something. And ideally, they have some algorithms or some techniques they can use based on that knowledge that they know in order to figure out the solution to a problem or figure out some additional piece of information that can be helpful in some sense. So what do we mean by reasoning based on knowledge to be able to draw conclusions? Well, let’s look at a simple example drawn from the world of Harry Potter. We take one sentence that we know to be true. Imagine if it didn’t rain, then Harry visited Hagrid today. So one fact that we might know about the world. And then we take another fact. Harry visited Hagrid or Dumbledore today, but not both. So it tells us something about the world, that Harry either visited Hagrid but not Dumbledore, or Harry visited Dumbledore but not Hagrid. And now we have a third piece of information about the world that Harry visited Dumbledore today. So we now have three pieces of information now, three facts. Inside of a knowledge base, so to speak, information that we know. And now we, as humans, can try and reason about this and figure out, based on this information, what additional information can we begin to conclude? And well, looking at these last two statements, Harry either visited Hagrid or Dumbledore but not both, and we know that Harry visited Dumbledore today, well, then it’s pretty reasonable that we could draw the conclusion that, you know what, Harry must not have visited Hagrid today. Because based on a combination of these two statements, we can draw this inference, so to speak, a conclusion that Harry did not visit Hagrid today. But it turns out we can even do a little bit better than that, get some more information by taking a look at this first statement and reasoning about that. This first statement says, if it didn’t rain, then Harry visited Hagrid today. So what does that mean? In all cases where it didn’t rain, then we know that Harry visited Hagrid. But if we also know now that Harry did not visit Hagrid, then that tells us something about our initial premise that we were thinking about. In particular, it tells us that it did rain today, because we can reason, if it didn’t rain, that Harry would have visited Hagrid. But we know for a fact that Harry did not visit Hagrid today. So it’s this kind of reason, this sort of logical reasoning, where we use logic based on the information that we know in order to take information and reach conclusions that is going to be the focus of what we’re going to be talking about today. How can we make our artificial intelligence logical so that they can perform the same kinds of deduction, the same kinds of reasoning that we’ve been doing so far? Of course, humans reason about logic generally in terms of human language. That I just now was speaking in English, talking in English about these sentences and trying to reason through how it is that they relate to one another. We’re going to need to be a little bit more formal when we turn our attention to computers and being able to encode this notion of logic and truthhood and falsehood inside of a machine. So we’re going to need to introduce a few more terms and a few symbols that will help us reason through this idea of logic inside of an artificial intelligence. And we’ll begin with the idea of a sentence. Now, a sentence in a natural language like English is just something that I’m saying, like what I’m saying right now. In the context of AI, though, a sentence is just an assertion about the world in what we’re going to call a knowledge representation language, some way of representing knowledge inside of our computers. And the way that we’re going to spend most of today reasoning about knowledge is through a type of logic known as propositional logic. There are a number of different types of logic, some of which we’ll touch on. But propositional logic is based on a logic of propositions, or just statements about the world. And so we begin in propositional logic with a notion of propositional symbols. We will have certain symbols that are oftentimes just letters, something like P or Q or R, where each of those symbols is going to represent some fact or sentence about the world. So P, for example, might represent the fact that it is raining. And so P is going to be a symbol that represents that idea. And Q, for example, might represent Harry visited Hagrid today. Each of these propositional symbols represents some sentence or some fact about the world. But in addition to just having individual facts about the world, we want some way to connect these propositional symbols together in order to reason more complexly about other facts that might exist inside of the world in which we’re reasoning. So in order to do that, we’ll need to introduce some additional symbols that are known as logical connectives. Now, there are a number of these logical connectives. But five of the most important, and the ones we’re going to focus on today, are these five up here, each represented by a logical symbol. Not is represented by this symbol here, and is represented as sort of an upside down V, or is represented by a V shape. Implication, and we’ll talk about what that means in just a moment, is represented by an arrow. And biconditional, again, we’ll talk about what that means in a moment, is represented by these double arrows. But these five logical connectives are the main ones we’re going to be focusing on in terms of thinking about how it is that a computer can reason about facts and draw conclusions based on the facts that it knows. But in order to get there, we need to take a look at each of these logical connectives and build up an understanding for what it is that they actually mean. So let’s go ahead and begin with the not symbol, so this not symbol here. And what we’re going to show for each of these logical connectives is what we’re going to call a truth table, a table that demonstrates what this word not means when we attach it to a propositional symbol or any sentence inside of our logical language. And so the truth table for not is shown right here. If P, some propositional symbol, or some other sentence even, is false, then not P is true. And if P is true, then not P is false. So you can imagine that placing this not symbol in front of some sentence of propositional logic just says the opposite of that. So if, for example, P represented it is raining, then not P would represent the idea that it is not raining. And as you might expect, if P is false, meaning if the sentence, it is raining, is false, well then the sentence not P must be true. The sentence that it is not raining is therefore true. So not, you can imagine, just takes whatever is in P and it inverts it. It turns false into true and true into false, much analogously to what the English word not means, just taking whatever comes after it and inverting it to mean the opposite. Next up, and also very English-like, is this idea of and represented by this upside-down V shape or this point shape. And as opposed to just taking a single argument the way not does, we have P and we have not P. And is going to combine two different sentences in propositional logic together. So I might have one sentence P and another sentence Q, and I want to combine them together to say P and Q. And the general logic for what P and Q means is it means that both of its operands are true. P is true and also Q is true. And so here’s what that truth table looks like. This time we have two variables, P and Q. And when we have two variables, each of which can be in two possible states, true or false, that leads to two squared or four possible combinations of truth and falsehood. So we have P is false and Q is false. We have P is false and Q is true. P is true and Q is false. And then P and Q both are true. And those are the only four possibilities for what P and Q could mean. And in each of those situations, this third column here, P and Q, is telling us a little bit about what it actually means for P and Q to be true. And we see that the only case where P and Q is true is in this fourth row here, where P happens to be true, Q also happens to be true. And in all other situations, P and Q is going to evaluate to false. So this, again, is much in line with what our intuition of and might mean. If I say P and Q, I probably mean that I expect both P and Q to be true. Next up, also potentially consistent with what we mean, is this word or, represented by this V shape, sort of an upside down and symbol. And or, as the name might suggest, is true if either of its arguments are true, as long as P is true or Q is true, then P or Q is going to be true. Which means the only time that P or Q is false is if both of its operands are false. If P is false and Q is false, then P or Q is going to be false. But in all other cases, at least one of the operands is true. Maybe they’re both true, in which case P or Q is going to evaluate to true. Now, this is mostly consistent with the way that most people might use the word or, in the sense of speaking the word or in normal English, though there is sometimes when we might say or, where we mean P or Q, but not both, where we mean, sort of, it can only be one or the other. It’s important to note that this symbol here, this or, means P or Q or both, that those are totally OK. As long as either or both of them are true, then the or is going to evaluate to be true, as well. It’s only in the case where all of the operands are false that P or Q ultimately evaluates to false, as well. In logic, there’s another symbol known as the exclusive or, which encodes this idea of exclusivity of one or the other, but not both. But we’re not going to be focusing on that today. Whenever we talk about or, we’re always talking about either or both, in this case, as represented by this truth table here. So that now is not an and an or. And next up is what we might call implication, as denoted by this arrow symbol. So we have P and Q. And this sentence here will generally read as P implies Q. And what P implies Q means is that if P is true, then Q is also true. So I might say something like, if it is raining, then I will be indoors. Meaning, it is raining implies I will be indoors, as the logical sentence that I’m saying there. And the truth table for this can sometimes be a little bit tricky. So obviously, if P is true and Q is true, then P implies Q. That’s true. That definitely makes sense. And it should also stand to reason that when P is true and Q is false, then P implies Q is false. Because if I said to you, if it is raining, then I will be out indoors. And it is raining, but I’m not indoors? Well, then it would seem to be that my original statement was not true. P implies Q means that if P is true, then Q also needs to be true. And if it’s not, well, then the statement is false. What’s also worth noting, though, is what happens when P is false. When P is false, the implication makes no claim at all. If I say something like, if it is raining, then I will be indoors. And it turns out it’s not raining. Then in that case, I am not making any statement as to whether or not I will be indoors or not. P implies Q just means that if P is true, Q must be true. But if P is not true, then we make no claim about whether or not Q is true at all. So in either case, if P is false, it doesn’t matter what Q is. Whether it’s false or true, we’re not making any claim about Q whatsoever. We can still evaluate the implication to true. The only way that the implication is ever false is if our premise, P, is true, but the conclusion that we’re drawing Q happens to be false. So in that case, we would say P does not imply Q in that case. Finally, the last connective that we’ll discuss is this bi-conditional. You can think of a bi-conditional as a condition that goes in both directions. So originally, when I said something like, if it is raining, then I will be indoors. I didn’t say what would happen if it wasn’t raining. Maybe I’ll be indoors, maybe I’ll be outdoors. This bi-conditional, you can read as an if and only if. So I can say, I will be indoors if and only if it is raining, meaning if it is raining, then I will be indoors. And if I am indoors, it’s reasonable to conclude that it is also raining. So this bi-conditional is only true when P and Q are the same. So if P is true and Q is true, then this bi-conditional is also true. P implies Q, but also the reverse is true. Q also implies P. So if P and Q both happen to be false, we would still say it’s true. But in any of these other two situations, this P if and only if Q is going to ultimately evaluate to false. So a lot of trues and falses going on there, but these five basic logical connectives are going to form the core of the language of propositional logic, the language that we’re going to use in order to describe ideas, and the language that we’re going to use in order to reason about those ideas in order to draw conclusions. So let’s now take a look at some of the additional terms that we’ll need to know about in order to go about trying to form this language of propositional logic and writing AI that’s actually able to understand this sort of logic. The next thing we’re going to need is the notion of what is actually true about the world. We have a whole bunch of propositional symbols, P and Q and R and maybe others, but we need some way of knowing what actually is true in the world. Is P true or false? Is Q true or false? So on and so forth. And to do that, we’ll introduce the notion of a model. A model just assigns a truth value, where a truth value is either true or false, to every propositional symbol. In other words, it’s creating what we might call a possible world. So let me give an example. If, for example, I have two propositional symbols, P is it is raining and Q is it is a Tuesday, a model just takes each of these two symbols and assigns a truth value to them, either true or false. So here’s a sample model. In this model, in other words, in this possible world, it is possible that P is true, meaning it is raining, and Q is false, meaning it is not a Tuesday. But there are other possible worlds or other models as well. There is some model where both of these variables are true, some model where both of these variables are false. In fact, if there are n variables that are propositional symbols like this that are either true or false, then the number of possible models is 2 to the n, because each of these possible models, possible variables within my model, could be set to either true or false if I don’t know any information about it. So now that I have the symbols and the connectives that I’m going to need in order to construct these parts of knowledge, we need some way to represent that knowledge. And to do so, we’re going to allow our AI access to what we’ll call a knowledge base. And a knowledge base is really just a set of sentences that our AI knows to be true. Some set of sentences in propositional logic that are things that our AI knows about the world. And so we might tell our AI some information, information about a situation that it finds itself in, or a situation about a problem that it happens to be trying to solve. And we would give that information to the AI that the AI would store inside of its knowledge base. And what happens next is the AI would like to use that information in the knowledge base to be able to draw conclusions about the rest of the world. And what do those conclusions look like? Well, to understand those conclusions, we’ll need to introduce one more idea, one more symbol. And that is the notion of entailment. So this sentence here, with this double turnstile in these Greek letters, this is the Greek letter alpha and the Greek letter beta. And we read this as alpha entails beta. And alpha and beta here are just sentences in propositional logic. And what this means is that alpha entails beta means that in every model, in other words, in every possible world in which sentence alpha is true, then sentence beta is also true. So if something entails something else, if alpha entails beta, it means that if I know alpha to be true, then beta must therefore also be true. So if my alpha is something like I know that it is a Tuesday in January, then a reasonable beta might be something like I know that it is January. Because in all worlds where it is a Tuesday in January, I know for sure that it must be January, just by definition. This first statement or sentence about the world entails the second statement. And we can reasonably use deduction based on that first sentence to figure out that the second sentence is, in fact, true as well. And ultimately, it’s this idea of entailment that we’re going to try and encode into our computer. We want our AI agent to be able to figure out what the possible entailments are. We want our AI to be able to take these three sentences, sentences like, if it didn’t rain, Harry visited Hagrid. That Harry visited Hagrid or Dumbledore, but not both. And that Harry visited Dumbledore. And just using that information, we’d like our AI to be able to infer or figure out that using these three sentences inside of a knowledge base, we can draw some conclusions. In particular, we can draw the conclusions here that, one, Harry did not visit Hagrid today. And we can draw the entailment, too, that it did, in fact, rain today. And this process is known as inference. And that’s what we’re going to be focusing on today, this process of deriving new sentences from old ones, that I give you these three sentences, you put them in the knowledge base in, say, the AI. And the AI is able to use some sort of inference algorithm to figure out that these two sentences must also be true. And that is how we define inference. So let’s take a look at an inference example to see how we might actually go about inferring things in a human sense before we take a more algorithmic approach to see how we could encode this idea of inference in AI. And we’ll see there are a number of ways that we can actually achieve this. So again, we’ll deal with a couple of propositional symbols. We’ll deal with P, Q, and R. P is it is a Tuesday. Q is it is raining. And R is Harry will go for a run, three propositional symbols that we are just defining to mean this. We’re not saying anything yet about whether they’re true or false. We’re just defining what they are. Now, we’ll give ourselves or an AI access to a knowledge base, abbreviated to KB, the knowledge that we know about the world. We know this statement. All right. So let’s try to parse it. The parentheses here are just used for precedent, so we can see what associates with what. But you would read this as P and not Q implies R. All right. So what does that mean? Let’s put it piece by piece. P is it is a Tuesday. Q is it is raining, so not Q is it is not raining, and implies R is Harry will go for a run. So the way to read this entire sentence in human natural language at least is if it is a Tuesday and it is not raining, then Harry will go for a run. So if it is a Tuesday and it is not raining, then Harry will go for a run. And that is now inside of our knowledge base. And let’s now imagine that our knowledge base has two other pieces of information as well. It has information that P is true, that it is a Tuesday. And we also have the information not Q, that it is not raining, that this sentence Q, it is raining, happens to be false. And those are the three sentences that we have access to. P and not Q implies R, P and not Q. Using that information, we should be able to draw some inferences. P and not Q is only true if both P and not Q are true. All right, we know that P is true and we know that not Q is true. So we know that this whole expression is true. And the definition of implication is if this whole thing on the left is true, then this thing on the right must also be true. So if we know that P and not Q is true, then R must be true as well. So the inference we should be able to draw from all of this is that R is true and we know that Harry will go for a run by taking this knowledge inside of our knowledge base and being able to reason based on that idea. And so this ultimately is the beginning of what we might consider to be some sort of inference algorithm, some process that we can use to try and figure out whether or not we can draw some conclusion. And ultimately, what these inference algorithms are going to answer is the central question about entailment. Given some query about the world, something we’re wondering about the world, and we’ll call that query alpha, the question we want to ask using these inference algorithms is does KB, our knowledge base, entail alpha? In other words, using only the information we know inside of our knowledge base, the knowledge that we have access to, can we conclude that this sentence alpha is true? And that’s ultimately what we would like to do. So how can we do that? How can we go about writing an algorithm that can look at this knowledge base and figure out whether or not this query alpha is actually true? Well, it turns out there are a couple of different algorithms for doing so. And one of the simplest, perhaps, is known as model checking. Now, remember that a model is just some assignment of all of the propositional symbols inside of our language to a truth value, true or false. And you can think of a model as a possible world, that there are many possible worlds where different things might be true or false, and we can enumerate all of them. And the model checking algorithm does exactly that. So what does our model checking algorithm do? Well, if we wanted to determine if our knowledge base entails some query alpha, then we are going to enumerate all possible models. In other words, consider all possible values of true and false for our variables, all possible states in which our world can be in. And if in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha. So let’s take a closer look at that sentence and try and figure out what it actually means. If we know that in every model, in other words, in every possible world, no matter what assignment of true and false to variables you give, if we know that whenever our knowledge is true, what we know to be true is true, that this query alpha is also true, well, then it stands to reason that as long as our knowledge base is true, then alpha must also be true. And so this is going to form the foundation of our model checking algorithm. We’re going to enumerate all of the possible worlds and ask ourselves whenever the knowledge base is true, is alpha true? And if that’s the case, then we know alpha to be true. And otherwise, there is no entailment. Our knowledge base does not entail alpha. All right. So this is a little bit abstract, but let’s take a look at an example to try and put real propositional symbols to this idea. So again, we’ll work with the same example. P is it is a Tuesday, Q is it is raining, R as Harry will go for a run. Our knowledge base contains these pieces of information. P and not Q implies R. We also know P. It is a Tuesday and not Q. It is not raining. And our query, our alpha in this case, the thing we want to ask is R. We want to know, is it guaranteed? Is it entailed that Harry will go for a run? So the first step is to enumerate all of the possible models. We have three propositional symbols here, P, Q, and R, which means we have 2 to the third power, or eight possible models. All false, false, false true, false true, false, false true, true, et cetera. Eight possible ways you could assign true and false to all of these models. And we might ask in each one of them, is the knowledge base true? Here are the set of things that we know. In which of these worlds could this knowledge base possibly apply to? In which world is this knowledge base true? Well, in the knowledge base, for example, we know P. We know it is a Tuesday, which means we know that these four first four rows where P is false, none of those are going to be true or are going to work for this particular knowledge base. Our knowledge base is not true in those worlds. Likewise, we also know not Q. We know that it is not raining. So any of these models where Q is true, like these two and these two here, those aren’t going to work either because we know that Q is not true. And finally, we also know that P and not Q implies R, which means that when P is true or P is true here and Q is false, Q is false in these two, then R must be true. And if ever P is true, Q is false, but R is also false, well, that doesn’t satisfy this implication here. That implication does not hold true under those situations. So we could say that for our knowledge base, we can conclude under which of these possible worlds is our knowledge base true and under which of the possible worlds is our knowledge base false. And it turns out there is only one possible world where our knowledge base is actually true. In some cases, there might be multiple possible worlds where the knowledge base is true. But in this case, it just so happens that there’s only one, one possible world where we can definitively say something about our knowledge base. And in this case, we would look at the query. The query of R is R true, R is true, and so as a result, we can draw that conclusion. And so this is this idea of model check-in. Enumerate all the possible models and look in those possible models to see whether or not, if our knowledge base is true, is the query in question true as well. So let’s now take a look at how we might actually go about writing this in a programming language like Python. Take a look at some actual code that would encode this notion of propositional symbols and logic and these connectives like and and or and not and implication and so forth and see what that code might actually look like. So I’ve written in advance a logic library that’s more detailed than we need to worry about entirely today. But the important thing is that we have one class for every type of logical symbol or connective that we might have. So we just have one class for logical symbols, for example, where every symbol is going to represent and store some name for that particular symbol. And we also have a class for not that takes an operand. So we might say not one symbol to say something is not true or some other sentence is not true. We have one for and, one for or, so on and so forth. And I’ll just demonstrate how this works. And you can take a look at the actual logic.py later on. But I’ll go ahead and call this file harry.py. We’re going to store information about this world of Harry Potter, for example. So I’ll go ahead and import from my logic module. I’ll import everything. And in this library, in order to create a symbol, you use capital S symbol. And I’ll create a symbol for rain, to mean it is raining, for example. And I’ll create a symbol for Hagrid, to mean Harry visited Hagrid, is what this symbol is going to mean. So this symbol means it is raining. This symbol means Harry visited Hagrid. And I’ll add another symbol called Dumbledore for Harry visited Dumbledore. Now, I’d like to save these symbols so that I can use them later as I do some logical analysis. So I’ll go ahead and save each one of them inside of a variable. So like rain, Hagrid, and Dumbledore, so you could call the variables anything. And now that I have these logical symbols, I can use logical connectives to combine them together. So for example, if I have a sentence like and rain and Hagrid, for example, which is not necessarily true, but just for demonstration, I can now try and print out sentence.formula, which is a function I wrote that takes a sentence in propositional logic and just prints it out so that we, the programmers, can now see this in order to get an understanding for how it actually works. So if I run python harry.py, what we’ll see is this sentence in propositional logic, rain and Hagrid. This is the logical representation of what we have here in our Python program of saying and whose arguments are rain and Hagrid. So we’re saying rain and Hagrid by encoding that idea. And this is quite common in Python object-oriented programming, where you have a number of different classes, and you pass arguments into them in order to create a new and object, for example, in order to represent this idea. But now what I’d like to do is somehow encode the knowledge that I have about the world in order to solve that problem from the beginning of class, where we talked about trying to figure out who Harry visited and trying to figure out if it’s raining or if it’s not raining. And so what knowledge do I have? I’ll go ahead and create a new variable called knowledge. And what do I know? Well, I know the very first sentence that we talked about was the idea that if it is not raining, then Harry will visit Hagrid. So all right, how do I encode the idea that it is not raining? Well, I can use not and then the rain symbol. So here’s me saying that it is not raining. And now the implication is that if it is not raining, then Harry visited Hagrid. So I’ll wrap this inside of an implication to say, if it is not raining, this first argument to the implication will then Harry visited Hagrid. So I’m saying implication, the premise is that it’s not raining. And if it is not raining, then Harry visited Hagrid. And I can print out knowledge.formula to see the logical formula equivalent of that same idea. So I run Python of harry.py. And this is the logical formula that we see as a result, which is a text-based version of what we were looking at before, that if it is not raining, then that implies that Harry visited Hagrid. But there was additional information that we had access to as well. In this case, we had access to the fact that Harry visited either Hagrid or Dumbledore. So how do I encode that? Well, this means that in my knowledge, I’ve really got multiple pieces of knowledge going on. I know one thing and another thing and another thing. So I’ll go ahead and wrap all of my knowledge inside of an and. And I’ll move things on to new lines just for good measure. But I know multiple things. So I’m saying knowledge is an and of multiple different sentences. I know multiple different sentences to be true. One such sentence that I know to be true is this implication, that if it is not raining, then Harry visited Hagrid. Another such sentence that I know to be true is or Hagrid Dumbledore. In other words, Hagrid or Dumbledore is true, because I know that Harry visited Hagrid or Dumbledore. But I know more than that, actually. That initial sentence from before said that Harry visited Hagrid or Dumbledore, but not both. So now I want a sentence that will encode the idea that Harry didn’t visit both Hagrid and Dumbledore. Well, the notion of Harry visiting Hagrid and Dumbledore would be represented like this, and of Hagrid and Dumbledore. And if that is not true, if I want to say not that, then I’ll just wrap this whole thing inside of a not. So now these three lines, line 8 says that if it is not raining, then Harry visited Hagrid. Line 9 says Harry visited Hagrid or Dumbledore. And line 10 says Harry didn’t visit both Hagrid and Dumbledore, that it is not true that both the Hagrid symbol and the Dumbledore symbol are true. Only one of them can be true. And finally, the last piece of information that I knew was the fact that Harry visited Dumbledore. So these now are the pieces of knowledge that I know, one sentence and another sentence and another and another. And I can print out what I know just to see it a little bit more visually. And here now is a logical representation of the information that my computer is now internally representing using these various different Python objects. And again, take a look at logic.py if you want to take a look at how exactly it’s implementing this, but no need to worry too much about all of the details there. We’re here saying that if it is not raining, then Harry visited Hagrid. We’re saying that Hagrid or Dumbledore is true. And we’re saying it is not the case that Hagrid and Dumbledore is true, that they’re not both true. And we also know that Dumbledore is true. So this long logical sentence represents our knowledge base. It is the thing that we know. And now what we’d like to do is we’d like to use model checking to ask a query, to ask a question like, based on this information, do I know whether or not it’s raining? And we as humans were able to logic our way through it and figure out that, all right, based on these sentences, we can conclude this and that to figure out that, yes, it must have been raining. But now we’d like for the computer to do that as well. So let’s take a look at the model checking algorithm that is going to follow that same pattern that we drew out in pseudocode a moment ago. So I’ve defined a function here in logic.py that you can take a look at called model check. Model check takes two arguments, the knowledge that I already know, and the query. And the idea is, in order to do model checking, I need to enumerate all of the possible models. And for each of the possible models, I need to ask myself, is the knowledge base true? And is the query true? So the first thing I need to do is somehow enumerate all of the possible models, meaning for all possible symbols that exist, I need to assign true and false to each one of them and see whether or not it’s still true. And so here is the way we’re going to do that. We’re going to start. So I’ve defined another helper function internally that we’ll get to in just a moment. But this function starts by getting all of the symbols in both the knowledge and the query, by figuring out what symbols am I dealing with. In this case, the symbols I’m dealing with are rain and Hagrid and Dumbledore, but there might be other symbols depending on the problem. And we’ll take a look soon at some examples of situations where ultimately we’re going to need some additional symbols in order to represent the problem. And then we’re going to run this check all function, which is a helper function that’s basically going to recursively call itself checking every possible configuration of propositional symbols. So we start out by looking at this check all function. And what do we do? So if not symbols means if we finish assigning all of the symbols. We’ve assigned every symbol a value. So far we haven’t done that, but if we ever do, then we check. In this model, is the knowledge true? That’s what this line is saying. If we evaluate the knowledge propositional logic formula using the model’s assignment of truth values, is the knowledge true? If the knowledge is true, then we should return true only if the query is true. Because if the knowledge is true, we want the query to be true as well in order for there to be entailment. Otherwise, we don’t know that there otherwise there won’t be an entailment if there’s ever a situation where what we know in our knowledge is true, but the query, the thing we’re asking, happens to be false. So this line here is checking that same idea that in all worlds where the knowledge is true, the query must also be true. Otherwise, we can just return true because if the knowledge isn’t true, then we don’t care. This is equivalent to when we were enumerating this table from a moment ago. In all situations where the knowledge base wasn’t true, all of these seven rows here, we didn’t care whether or not our query was true or not. We only care to check whether the query is true when the knowledge base is actually true, which was just this green highlighted row right there. So that logic is encoded using that statement there. And otherwise, if we haven’t assigned symbols yet, which we haven’t seen anything yet, then the first thing we do is pop one of the symbols. I make a copy of the symbols first just to save an existing copy. But I pop one symbol off of the remaining symbols so that I just pick one symbol at random. And I create one copy of the model where that symbol is true. And I create a second copy of the model where that symbol is false. So I now have two copies of the model, one where the symbol is true and one where the symbol is false. And I need to make sure that this entailment holds in both of those models. So I recursively check all on the model where the statement is true and check all on the model where the statement is false. So again, you can take a look at that function to try to get a sense for how exactly this logic is working. But in effect, what it’s doing is recursively calling this check all function again and again and again. And on every level of the recursion, we’re saying let’s pick a new symbol that we haven’t yet assigned, assign it to true and assign it to false, and then check to make sure that the entailment holds in both cases. Because ultimately, I need to check every possible world. I need to take every combination of symbols and try every combination of true and false in order to figure out whether the entailment relation actually holds. So that function we’ve written for you. But in order to use that function inside of harry.py, what I’ll write is something like this. I would like to model check based on the knowledge. And then I provide as a second argument what the query is, what the thing I want to ask is. And what I want to ask in this case is, is it raining? So model check again takes two arguments. The first argument is the information that I know, this knowledge, which in this case is this information that was given to me at the beginning. And the second argument, rain, is encoding the idea of the query. What am I asking? I would like to ask, based on this knowledge, do I know for sure that it is raining? And I can try and print out the result of that. And when I run this program, I see that the answer is true. That based on this information, I can conclusively say that it is raining, because using this model checking algorithm, we were able to check that in every world where this knowledge is true, it is raining. In other words, there is no world where this knowledge is true, and it is not raining. So you can conclude that it is, in fact, raining. And this sort of logic can be applied to a number of different types of problems, that if confronted with a problem where some sort of logical deduction can be used in order to try to solve it, you might try thinking about what propositional symbols you might need in order to represent that information, and what statements and propositional logic you might use in order to encode that information which you know. And this process of trying to take a problem and figure out what propositional symbols to use in order to encode that idea, or how to represent it logically, is known as knowledge engineering. That software engineers and AI engineers will take a problem and try and figure out how to distill it down into knowledge that is representable by a computer. And if we can take any general purpose problem, some problem that we find in the human world, and turn it into a problem that computers know how to solve as by using any number of different variables, well, then we can take a computer that is able to do something like model checking or some other inference algorithm and actually figure out how to solve that problem. So now we’ll take a look at two or three examples of knowledge engineering and practice, of taking some problem and figuring out how we can apply logical symbols and use logical formulas to be able to encode that idea. And we’ll start with a very popular board game in the US and the UK known as Clue. Now, in the game of Clue, there’s a number of different factors that are going on. But the basic premise of the game, if you’ve never played it before, is that there are a number of different people. For now, we’ll just use three, Colonel Mustard, Professor Plumb, and Miss Scarlet. There are a number of different rooms, like a ballroom, a kitchen, and a library. And there are a number of different weapons, a knife, a revolver, and a wrench. And three of these, one person, one room, and one weapon, is the solution to the mystery, the murderer and what room they were in and what weapon they happened to use. And what happens at the beginning of the game is that all these cards are randomly shuffled together. And three of them, one person, one room, and one weapon, are placed into a sealed envelope that we don’t know. And we would like to figure out, using some sort of logical process, what’s inside the envelope, which person, which room, and which weapon. And we do so by looking at some, but not all, of these cards here, by looking at these cards to try and figure out what might be going on. And so this is a very popular game. But let’s now try and formalize it and see if we could train a computer to be able to play this game by reasoning through it logically. So in order to do this, we’ll begin by thinking about what propositional symbols we’re ultimately going to need. Remember, again, that propositional symbols are just some symbol, some variable, that can be either true or false in the world. And so in this case, the propositional symbols are really just going to correspond to each of the possible things that could be inside the envelope. Mustard is a propositional symbol that, in this case, will just be true if Colonel Mustard is inside the envelope, if he is the murderer, and false otherwise. And likewise for Plum, for Professor Plum, and Scarlet, for Miss Scarlet. And likewise for each of the rooms and for each of the weapons. We have one propositional symbol for each of these ideas. Then using those propositional symbols, we can begin to create logical sentences, create knowledge that we know about the world. So for example, we know that someone is the murderer, that one of the three people is, in fact, the murderer. And how would we encode that? Well, we don’t know for sure who the murderer is. But we know it is one person or the second person or the third person. So I could say something like this. Mustard or Plum or Scarlet. And this piece of knowledge encodes that one of these three people is the murderer. We don’t know which, but one of these three things must be true. What other information do we know? Well, we know that, for example, one of the rooms must have been the room in the envelope. The crime was committed either in the ballroom or the kitchen or the library. Again, right now, we don’t know which. But this is knowledge we know at the outset, knowledge that one of these three must be inside the envelope. And likewise, we can say the same thing about the weapon, that it was either the knife or the revolver or the wrench, that one of those weapons must have been the weapon of choice and therefore the weapon in the envelope. And then as the game progresses, the gameplay works by people get various different cards. And using those cards, you can deduce information. That if someone gives you a card, for example, I have the Professor Plum card in my hand, then I know the Professor Plum card can’t be inside the envelope. I know that Professor Plum is not the criminal, so I know a piece of information like not Plum, for example. I know that Professor Plum has to be false. This propositional symbol is not true. And sometimes I might not know for sure that a particular card is not in the middle, but sometimes someone will make a guess and I’ll know that one of three possibilities is not true. Someone will guess Colonel Mustard in the library with the revolver or something to that effect. And in that case, a card might be revealed that I don’t see. But if it is a card and it is either Colonel Mustard or the revolver or the library, then I know that at least one of them can’t be in the middle. So I know something like it is either not Mustard or it is not the library or it is not the revolver. Now maybe multiple of these are not true, but I know that at least one of Mustard, Library, and Revolver must, in fact, be false. And so this now is a propositional logic representation of this game of Clue, a way of encoding the knowledge that we know inside this game using propositional logic that a computer algorithm, something like model checking that we saw a moment ago, can actually look at and understand. So let’s now take a look at some code to see how this algorithm might actually work in practice. All right, so I’m now going to open up a file called Clue.py, which I’ve started already. And what we’ll see here is I’ve defined a couple of things. To find some symbols initially, notice I have a symbol for Colonel Mustard, a symbol for Professor Plum, a symbol for Miss Scarlett, all of which I’ve put inside of this list of characters. I have a symbol for Ballroom and Kitchen and Library inside of a list of rooms. And then I have symbols for Knife and Revolver and Wrench. These are my weapons. And so all of these characters and rooms and weapons altogether, those are my symbols. And now I also have this check knowledge function. And what the check knowledge function does is it takes my knowledge and it’s going to try and draw conclusions about what I know. So for example, we’ll loop over all of the possible symbols and we’ll check, do I know that that symbol is true? And a symbol is going to be something like Professor Plum or the Knife or the Library. And if I know that it is true, in other words, I know that it must be the card in the envelope, then I’m going to print out using a function called cprint, which prints things in color. I’m going to print out the word yes, and I’m going to print that in green, just to make it very clear to us. If we’re not sure that the symbol is true, maybe I can check to see if I’m sure that the symbol is not true. Like if I know for sure that it is not Professor Plum, for example. And I do that by running model check again, this time checking if my knowledge is not the symbol, if I know for sure that the symbol is not true. And if I don’t know for sure that the symbol is not true, because I say if not model check, meaning I’m not sure that the symbol is false, well, then I’ll go ahead and print out maybe next to the symbol. Because maybe the symbol is true, maybe it’s not, I don’t actually know. So what knowledge do I actually have? Well, let’s try and represent my knowledge now. So my knowledge is, I know a couple of things, so I’ll put them in an and. And I know that one of the three people must be the criminal. So I know or mustard, plum, scarlet. This is my way of encoding that it is either Colonel Mustard or Professor Plum or Miss Scarlet. I know that it must have happened in one of the rooms. So I know or ballroom, kitchen, library, for example. And I know that one of the weapons must have been used as well. So I know or knife, revolver, wrench. So that might be my initial knowledge, that I know that it must have been one of the people, I know it must have been in one of the rooms, and I know that it must have been one of the weapons. And I can see what that knowledge looks like as a formula by printing out knowledge.formula. So I’ll run python clue.py. And here now is the information that I know in logical format. I know that it is Colonel Mustard or Professor Plum or Miss Scarlet. And I know that it is the ballroom, the kitchen, or the library. And I know that it is the knife, the revolver, or the wrench. But I don’t know much more than that. I can’t really draw any firm conclusions. And in fact, we can see that if I try and do, let me go ahead and run my knowledge check function on my knowledge. Knowledge check is this function that I, or check knowledge rather, is this function that I just wrote that looks over all of the symbols and tries to see what conclusions I can actually draw about any of the symbols. So I’ll go ahead and run clue.py and see what it is that I know. And it seems that I don’t really know anything for sure. I have all three people are maybes, all three of the rooms are maybes, all three of the weapons are maybes. I don’t really know anything for certain just yet. But now let me try and add some additional information and see if additional information, additional knowledge, can help us to logically reason our way through this process. And we are just going to provide the information. Our AI is going to take care of doing the inference and figuring out what conclusions it’s able to draw. So I start with some cards. And those cards tell me something. So if I have the kernel mustard card, for example, I know that the mustard symbol must be false. In other words, mustard is not the one in the envelope, is not the criminal. So I can say, knowledge supports something called, every and in this library supports dot add, which is a way of adding knowledge or adding an additional logical sentence to an and clause. So I can say, knowledge dot add, not mustard. I happen to know, because I have the mustard card, that kernel mustard is not the suspect. And maybe I have a couple of other cards too. Maybe I also have a card for the kitchen. So I know it’s not the kitchen. And maybe I have another card that says that it is not the revolver. So I have three cards, kernel mustard, the kitchen, and the revolver. And I encode that into my AI this way by saying, it’s not kernel mustard, it’s not the kitchen, and it’s not the revolver. And I know those to be true. So now, when I rerun clue.py, we’ll see that I’ve been able to eliminate some possibilities. Before, I wasn’t sure if it was the knife or the revolver or the wrench. If a knife was maybe, a revolver was maybe, wrench is maybe. Now I’m down to just the knife and the wrench. Between those two, I don’t know which one it is. They’re both maybes. But I’ve been able to eliminate the revolver, which is one that I know to be false, because I have the revolver card. And so additional information might be acquired over the course of this game. And we would represent that just by adding knowledge to our knowledge set or knowledge base that we’ve been building here. So if, for example, we additionally got the information that someone made a guess, someone guessed like Miss Scarlet in the library with the wrench. And we know that a card was revealed, which means that one of those three cards, either Miss Scarlet or the library or the wrench, one of those at minimum must not be inside of the envelope. So I could add some knowledge, say knowledge.add. And I’m going to add an or clause, because I don’t know for sure which one it’s not, but I know one of them is not in the envelope. So it’s either not Scarlet, or it’s not the library, and or supports multiple arguments. I can say it’s also or not the wrench. So at least one of those needs a Scarlet library and wrench. At least one of those needs to be false. I don’t know which, though. Maybe it’s multiple. Maybe it’s just one, but at least one I know needs to hold. And so now if I rerun clue.py, I don’t actually have any additional information just yet. Nothing I can say conclusively. I still know that maybe it’s Professor Plum, maybe it’s Miss Scarlet. I haven’t eliminated any options. But let’s imagine that I get some more information, that someone shows me the Professor Plum card, for example. So I say, all right, let’s go back here, knowledge.add, not Plum. So I have the Professor Plum card. I know the Professor Plum is not in the middle. I rerun clue.py. And right now, I’m able to draw some conclusions. Now I’ve been able to eliminate Professor Plum, and the only person it could left remaining be is Miss Scarlet. So I know, yes, Miss Scarlet, this variable must be true. And I’ve been able to infer that based on the information I already had. Now between the ballroom and the library and the knife and the wrench, for those two, I’m still not sure. So let’s add one more piece of information. Let’s say that I know that it’s not the ballroom. Someone has shown me the ballroom card, so I know it’s not the ballroom. Which means at this point, I should be able to conclude that it’s the library. Let’s see. I’ll say knowledge.add, not the ballroom. And we’ll go ahead and run that. And it turns out that after all of this, not only can I conclude that I know that it’s the library, but I also know that the weapon was the knife. And that might have been an inference that was a little bit trickier, something I wouldn’t have realized immediately, but the AI, via this model checking algorithm, is able to draw that conclusion, that we know for sure that it must be Miss Scarlet in the library with the knife. And how did we know that? Well, we know it from this or clause up here, that we know that it’s either not Scarlet, or it’s not the library, or it’s not the wrench. And given that we know that it is Miss Scarlet, and we know that it is the library, then the only remaining option for the weapon is that it is not the wrench, which means that it must be the knife. So we as humans now can go back and reason through that, even though it might not have been immediately clear. And that’s one of the advantages of using an AI or some sort of algorithm in order to do this, is that the computer can exhaust all of these possibilities and try and figure out what the solution actually should be. And so for that reason, it’s often helpful to be able to represent knowledge in this way. Knowledge engineering, some situation where we can use a computer to be able to represent knowledge and draw conclusions based on that knowledge. And any time we can translate something into propositional logic symbols like this, this type of approach can be useful. So you might be familiar with logic puzzles, where you have to puzzle your way through trying to figure something out. This is what a classic logic puzzle might look like. Something like Gilderoy, Minerva, Pomona, and Horace each belong to a different one of the four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. And then we have some information. The Gilderoy belongs to Gryffindor or Ravenclaw, Pomona does not belong in Slytherin, and Minerva does belong to Gryffindor. So we have a couple pieces of information. And using that information, we need to be able to draw some conclusions about which person should be assigned to which house. And again, we can use the exact same idea to try and implement this notion. So we need some propositional symbols. And in this case, the propositional symbols are going to get a little more complex, although we’ll see ways to make this a little bit cleaner later on. But we’ll need 16 propositional symbols, one for each person and house. So we need to say, remember, every propositional symbol is either true or false. So Gilderoy Gryffindor is either true or false. Either he’s in Gryffindor or he is not. Likewise, Gilderoy Hufflepuff also true or false. Either it is true or it’s false. And that’s true for every combination of person and house that we could come up with. We have some sort of propositional symbol for each one of those. Using this type of knowledge, we can then begin to think about what types of logical sentences we can say about the puzzle. That if we know what will before even think about the information we were given, we can think about the premise of the problem, that every person is assigned to a different house. So what does that tell us? Well, it tells us sentences like this. It tells us like Pomona Slytherin implies not Pomona Hufflepuff. Something like if Pomona is in Slytherin, then we know that Pomona is not in Hufflepuff. And we know this for all four people and for all combinations of houses, that no matter what person you pick, if they’re in one house, then they’re not in some other house. So I’ll probably have a whole bunch of knowledge statements that are of this form, that if we know Pomona is in Slytherin, then we know Pomona is not in Hufflepuff. We were also given the information that each person is in a different house. So I also have pieces of knowledge that look something like this. Minerva Ravenclaw implies not Gilderoy Ravenclaw. If they’re all in different houses, then if Minerva is in Ravenclaw, then we know the Gilderoy is not in Ravenclaw as well. And I have a whole bunch of similar sentences like this that are expressing that idea for other people and other houses as well. And so in addition to sentences of these form, I also have the knowledge that was given to me. Information like Gilderoy was in Gryffindor or in Ravenclaw that would be represented like this, Gilderoy Gryffindor or Gilderoy Ravenclaw. And then using these sorts of sentences, I can begin to draw some conclusions about the world. So let’s see an example of this. We’ll go ahead and actually try and implement this logic puzzle to see if we can figure out what the answer is. I’ll go ahead and open up puzzle.py, where I’ve already started to implement this sort of idea. I’ve defined a list of people and a list of houses. And I’ve so far created one symbol for every person and for every house. That’s what this double four loop is doing, looping over all people, looping over all houses, creating a new symbol for each of them. And then I’ve added some information. I know that every person belongs to a house, so I’ve added the information for every person that person Gryffindor or person Hufflepuff or person Ravenclaw or person Slytherin, that one of those four things must be true. Every person belongs to a house. What other information do I know? I also know that only one house per person, so no person belongs to multiple houses. So how does this work? Well, this is going to be true for all people. So I’ll loop over every person. And then I need to loop over all different pairs of houses. The idea is I want to encode the idea that if Minerva is in Gryffindor, then Minerva can’t be in Ravenclaw. So I’ll loop over all houses, each one. And I’ll loop over all houses again, h2. And as long as they’re different, h1 not equal to h2, then I’ll add to my knowledge base this piece of information. That implication, in other words, an if then, if the person is in h1, then I know that they are not in house h2. So these lines here are encoding the notion that for every person, if they belong to house one, then they are not in house two. And the other piece of logic we need to encode is the idea that every house can only have one person. In other words, if Pomona is in Hufflepuff, then nobody else is allowed to be in Hufflepuff either. And that’s the same logic, but sort of backwards. I loop over all of the houses and loop over all different pairs of people. So I loop over people once, loop over people again, and only do this when the people are different, p1 not equal to p2. And I add the knowledge that if, as given by the implication, if person one belongs to the house, then it is not the case that person two belongs to the same house. So here I’m just encoding the knowledge that represents the problem’s constraints. I know that everyone’s in a different house. I know that any person can only belong to one house. And I can now take my knowledge and try and print out the information that I happen to know. So I’ll go ahead and print out knowledge.formula, just to see this in action, and I’ll go ahead and skip this for now. But we’ll come back to this in a second. Let’s print out the knowledge that I know by running Python puzzle.py. It’s a lot of information, a lot that I have to scroll through, because there are 16 different variables all going on. But the basic idea, if we scroll up to the very top, is I see my initial information. Gilderoy is either in Gryffindor, or Gilderoy is in Hufflepuff, or Gilderoy is in Ravenclaw, or Gilderoy is in Slytherin, and then way more information as well. So this is quite messy, more than we really want to be looking at. And soon, too, we’ll see ways of representing this a little bit more nicely using logic. But for now, we can just say these are the variables that we’re dealing with. And now we’d like to add some information. So the information we’re going to add is Gilderoy is in Gryffindor, or he is in Ravenclaw. So that knowledge was given to us. So I’ll go ahead and say knowledge.add. And I know that either or Gilderoy Gryffindor or Gilderoy Ravenclaw. One of those two things must be true. I also know that Pomona was not in Slytherin, so I can say knowledge.add not this symbol, not the Pomona-Slytherin symbol. And then I can add the knowledge that Minerva is in Gryffindor by adding the symbol Minerva Gryffindor. So those are the pieces of knowledge that I know. And this loop here at the bottom just loops over all of my symbols, checks to see if the knowledge entails that symbol by calling this model check function again. And if it does, if we know the symbol is true, we print out the symbol. So now I can run Python, puzzle.py, and Python is going to solve this puzzle for me. We’re able to conclude that Gilderoy belongs to Ravenclaw, Pomona belongs to Hufflepuff, Minerva to Gryffindor, and Horace to Slytherin just by encoding this knowledge inside the computer, although it was quite tedious to do in this case. And as a result, we were able to get the conclusion from that as well. And you can imagine this being applied to many sorts of different deductive situations. So not only these situations where we’re trying to deal with Harry Potter characters in this puzzle, but if you’ve ever played games like Mastermind, where you’re trying to figure out which order different colors go in and trying to make predictions about it, I could tell you, for example, let’s play a simplified version of Mastermind where there are four colors, red, blue, green, and yellow, and they’re in some order, but I’m not telling you what order. You just have to make a guess, and I’ll tell you of red, blue, green, and yellow how many of the four you got in the right position. So a simplified version of this game, you might make a guess like red, blue, green, yellow, and I would tell you something like two of those four are in the correct position, but the other two are not. And then you could reasonably make a guess and say, all right, look at this, blue, red, green, yellow. Try switching two of them around, and this time maybe I tell you, you know what, none of those are in the correct position. And the question then is, all right, what is the correct order of these four colors? And we as humans could begin to reason this through. All right, well, if none of these were correct, but two of these were correct, well, it must have been because I switched the red and the blue, which means red and blue here must be correct, which means green and yellow are probably not correct. You can begin to do this sort of deductive reasoning. And we can also equivalently try and take this and encode it inside of our computer as well. And it’s going to be very similar to the logic puzzle that we just did a moment ago. So I won’t spend too much time on this code because it is fairly similar. But again, we have a whole bunch of colors and four different positions in which those colors can be. And then we have some additional knowledge. And I encode all of that knowledge. And you can take a look at this code on your own time. But I just want to demonstrate that when we run this code, run python mastermind.py and run and see what we get, we ultimately are able to compute red 0 in the 0 position, blue in the 1 position, yellow in the 2 position, and green in the 3 position as the ordering of those symbols. Now, ultimately, what you might have noticed is this process was taking quite a long time. And in fact, model checking is not a particularly efficient algorithm, right? What I need to do in order to model check is take all of my possible different variables and enumerate all of the possibilities that they could be in. If I have n variables, I have 2 to the n possible worlds that I need to be looking through in order to perform this model checking algorithm. And this is probably not tractable, especially as we start to get to much larger and larger sets of data where you have many, many more variables that are at play. Right here, we only have a relatively small number of variables. So this sort of approach can actually work. But as the number of variables increases, model checking becomes less and less good of a way of trying to solve these sorts of problems. So while it might have been OK for something like Mastermind to conclude that this is indeed the correct sequence where all four are in the correct position, what we’d like to do is come up with some better ways to be able to make inferences rather than just enumerate all of the possibilities. And to do so, what we’ll transition to next is the idea of inference rules, some sort of rules that we can apply to take knowledge that already exists and translate it into new forms of knowledge. And the general way we’ll structure an inference rule is by having a horizontal line here. Anything above the line is going to represent a premise, something that we know to be true. And then anything below the line will be the conclusion that we can arrive at after we apply the logic from the inference rule that we’re going to demonstrate. So we’ll do some of these inference rules by demonstrating them in English first, but then translating them into the world of propositional logic so you can see what those inference rules actually look like. So for example, let’s imagine that I have access to two pieces of information. I know, for example, that if it is raining, then Harry is inside, for example. And let’s say I also know it is raining. Then most of us could reasonably then look at this information and conclude that, all right, Harry must be inside. This inference rule is known as modus ponens, and it’s phrased more formally in logic as this. If we know that alpha implies beta, in other words, if alpha, then beta, and we also know that alpha is true, then we should be able to conclude that beta is also true. We can apply this inference rule to take these two pieces of information and generate this new piece of information. Notice that this is a totally different approach from the model checking approach, where the approach was look at all of the possible worlds and see what’s true in each of these worlds. Here, we’re not dealing with any specific world. We’re just dealing with the knowledge that we know and what conclusions we can arrive at based on that knowledge. That I know that A implies B, and I know A, and the conclusion is B. And this should seem like a relatively obvious rule. But of course, if alpha, then beta, and we know alpha, then we should be able to conclude that beta is also true. And that’s going to be true for many, but maybe even all of the inference rules that we’ll take a look at. You should be able to look at them and say, yeah, of course that’s going to be true. But it’s putting these all together, figuring out the right combination of inference rules that can be applied that ultimately is going to allow us to generate interesting knowledge inside of our AI. So that’s modus ponensis application of implication, that if we know alpha and we know that alpha implies beta, then we can conclude beta. Let’s take a look at another example. Fairly straightforward, something like Harry is friends with Ron and Hermione. Based on that information, we can reasonably conclude Harry is friends with Hermione. That must also be true. And this inference rule is known as and elimination. And what and elimination says is that if we have a situation where alpha and beta are both true, I have information alpha and beta, well then, just alpha is true. Or likewise, just beta is true. That if I know that both parts are true, then one of those parts must also be true. Again, something obvious from the point of view of human intuition, but a computer needs to be told this kind of information. To be able to apply the inference rule, we need to tell the computer that this is an inference rule that you can apply, so the computer has access to it and is able to use it in order to translate information from one form to another. In addition to that, let’s take a look at another example of an inference rule, something like it is not true that Harry did not pass the test. Bit of a tricky sentence to parse. I’ll read it again. It is not true, or it is false, that Harry did not pass the test. Well, if it is false that Harry did not pass the test, then the only reasonable conclusion is that Harry did pass the test. And so this, instead of being and elimination, is what we call double negation elimination. That if we have two negatives inside of our premise, then we can just remove them altogether. They cancel each other out. One turns true to false, and the other one turns false back into true. Phrased a little bit more formally, we say that if the premise is not alpha, then the conclusion we can draw is just alpha. We can say that alpha is true. We’ll take a look at a couple more of these. If I have it is raining, then Harry is inside. How do I reframe this? Well, this one is a little bit trickier. But if I know if it is raining, then Harry is inside, then I conclude one of two things must be true. Either it is not raining, or Harry is inside. Now, this one’s trickier. So let’s think about it a little bit. This first premise here, if it is raining, then Harry is inside, is saying that if I know that it is raining, then Harry must be inside. So what is the other possible case? Well, if Harry is not inside, then I know that it must not be raining. So one of those two situations must be true. Either it’s not raining, or it is raining, in which case Harry is inside. So the conclusion I can draw is either it is not raining, or it is raining, so therefore, Harry is inside. And so this is a way to translate if-then statements into or statements. And this is known as implication elimination. And this is similar to what we actually did in the beginning when we were first looking at those very first sentences about Harry and Hagrid and Dumbledore. And phrased a little bit more formally, this says that if I have the implication, alpha implies beta, that I can draw the conclusion that either not alpha or beta, because there are only two possibilities. Either alpha is true or alpha is not true. So one of those possibilities is alpha is not true. But if alpha is true, well, then we can draw the conclusion that beta must be true. So either alpha is not true or alpha is true, in which case beta is also true. So this is one way to turn an implication into just a statement about or. In addition to eliminating implications, we can also eliminate biconditionals as well. So let’s take an English example, something like, it is raining if and only if Harry is inside. And this if and only if really sounds like that biconditional, that double arrow sign that we saw in propositional logic not too long ago. And what does this actually mean if we were to translate this? Well, this means that if it is raining, then Harry is inside. And if Harry is inside, then it is raining, that this implication goes both ways. And this is what we would call biconditional elimination, that I can take a biconditional, a if and only if b, and translate that into something like this, a implies b, and b implies a. So many of these inference rules are taking logic that uses certain symbols and turning them into different symbols, taking an implication and turning it into an or, or taking a biconditional and turning it into implication. And another example of it would be something like this. It is not true that both Harry and Ron passed the test. Well, all right, how do we translate that? What does that mean? Well, if it is not true that both of them passed the test, well, then the reasonable conclusion we might draw is that at least one of them didn’t pass the test. So the conclusion is either Harry did not pass the test or Ron did not pass the test, or both. This is not an exclusive or. But if it is true that it is not true that both Harry and Ron passed the test, well, then either Harry didn’t pass the test or Ron didn’t pass the test. And this type of law is one of De Morgan’s laws. Quite famous in logic where the idea is that we can turn an and into an or. We can say we can take this and that both Harry and Ron passed the test and turn it into an or by moving the nots around. So if it is not true that Harry and Ron passed the test, well, then either Harry did not pass the test or Ron did not pass the test either. And the way we frame that more formally using logic is to say this. If it is not true that alpha and beta, well, then either not alpha or not beta. The way I like to think about this is that if you have a negation in front of an and expression, you move the negation inwards, so to speak, moving the negation into each of these individual sentences and then flip the and into an or. So the negation moves inwards and the and flips into an or. So I go from not a and b to not a or not b. And there’s actually a reverse of De Morgan’s law that goes in the other direction for something like this. If I say it is not true that Harry or Ron passed the test, meaning neither of them passed the test, well, then the conclusion I can draw is that Harry did not pass the test and Ron did not pass the test. So in this case, instead of turning an and into an or, we’re turning an or into an and. But the idea is the same. And this, again, is another example of De Morgan’s laws. And the way that works is that if I have not a or b this time, the same logic is going to apply. I’m going to move the negation inwards. And I’m going to flip this time, flip the or into an and. So if not a or b, meaning it is not true that a or b or alpha or beta, then I can say not alpha and not beta, moving the negation inwards in order to make that conclusion. So those are De Morgan’s laws and a couple other inference rules that are worth just taking a look at. One is the distributive law that works this way. So if I have alpha and beta or gamma, well, then much in the same way that you can use in math, use distributive laws to distribute operands like addition and multiplication, I can do a similar thing here, where I can say if alpha and beta or gamma, then I can say something like alpha and beta or alpha and gamma, that I’ve been able to distribute this and sign throughout this expression. So this is an example of the distributive property or the distributive law as applied to logic in much the same way that you would distribute a multiplication over the addition of something, for example. This works the other way too. So if, for example, I have alpha or beta and gamma, I can distribute the or throughout the expression. I can say alpha or beta and alpha or gamma. So the distributive law works in that way too. And it’s helpful if I want to take an or and move it into the expression. And we’ll see an example soon of why it is that we might actually care to do something like that. All right, so now we’ve seen a lot of different inference rules. And the question now is, how can we use those inference rules to actually try and draw some conclusions, to actually try and prove something about entailment, proving that given some initial knowledge base, we would like to find some way to prove that a query is true? Well, one way to think about it is actually to think back to what we talked about last time when we talked about search problems. Recall again that search problems have some sort of initial state. They have actions that you can take from one state to another as defined by a transition model that tells you how to get from one state to another. We talked about testing to see if you were at a goal. And then some path cost function to see how many steps did you have to take or how costly was the solution that you found. Now that we have these inference rules that take some set of sentences in propositional logic and get us some new set of sentences in propositional logic, we can actually treat those sentences or those sets of sentences as states inside of a search problem. So if we want to prove that some query is true, prove that some logical theorem is true, we can treat theorem proving as a form of a search problem. I can say that we begin in some initial state, where that initial state is the knowledge base that I begin with, the set of all of the sentences that I know to be true. What actions are available to me? Well, the actions are any of the inference rules that I can apply at any given time. The transition model just tells me after I apply the inference rule, here is the new set of all of the knowledge that I have, which will be the old set of knowledge, plus some additional inference that I’ve been able to draw, much as in the same way we saw what we got when we applied those inference rules and got some sort of conclusion. That conclusion gets added to our knowledge base, and our transition model will encode that. What is the goal test? Well, our goal test is checking to see if we have proved the statement we’re trying to prove, if the thing we’re trying to prove is inside of our knowledge base. And the path cost function, the thing we’re trying to minimize, is maybe the number of inference rules that we needed to use, the number of steps, so to speak, inside of our proof. And so here we’ve been able to apply the same types of ideas that we saw last time with search problems to something like trying to prove something about knowledge by taking our knowledge and framing it in terms that we can understand as a search problem with an initial state, with actions, with a transition model. So this shows a couple of things, one being how versatile search problems are, that they can be the same types of algorithms that we use to solve a maze or figure out how to get from point A to point B inside of driving directions, for example, can also be used as a theorem proving method of taking some sort of starting knowledge base and trying to prove something about that knowledge. So this, yet again, is a second way, in addition to model checking, to try and prove that certain statements are true. But it turns out there’s yet another way that we can try and apply inference. And we’ll talk about this now, which is not the only way, but certainly one of the most common, which is known as resolution. And resolution is based on another inference rule that we’ll take a look at now, quite a powerful inference rule that will let us prove anything that can be proven about a knowledge base. And it’s based on this basic idea. Let’s say I know that either Ron is in the Great Hall or Hermione is in the library. And let’s say I also know that Ron is not in the Great Hall. Based on those two pieces of information, what can I conclude? Well, I could pretty reasonably conclude that Hermione must be in the library. How do I know that? Well, it’s because these two statements, these two what we’ll call complementary literals, literals that complement each other, they’re opposites of each other, seem to conflict with each other. This sentence tells us that either Ron is in the Great Hall or Hermione is in the library. So if we know that Ron is not in the Great Hall, that conflicts with this one, which means Hermione must be in the library. And this we can frame as a more general rule known as the unit resolution rule, a rule that says that if we have p or q and we also know not p, well then from that we can reasonably conclude q. That if p or q are true and we know that p is not true, the only possibility is for q to then be true. And this, it turns out, is quite a powerful inference rule in terms of what it can do, in part because we can quickly start to generalize this rule. This q right here doesn’t need to just be a single propositional symbol. It could be multiple, all chained together in a single clause, as we’ll call it. So if I had something like p or q1 or q2 or q3, so on and so forth, up until qn, so I had n different other variables, and I have not p, well then what happens when these two complement each other is that these two clauses resolve, so to speak, to produce a new clause that is just q1 or q2 all the way up to qn. And in an or, the order of the arguments in the or doesn’t actually matter. The p doesn’t need to be the first thing. It could have been in the middle. But the idea here is that if I have p in one clause and not p in the other clause, well then I know that one of these remaining things must be true. I’ve resolved them in order to produce a new clause. But it turns out we can generalize this idea even further, in fact, and display even more power that we can have with this resolution rule. So let’s take another example. Let’s say, for instance, that I know the same piece of information that either Ron is in the Great Hall or Hermione is in the library. And the second piece of information I know is that Ron is not in the Great Hall or Harry is sleeping. So it’s not just a single piece of information. I have two different clauses. And we’ll define clauses more precisely in just a moment. What do I know here? Well again, for any propositional symbol like Ron is in the Great Hall, there are only two possibilities. Either Ron is in the Great Hall, in which case, based on resolution, we know that Harry must be sleeping, or Ron is not in the Great Hall, in which case we know based on the same rule that Hermione must be in the library. Based on those two things in combination, I can say based on these two premises that I can conclude that either Hermione is in the library or Harry is sleeping. So again, because these two conflict with each other, I know that one of these two must be true. And you can take a closer look and try and reason through that logic. Make sure you convince yourself that you believe this conclusion. Stated more generally, we can name this resolution rule by saying that if we know p or q is true, and we also know that not p or r is true, we resolve these two clauses together to get a new clause, q or r, that either q or r must be true. And again, much as in the last case, q and r don’t need to just be single propositional symbols. It could be multiple symbols. So if I had a rule that had p or q1 or q2 or q3, so on and so forth, up until qn, where n is just some number. And likewise, I had not p or r1 or r2, so on and so forth, up until rm, where m, again, is just some other number. I can resolve these two clauses together to get one of these must be true, q1 or q2 up until qn or r1 or r2 up until rm. And this is just a generalization of that same rule we saw before. Each of these things here are what we’re going to call a clause, where a clause is formally defined as a disjunction of literals, where a disjunction means it’s a bunch of things that are connected with or. Disjunction means things connected with or. Conjunction, meanwhile, is things connected with and. And a literal is either a propositional symbol or the opposite of a propositional symbol. So it’s something like p or q or not p or not q. Those are all propositional symbols or not of the propositional symbols. And we call those literals. And so a clause is just something like this, p or q or r, for example. Meanwhile, what this gives us an ability to do is it gives us an ability to turn logic, any logical sentence, into something called conjunctive normal form. A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses. Recall, again, conjunction means things are connected to one another using and. And so a conjunction of clauses means it is an and of individual clauses, each of which has ors in it. So something like this, a or b or c, and d or not e, and f or g. Everything in parentheses is one clause. All of the clauses are connected to each other using an and. And everything in the clause is separated using an or. And this is just a standard form that we can translate a logical sentence into that just makes it easy to work with and easy to manipulate. And it turns out that we can take any sentence in logic and turn it into conjunctive normal form just by applying some inference rules and transformations to it. So we’ll take a look at how we can actually do that. So what is the process for taking a logical formula and converting it into conjunctive normal form, otherwise known as c and f? Well, the process looks a little something like this. We need to take all of the symbols that are not part of conjunctive normal form. The bi-conditionals and the implications and so forth, and turn them into something that is more closely like conjunctive normal form. So the first step will be to eliminate bi-conditionals, those if and only if double arrows. And we know how to eliminate bi-conditionals because we saw there was an inference rule to do just that. Any time I have an expression like alpha if and only if beta, I can turn that into alpha implies beta and beta implies alpha based on that inference rule we saw before. Likewise, in addition to eliminating bi-conditionals, I can eliminate implications as well, the if then arrows. And I can do that using the same inference rule we saw before too, taking alpha implies beta and turning that into not alpha or beta because that is logically equivalent to this first thing here. Then we can move knots inwards because we don’t want knots on the outsides of our expressions. Conjunctive normal form requires that it’s just claws and claws and claws and claws. Any knots need to be immediately next to propositional symbols. But we can move those knots around using De Morgan’s laws by taking something like not A and B and turn it into not A or not B, for example, using De Morgan’s laws to manipulate that. And after that, all we’ll be left with are ands and ors. And those are easy to deal with. We can use the distributive law to distribute the ors so that the ors end up on the inside of the expression, so to speak, and the ands end up on the outside. So this is the general pattern for how we’ll take a formula and convert it into conjunctive normal form. And let’s now take a look at an example of how we would do this and explore then why it is that we would want to do something like this. Here’s how we can do it. Let’s take this formula, for example. P or Q implies R. And I’d like to convert this into conjunctive normal form, where it’s all ands of clauses, and every clause is a disjunctive clause. It’s ors together. So what’s the first thing I need to do? Well, this is an implication. So let me go ahead and remove that implication. Using the implication inference rule, I can turn P or Q into P or Q implies R into not P or Q or R. So that’s the first step. I’ve gotten rid of the implication. And next, I can get rid of the not on the outside of this expression, too. I can move the nots inwards so they’re closer to the literals themselves by using De Morgan’s laws. And De Morgan’s law says that not P or Q is equivalent to not P and not Q. Again, here, just applying the inference rules that we’ve already seen in order to translate these statements. And now, I have two things that are separated by an or, where this thing on the inside is an and. What I’d really like to move the ors so the ors are on the inside, because conjunctive normal form means I need clause and clause and clause and clause. And so to do that, I can use the distributive law. If I have not P and not Q or R, I can distribute the or R to both of these to get not P or R and not Q or R using the distributive law. And this now here at the bottom is in conjunctive normal form. It is a conjunction and and of disjunctions of clauses that just are separated by ors. So this process can be used by any formula to take a logical sentence and turn it into this conjunctive normal form, where I have clause and clause and clause and clause and clause and so on. So why is this helpful? Why do we even care about taking all these sentences and converting them into this form? It’s because once they’re in this form where we have these clauses, these clauses are the inputs to the resolution inference rule that we saw a moment ago, that if I have two clauses where there’s something that conflicts or something complementary between those two clauses, I can resolve them to get a new clause, to draw a new conclusion. And we call this process inference by resolution, using the resolution rule to draw some sort of inference. And it’s based on the same idea, that if I have P or Q, this clause, and I have not P or R, that I can resolve these two clauses together to get Q or R as the resulting clause, a new piece of information that I didn’t have before. Now, a couple of key points that are worth noting about this before we talk about the actual algorithm. One thing is that, let’s imagine we have P or Q or S, and I also have not P or R or S. The resolution rule says that because this P conflicts with this not P, we would resolve to put everything else together to get Q or S or R or S. But it turns out that this double S is redundant, or S here and or S there. It doesn’t change the meaning of the sentence. So in resolution, when we do this resolution process, we’ll usually also do a process known as factoring, where we take any duplicate variables that show up and just eliminate them. So Q or S or R or S just becomes Q or R or S. The S only needs to appear once, no need to include it multiple times. Now, one final question worth considering is what happens if I try to resolve P and not P together? If I know that P is true and I know that not P is true, well, resolution says I can merge these clauses together and look at everything else. Well, in this case, there is nothing else, so I’m left with what we might call the empty clause. I’m left with nothing. And the empty clause is always false. The empty clause is equivalent to just being false. And that’s pretty reasonable because it’s impossible for both P and not P to both hold at the same time. P is either true or it’s not true, which means that if P is true, then this must be false. And if this is true, then this must be false. There is no way for both of these to hold at the same time. So if ever I try and resolve these two, it’s a contradiction, and I’ll end up getting this empty clause where the empty clause I can call equivalent to false. And this idea that if I resolve these two contradictory terms, I get the empty clause, this is the basis for our inference by resolution algorithm. Here’s how we’re going to perform inference by resolution at a very high level. We want to prove that our knowledge base entails some query alpha, that based on the knowledge we have, we can prove conclusively that alpha is going to be true. How are we going to do that? Well, in order to do that, we’re going to try to prove that if we know the knowledge and not alpha, that that would be a contradiction. And this is a common technique in computer science more generally, this idea of proving something by contradiction. If I want to prove that something is true, I can do so by first assuming that it is false and showing that it would be contradictory, showing that it leads to some contradiction. And if the thing I’m trying to prove, if when I assume it’s false, leads to a contradiction, then it must be true. And that’s the logical approach or the idea behind a proof by contradiction. And that’s what we’re going to do here. We want to prove that this query alpha is true. So we’re going to assume that it’s not true. We’re going to assume not alpha. And we’re going to try and prove that it’s a contradiction. If we do get a contradiction, well, then we know that our knowledge entails the query alpha. If we don’t get a contradiction, there is no entailment. This is this idea of a proof by contradiction of assuming the opposite of what you’re trying to prove. And if you can demonstrate that that’s a contradiction, then what you’re proving must be true. But more formally, how do we actually do this? How do we check that knowledge base and not alpha is going to lead to a contradiction? Well, here is where resolution comes into play. To determine if our knowledge base entails some query alpha, we’re going to convert knowledge base and not alpha to conjunctive normal form, that form where we have a whole bunch of clauses that are all anded together. And when we have these individual clauses, now we can keep checking to see if we can use resolution to produce a new clause. We can take any pair of clauses and check, is there some literal that is the opposite of each other or complementary to each other in both of them? For example, I have a p in one clause and a not p in another clause. Or an r in one clause and a not r in another clause. If ever I have that situation where once I convert to conjunctive normal form and I have a whole bunch of clauses, I see two clauses that I can resolve to produce a new clause, then I’ll do so. This process occurs in a loop. I’m going to keep checking to see if I can use resolution to produce a new clause and keep using those new clauses to try to generate more new clauses after that. Now, it just so may happen that eventually we may produce the empty clause, the clause we were talking about before. If I resolve p and not p together, that produces the empty clause and the empty clause we know to be false. Because we know that there’s no way for both p and not p to both simultaneously be true. So if ever we produce the empty clause, then we have a contradiction. And if we have a contradiction, that’s exactly what we were trying to do in a fruit by contradiction. If we have a contradiction, then we know that our knowledge base must entail this query alpha. And we know that alpha must be true. And it turns out, and we won’t go into the proof here, but you can show that otherwise, if you don’t produce the empty clause, then there is no entailment. If we run into a situation where there are no more new clauses to add, we’ve done all the resolution that we can do, and yet we still haven’t produced the empty clause, then there is no entailment in this case. And this now is the resolution algorithm. And it’s very abstract looking, especially this idea of like, what does it even mean to have the empty clause? So let’s take a look at an example, actually try and prove some entailment by using this inference by resolution process. So here’s our question. We have this knowledge base. Here is the knowledge that we know, A or B, and not B or C, and not C. And we want to know if all of this entails A. So this is our knowledge base here, this whole log thing. And our query alpha is just this propositional symbol, A. So what do we do? Well, first, we want to prove by contradiction. So we want to first assume that A is false, and see if that leads to some sort of contradiction. So here is what we’re going to start with, A or B, and not B or C, and not C. This is our knowledge base. And we’re going to assume not A. We’re going to assume that the thing we’re trying to prove is, in fact, false. And so this is now in conjunctive normal form, and I have four different clauses. I have A or B. I have not B or C. I have not C, and I have not A. And now, I can begin to just pick two clauses that I can resolve, and apply the resolution rule to them. And so looking at these four clauses, I see, all right, these two clauses are ones I can resolve. I can resolve them because there are complementary literals that show up in them. There’s a C here, and a not C here. So just looking at these two clauses, if I know that not B or C is true, and I know that C is not true, well, then I can resolve these two clauses to say, all right, not B, that must be true. I can generate this new clause as a new piece of information that I now know to be true. And all right, now I can repeat this process, do the process again. Can I use resolution again to get some new conclusion? Well, it turns out I can. I can use that new clause I just generated, along with this one here. There are complementary literals. This B is complementary to, or conflicts with, this not B over here. And so if I know that A or B is true, and I know that B is not true, well, then the only remaining possibility is that A must be true. So now we have A. That is a new clause that I’ve been able to generate. And now, I can do this one more time. I’m looking for two clauses that can be resolved, and you might programmatically do this by just looping over all possible pairs of clauses and checking for complementary literals in each. And here, I can say, all right, I found two clauses, not A and A, that conflict with each other. And when I resolve these two together, well, this is the same as when we were resolving P and not P from before. When I resolve these two clauses together, I get rid of the As, and I’m left with the empty clause. And the empty clause we know to be false, which means we have a contradiction, which means we can safely say that this whole knowledge base does entail A. That if this sentence is true, that we know that A for sure is also true. So this now, using inference by resolution, is an entirely different way to take some statement and try and prove that it is, in fact, true. Instead of enumerating all of the possible worlds that we might be in in order to try to figure out in which cases is the knowledge base true and in which cases are query true, instead we use this resolution algorithm to say, let’s keep trying to figure out what conclusions we can draw and see if we reach a contradiction. And if we reach a contradiction, then that tells us something about whether our knowledge actually entails the query or not. And it turns out there are many different algorithms that can be used for inference. What we’ve just looked at here are just a couple of them. And in fact, all of this is just based on one particular type of logic. It’s based on propositional logic, where we have these individual symbols and we connect them using and and or and not and implies and by conditionals. But propositional logic is not the only kind of logic that exists. And in fact, we see that there are limitations that exist in propositional logic, especially as we saw in examples like with the mastermind example or with the example with the logic puzzle where we had different Hogwarts house people that belong to different houses and we were trying to figure out who belonged to which houses. There were a lot of different propositional symbols that we needed in order to represent some fairly basic ideas. So now is the final topic that we’ll take a look at just before we end class today is one final type of logic different from propositional logic known as first order logic, which is a little bit more powerful than propositional logic and is going to make it easier for us to express certain types of ideas. In propositional logic, if we think back to that puzzle with the people in the Hogwarts houses, we had a whole bunch of symbols. And every symbol could only be true or false. We had a symbol for Minerva Gryffindor, which was either true of Minerva within Gryffindor and false otherwise, and likewise for Minerva Hufflepuff and Minerva Ravenclaw and Minerva Slytherin and so forth. But this was starting to get quite redundant. We wanted some way to be able to express that there is a relationship between these propositional symbols, that Minerva shows up in all of them. And also, I would have liked to have not have had so many different symbols to represent what really was a fairly straightforward problem. So first order logic will give us a different way of trying to deal with this idea by giving us two different types of symbols. We’re going to have constant symbols that are going to represent objects like people or houses. And then predicate symbols, which you can think of as relations or functions that take an input and evaluate them to true or false, for example, that tell us whether or not some property of some constant or some pair of constants or multiple constants actually holds. So we’ll see an example of that in just a moment. For now, in this same problem, our constant symbols might be objects, things like people or houses. So Minerva, Pomona, Horace, Gilderoy, those are all constant symbols, as are my four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. Predicates, meanwhile, these predicate symbols are going to be properties that might hold true or false of these individual constants. So person might hold true of Minerva, but it would be false for Gryffindor because Gryffindor is not a person. And house is going to hold true for Ravenclaw, but it’s not going to hold true for Horace, for example, because Horace is a person. And belongs to, meanwhile, is going to be some relation that is going to relate people to their houses. And it’s going to only tell me when someone belongs to a house or does not. So let’s take a look at some examples of what a sentence in first order logic might actually look like. A sentence might look like something like this. Person Minerva, with Minerva in parentheses, and person being a predicate symbol, Minerva being a constant symbol. This sentence in first order logic effectively means Minerva is a person, or the person property applies to the Minerva object. So if I want to say something like Minerva is a person, here is how I express that idea using first order logic. Meanwhile, I can say something like, house Gryffindor, to likewise express the idea that Gryffindor is a house. I can do that this way. And all of the same logical connectives that we saw in propositional logic, those are going to work here too. And or implication by conditional not. In fact, I can use not to say something like, not house Minerva. And this sentence in first order logic means something like, Minerva is not a house. It is not true that the house property applies to Minerva. Meanwhile, in addition to some of these predicate symbols that just take a single argument, some of our predicate symbols are going to express binary relations, relations between two of its arguments. So I could say something like, belongs to, and then two inputs, Minerva and Gryffindor, to express the idea that Minerva belongs to Gryffindor. And so now here’s the key difference, or one of the key differences, between this and propositional logic. In propositional logic, I needed one symbol for Minerva Gryffindor, and one symbol for Minerva Hufflepuff, and one symbol for all the other people’s Gryffindor and Hufflepuff variables. In this case, I just need one symbol for each of my people, and one symbol for each of my houses. And then I can express as a predicate something like, belongs to, and say, belongs to Minerva Gryffindor, to express the idea that Minerva belongs to Gryffindor House. So already we can see that first order logic is quite expressive in being able to express these sorts of sentences using the existing constant symbols and predicates that already exist, while minimizing the number of new symbols that I need to create. I can just use eight symbols for people for houses, instead of 16 symbols for every possible combination of each. But first order logic gives us a couple of additional features that we can use to express even more complex ideas. And these more additional features are generally known as quantifiers. And there are two main quantifiers in first order logic, the first of which is universal quantification. Universal quantification lets me express an idea like something is going to be true for all values of a variable. Like for all values of x, some statement is going to hold true. So what might a sentence in universal quantification look like? Well, we’re going to use this upside down a to mean for all. So upside down ax means for all values of x, where x is any object, this is going to hold true. Belongs to x Gryffindor implies not belongs to x Hufflepuff. So let’s try and parse this out. This means that for all values of x, if this holds true, if x belongs to Gryffindor, then this does not hold true. x does not belong to Hufflepuff. So translated into English, this sentence is saying something like for all objects x, if x belongs to Gryffindor, then x does not belong to Hufflepuff, for example. Or a phrase even more simply, anyone in Gryffindor is not in Hufflepuff, simplified way of saying the same thing. So this universal quantification lets us express an idea like something is going to hold true for all values of a particular variable. In addition to universal quantification though, we also have existential quantification. Whereas universal quantification said that something is going to be true for all values of a variable, existential quantification says that some expression is going to be true for some value of a variable, at least one value of the variable. So let’s take a look at a sample sentence using existential quantification. One such sentence looks like this. There exists an x. This backwards e stands for exists. And here we’re saying there exists an x such that house x and belongs to Minerva x. In other words, there exists some object x where x is a house and Minerva belongs to x. Or phrased a little more succinctly in English, I’m here just saying Minerva belongs to a house. There’s some object that is a house and Minerva belongs to a house. And combining this universal and existential quantification, we can create far more sophisticated logical statements than we were able to just using propositional logic. I could combine these to say something like this. For all x, person x implies there exists a y such that house y and belongs to xy. All right. So a lot of stuff going on there, a lot of symbols. Let’s try and parse it out and just understand what it’s saying. Here we’re saying that for all values of x, if x is a person, then this is true. So in other words, I’m saying for all people, and we call that person x, this statement is going to be true. What statement is true of all people? Well, there exists a y that is a house, so there exists some house, and x belongs to y. In other words, I’m saying that for all people out there, there exists some house such that x, the person, belongs to y, the house. This is phrased more succinctly. I’m saying that every person belongs to a house, that for all x, if x is a person, then there exists a house that x belongs to. And so we can now express a lot more powerful ideas using this idea now of first order logic. And it turns out there are many other kinds of logic out there. There’s second order logic and other higher order logic, each of which allows us to express more and more complex ideas. But all of it, in this case, is really in pursuit of the same goal, which is the representation of knowledge. We want our AI agents to be able to know information, to represent that information, whether that’s using propositional logic or first order logic or some other logic, and then be able to reason based on that, to be able to draw conclusions, make inferences, figure out whether there’s some sort of entailment relationship, as by using some sort of inference algorithm, something like inference by resolution or model checking or any number of these other algorithms that we can use in order to take information that we know and translate it to additional conclusions. So all of this has helped us to create AI that is able to represent information about what it knows and what it doesn’t know. Next time, though, we’ll take a look at how we can make our AI even more powerful by not just encoding information that we know for sure to be true and not to be true, but also to take a look at uncertainty, to look at what happens if AI thinks that something might be probable or maybe not very probable or somewhere in between those two extremes, all in the pursuit of trying to build our intelligent systems to be even more intelligent. We’ll see you next time. Thank you. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. And last time, we took a look at how it is that AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages. And the idea was we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to be able to derive new pieces of information by inference, to be able to take some information and deduce some additional conclusions based on the information that it already knew for sure. But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes, there’s going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where it might believe something with some probability, as we’ll soon discuss what probability is all about and what it means, but not entirely for certain. And we want to use the information that it has some knowledge about, even if it doesn’t have perfect knowledge, to still be able to make inferences, still be able to draw conclusions. So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what’s around it, but it does have access to some data that can allow it to draw inferences with some probability. There’s some likelihood that one thing is true or another. Or you can imagine in context where there is a little bit more randomness and uncertainty, something like predicting the weather, where you might not be able to know for sure what tomorrow’s weather is with 100% certainty, but you can probably infer with some probability what tomorrow’s weather is going to be based on maybe today’s weather and yesterday’s weather and other data that you might have access to as well. And so oftentimes, we can distill this in terms of just possible events that might happen and what the likelihood of those events are. This comes a lot in games, for example, where there is an element of chance inside of those games. So you imagine rolling a dice. You’re not sure exactly what the die roll is going to be, but you know it’s going to be one of these possibilities from 1 to 6, for example. And so here now, we introduce the idea of probability theory. And what we’ll take a look at today is beginning by looking at the mathematical foundations of probability theory, getting an understanding for some of the key concepts within probability, and then diving into how we can use probability and the ideas that we look at mathematically to represent some ideas in terms of models that we can put into our computers in order to program an AI that is able to use information about probability to draw inferences, to make some judgments about the world with some probability or likelihood of being true. So probability ultimately boils down to this idea that there are possible worlds that we’re here representing using this little Greek letter omega. And the idea of a possible world is that when I roll a die, there are six possible worlds that could result from it. I could roll a 1, or a 2, or a 3, or a 4, or a 5, or a 6. And each of those are a possible world. And each of those possible worlds has some probability of being true, the probability that I do roll a 1, or a 2, or a 3, or something else. And we represent that probability like this, using the capital letter P. And then in parentheses, what it is that we want the probability of. So this right here would be the probability of some possible world as represented by the little letter omega. Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between 0 and 1 inclusive. So the smallest value any probability can have is the number 0, which is an impossible event. Something like I roll a die, and the die is a 7 is the roll that I get. If the die only has numbers 1 through 6, the event that I roll a 7 is impossible, so it would have probability 0. And on the other end of the spectrum, probability can range all the way up to the positive number 1, meaning an event is certain to happen, that I roll a die and the number is less than 10, for example. That is an event that is guaranteed to happen if the only sides on my die are 1 through 6, for instance. And then they can range through any real number in between these two values. Where, generally speaking, a higher value for the probability means an event is more likely to take place, and a lower value for the probability means the event is less likely to take place. And the other key rule for probability looks a little bit like this. This sigma notation, if you haven’t seen it before, refers to summation, the idea that we’re going to be adding up a whole sequence of values. And this sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we’re adding up a whole bunch of individual values or individual probabilities to get some other value. So we’ll see this come up a couple of times. But what this notation means is that if I sum up all of the possible worlds omega that are in big omega, which represents the set of all the possible worlds, meaning I take for all of the worlds in the set of possible worlds and add up all of their probabilities, what I ultimately get is the number 1. So if I take all the possible worlds, add up what each of their probabilities is, I should get the number 1 at the end, meaning all probabilities just need to sum to 1. So for example, if I take dice, for example, and if you imagine I have a fair die with numbers 1 through 6 and I roll the die, each one of these rolls has an equal probability of taking place. And the probability is 1 over 6, for example. So each of these probabilities is between 0 and 1, 0 meaning impossible and 1 meaning for certain. And if you add up all of these probabilities for all of the possible worlds, you get the number 1. And we can represent any one of those probabilities like this. The probability that we roll the number 2, for example, is just 1 over 6. Every six times we roll the die, we’d expect that one time, for instance, the die might come up as a 2. Its probability is not certain, but it’s a little more than nothing, for instance. And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little bit more complex. Let’s imagine now that we’re not just dealing with a single die, but we have two dice, for example. I have a red die here and a blue die there, and I care not just about what the individual roll is, but I care about the sum of the two rolls. In this case, the sum of the two rolls is the number 3. How do I begin to now reason about what does the probability look like if instead of having one die, I now have two dice? Well, what we might imagine is that we could first consider what are all of the possible worlds. And in this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. For the red die, it could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6. And for each of those possibilities, the blue die, likewise, could also be either 1 or 2 or 3 or 4 or 5 or 6. And it just so happens that in this particular case, each of these possible combinations is equally likely. Equally likely are all of these various different possible worlds. That’s not always going to be the case. If you imagine more complex models that we could try to build and things that we could try to represent in the real world, it’s probably not going to be the case that every single possible world is always equally likely. But in the case of fair dice, where in any given die roll, any one number has just as good a chance of coming up as any other number, we can consider all of these possible worlds to be equally likely. But even though all of the possible worlds are equally likely, that doesn’t necessarily mean that their sums are equally likely. So if we consider what the sum is of all of these two, so 1 plus 1, that’s a 2. 2 plus 1 is a 3. And consider for each of these possible pairs of numbers what their sum ultimately is, we can notice that there are some patterns here, where it’s not entirely the case that every number comes up equally likely. If you consider 7, for example, what’s the probability that when I roll two dice, their sum is 7? There are several ways this can happen. There are six possible worlds where the sum is 7. It could be a 1 and a 6, or a 2 and a 5, or a 3 and a 4, a 4 and a 3, and so forth. But if you instead consider what’s the probability that I roll two dice, and the sum of those two die rolls is 12, for example, we’re looking at this diagram, there’s only one possible world in which that can happen. And that’s the possible world where both the red die and the blue die both come up as sixes to give us a sum total of 12. So based on just taking a look at this diagram, we see that some of these probabilities are likely different. The probability that the sum is a 7 must be greater than the probability that the sum is a 12. And we can represent that even more formally by saying, OK, the probability that we sum to 12 is 1 out of 36. Out of the 36 equally likely possible worlds, 6 squared because we have six options for the red die and six options for the blue die, out of those 36 options, only one of them sums to 12. Whereas on the other hand, the probability that if we take two dice rolls and they sum up to the number 7, well, out of those 36 possible worlds, there were six worlds where the sum was 7. And so we get 6 over 36, which we can simplify as a fraction to just 1 over 6. So here now, we’re able to represent these different ideas of probability, representing some events that might be more likely and then other events that are less likely as well. And these sorts of judgments, where we’re figuring out just in the abstract what is the probability that this thing takes place, are generally known as unconditional probabilities. Some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence. Without knowing any additional information, if I roll a die, what’s the chance it comes up as a 2? Or if I roll two dice, what’s the chance that the sum of those two die rolls is a 7? But usually when we’re thinking about probability, especially when we’re thinking about training in AI to intelligently be able to know something about the world and make predictions based on that information, it’s not unconditional probability that our AI is dealing with, but rather conditional probability, probability where rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works. So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us. So what does this look like? Well, it looks like this in terms of notation. We’re going to represent conditional probability as probability of A and then this vertical bar and then B. And the way to read this is the thing on the left-hand side of the vertical bar is what we want the probability of. Here now, I want the probability that A is true, that it is the real world, that it is the event that actually does take place. And then on the right side of the vertical bar is our evidence, the information that we already know for certain about the world. For example, that B is true. So the way to read this entire expression is what is the probability of A given B, the probability that A is true, given that we already know that B is true. And this type of judgment, conditional probability, the probability of one thing given some other fact, comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract. What is the chance that today it rains? But usually, we have some additional evidence. I know for certain that it rained yesterday. And so I would like to calculate the probability that it rains today given that I know that it rained yesterday. Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic condition. So whether or not traffic conditions change, that might change the probability that this route is actually the optimal route. Or you might imagine in a medical context, I want to know the probability that a patient has a particular disease given some results of some tests that have been performed on that patient. And I have some evidence, the results of that test, and I would like to know the probability that a patient has a particular disease. So this notion of conditional probability comes up everywhere. So we begin to think about what we would like to reason about, but being able to reason a little more intelligently by taking into account evidence that we already have. We’re more able to get an accurate result for what is the likelihood that someone has this disease if we know this evidence, the results of the test, as opposed to if we were just calculating the unconditional probability of saying, what is the probability they have the disease without any evidence to try and back up our result one way or the other. So now that we’ve got this idea of what conditional probability is, the next question we have to ask is, all right, how do we calculate conditional probability? How do we figure out mathematically, if I have an expression like this, how do I get a number from that? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this. The probability of a given b, the probability that a is true, given that we know that b is true, is equal to this fraction, the probability that a and b are true, divided by just the probability that b is true. And the way to intuitively try to think about this is that if I want to know the probability that a is true, given that b is true, well, I want to consider all the ways they could both be true out of the only worlds that I care about are the worlds where b is already true. I can sort of ignore all the cases where b isn’t true, because those aren’t relevant to my ultimate computation. They’re not relevant to what it is that I want to get information about. So let’s take a look at an example. Let’s go back to that example of rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that if I roll two dice and they sum to 12 is 1 out of 36, because out of the 36 possible worlds that I might care about, in only one of them is the sum of those two dice 12. It’s only when red is 6 and blue is also 6. But let’s say now that I have some additional information. I now want to know what is the probability that the two dice sum to 12, given that I know that the red die was a 6. So I already have some evidence. I already know the red die is a 6. I don’t know what the blue die is. That information isn’t given to me in this expression. But given the fact that I know that the red die rolled a 6, what is the probability that we sum to 12? And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities, all of the possible combinations of red die being 1 through 6 and blue die being 1 through 6. And I might consider first, all right, what is the probability of my evidence, my B variable, where I want to know, what is the probability that the red die is a 6? Well, the probability that the red die is a 6 is just 1 out of 6. So these 1 out of 6 options are really the only worlds that I care about here now. All the rest of them are irrelevant to my calculation, because I already have this evidence that the red die was a 6, so I don’t need to care about all of the other possibilities that could result. So now, in addition to the fact that the red die rolled as a 6 and the probability of that, the other piece of information I need to know in order to calculate this conditional probability is the probability that both of my variables, A and B, are true. The probability that both the red die is a 6, and they all sum to 12. So what is the probability that both of these things happen? Well, it only happens in one possible case in 1 out of these 36 cases, and it’s the case where both the red and the blue die are equal to 6. This is a piece of information that we already knew. And so this probability is equal to 1 over 36. And so to get the conditional probability that the sum is 12, given that I know that the red dice is equal to 6, well, I just divide these two values together, and 1 over 36 divided by 1 over 6 gives us this probability of 1 over 6. Given that I know that the red die rolled a value of 6, the probability that the sum of the two dice is 12 is also 1 over 6. And that probably makes intuitive sense to you, too, because if the red die is a 6, the only way for me to get to a 12 is if the blue die also rolls a 6, and we know that the probability of the blue die rolling a 6 is 1 over 6. So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of these events take place is an idea that’s going to come up again and again. This is the definition now of conditional probability. And we’re going to use that definition as we think about probability more generally to be able to draw conclusions about the world. This, again, is that formula. The probability of A given B is equal to the probability that A and B take place divided by the probability of B. And you’ll see this formula sometimes written in a couple of different ways. You could imagine algebraically multiplying both sides of this equation by probability of B to get rid of the fraction, and you’ll get an expression like this. The probability of A and B, which is this expression over here, is just the probability of B times the probability of A given B. Or you could represent this equivalently since A and B in this expression are interchangeable. A and B is the same thing as B and A. You could imagine also representing the probability of A and B as the probability of A times the probability of B given A, just switching all of the A’s and B’s. These three are all equivalent ways of trying to represent what joint probability means. And so you’ll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world. Now, sometimes when we deal with probability, we don’t just care about a Boolean event like did this happen or did this not happen. Sometimes we might want the ability to represent variable values in a probability space where some variable might take on multiple different possible values. And in probability, we call a variable in probability theory a random variable. A random variable in probability is just some variable in probability theory that has some domain of values that it can take on. So what do I mean by this? Well, what I mean is I might have a random variable that is just called roll, for example, that has six possible values. Roll is my variable, and the possible values, the domain of values that it can take on are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same. But in other random variables, that might not be the case. For example, I might have a random variable to represent the weather, for example, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy. And each of those might have a different probability. And I care about knowing what is the probability that the weather equals sun or that the weather equals clouds, for instance. And I might like to do some mathematical calculations based on that information. Other random variables might be something like traffic. What are the odds that there is no traffic or light traffic or heavy traffic? Traffic, in this case, is my random variable. And the values that that random variable can take on are here. It’s either none or light or heavy. And I, the person doing these calculations, I, the person encoding these random variables into my computer, need to make the decision as to what these possible values actually are. You might imagine, for example, for a flight. If I care about whether or not I make it or do a flight on time, my flight has a couple of possible values that it could take on. My flight could be on time. My flight could be delayed. My flight could be canceled. So flight, in this case, is my random variable. And these are the values that it can take on. And often, I want to know something about the probability that my random variable takes on each of those possible values. And this is what we then call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, for example, my probability distribution might look something like this. My probability distribution says the probability that the random variable flight is equal to the value on time is 0.6. Or otherwise, put into more English human-friendly terms, the likelihood that my flight is on time is 60%, for example. And in this case, the probability that my flight is delayed is 30%. The probability that my flight is canceled is 10% or 0.1. And if you sum up all of these possible values, the sum is going to be 1, right? If you take all of the possible worlds, here are my three possible worlds for the value of the random variable flight, add them all up together, the result needs to be the number 1 per that axiom of probability theory that we’ve discussed before. So this now is one way of representing this probability distribution for the random variable flight. Sometimes you’ll see it represented a little bit more concisely that this is pretty verbose for really just trying to express three possible values. And so often, you’ll instead see the same notation representing using a vector. And all a vector is is a sequence of values. As opposed to just a single value, I might have multiple values. And so I could extend instead, represent this idea this way. Bold p, so a larger p, generally meaning the probability distribution of this variable flight is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1. And I would just have to know that this probability distribution is in order of on time or delayed and canceled to know how to interpret this vector. To mean the first value in the vector is the probability that my flight is on time. The second value in the vector is the probability that my flight is delayed. And the third value in the vector is the probability that my flight is canceled. And so this is just an alternate way of representing this idea, a little more verbosely. But oftentimes, you’ll see us just talk about a probability distribution over a random variable. And whenever we talk about that, what we’re really doing is trying to figure out the probabilities of each of the possible values that that random variable can take on. But this notation is just a little bit more succinct, even though it can sometimes be a little confusing, depending on the context in which you see it. So we’ll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place. A couple of other important ideas to know with regards to probability theory. One is this idea of independence. And independence refers to the idea that the knowledge of one event doesn’t influence the probability of another event. So for example, in the context of my two dice rolls, where I had the red die and the blue die, the probability that I roll the red die and the blue die, those two events, red die and blue die, are independent. Knowing the result of the red die doesn’t change the probabilities for the blue die. It doesn’t give me any additional information about what the value of the blue die is ultimately going to be. But that’s not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent. But if it is cloudy, that might increase the probability that later in the day it’s going to rain. So some information informs some other event or some other random variable. So independence refers to the idea that one event doesn’t influence the other. And if they’re not independent, then there might be some relationship. So mathematically, formally, what does independence actually mean? Well, recall this formula from before, that the probability of A and B is the probability of A times the probability of B given A. And the more intuitive way to think about this is that to know how likely it is that A and B happen, well, let’s first figure out the likelihood that A happens. And then given that we know that A happens, let’s figure out the likelihood that B happens and multiply those two things together. But if A and B were independent, meaning knowing A doesn’t change anything about the likelihood that B is true, well, then the probability of B given A, meaning the probability that B is true, given that I know A is true, well, that I know A is true shouldn’t really make a difference if these two things are independent, that A shouldn’t influence B at all. So the probability of B given A is really just the probability of B. If it is true that A and B are independent. And so this right here is one example of a definition for what it means for A and B to be independent. The probability of A and B is just the probability of A times the probability of B. Anytime you find two events A and B where this relationship holds, then you can say that A and B are independent. So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a 6 and blue being a 6, well, that’s just the probability that red is a 6 multiplied by the probability that blue is a 6. It’s both equal to 1 over 36. So I can say that these two events are independent. What wouldn’t be independent, for example, would be an example. So this, for example, has a probability of 1 over 36, as we talked about before. But what wouldn’t be independent would be a case like this, the probability that the red die rolls a 6 and the red die rolls a 4. If you just naively took, OK, red die 6, red die 4, well, if I’m only rolling the die once, you might imagine the naive approach is to say, well, each of these has a probability of 1 over 6. So multiply them together, and the probability is 1 over 36. But of course, if you’re only rolling the red die once, there’s no way you could get two different values for the red die. It couldn’t both be a 6 and a 4. So the probability should be 0. But if you were to multiply probability of red 6 times probability of red 4, well, that would equal 1 over 36. But of course, that’s not true. Because we know that there is no way, probability 0, that when we roll the red die once, we get both a 6 and a 4, because only one of those possibilities can actually be the result. And so we can say that the event that red roll is 6 and the event that red roll is 4, those two events are not independent. If I know that the red roll is a 6, I know that the red roll cannot possibly be a 4, so these things are not independent. And instead, if I wanted to calculate the probability, I would need to use this conditional probability as the regular definition of the probability of two events taking place. And the probability of this now, well, the probability of the red roll being a 6, that’s 1 over 6. But what’s the probability that the roll is a 4 given that the roll is a 6? Well, this is just 0, because there’s no way for the red roll to be a 4, given that we already know the red roll is a 6. And so the value, if we do add all that multiplication, is we get the number 0. So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way. And this gets us to one of the most important rules in probability theory, which is known as Bayes rule. And it turns out that just using the information we’ve already learned about probability and just applying a little bit of algebra, we can actually derive Bayes rule for ourselves. But it’s a very important rule when it comes to inference and thinking about probability in the context of what it is that a computer can do or what a mathematician could do by having access to information about probability. So let’s go back to these equations to be able to derive Bayes rule ourselves. We know the probability of A and B, the likelihood that A and B take place, is the likelihood of B, and then the likelihood of A, given that we know that B is already true. And likewise, the probability of A given A and B is the probability of A times the probability of B, given that we know that A is already true. This is sort of a symmetric relationship where it doesn’t matter the order of A and B and B and A mean the same thing. And so in these equations, we can just swap out A and B to be able to represent the exact same idea. So we know that these two equations are already true. We’ve seen that already. And now let’s just do a little bit of algebraic manipulation of this stuff. Both of these expressions on the right-hand side are equal to the probability of A and B. So what I can do is take these two expressions on the right-hand side and just set them equal to each other. If they’re both equal to the probability of A and B, then they both must be equal to each other. So probability of A times probability of B given A is equal to the probability of B times the probability of A given B. And now all we’re going to do is do a little bit of division. I’m going to divide both sides by P of A. And now I get what is Bayes’ rule. The probability of B given A is equal to the probability of B times the probability of A given B divided by the probability of A. And sometimes in Bayes’ rule, you’ll see the order of these two arguments switched. So instead of B times A given B, it’ll be A given B times B. That ultimately doesn’t matter because in multiplication, you can switch the order of the two things you’re multiplying, and it doesn’t change the result. But this here right now is the most common formulation of Bayes’ rule. The probability of B given A is equal to the probability of A given B times the probability of B divided by the probability of A. And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability, the conditional probability of B given A, using knowledge about the probability of A given B, using the reverse of that conditional probability. So let’s first do a little bit of an example with this, just to see how we might use it, and then explore what this means a little bit more generally. So we’re going to construct a situation where I have some information. There are two events that I care about, the idea that it’s cloudy in the morning and the idea that it is rainy in the afternoon. Those are two different possible events that could take place, cloudy in the morning, or the AM, rainy in the PM. And what I care about is, given clouds in the morning, what is the probability of rain in the afternoon? A reasonable question I might ask, in the morning, I look outside, or an AI’s camera looks outside and sees that there are clouds in the morning. And we want to conclude, we want to figure out what is the probability that in the afternoon, there is going to be rain. Of course, in the abstract, we don’t have access to this kind of information, but we can use data to begin to try and figure this out. So let’s imagine now that I have access to some pieces of information. I have access to the idea that 80% of rainy afternoons start out with a cloudy morning. And you might imagine that I could have gathered this data just by looking at data over a sequence of time, that I know that 80% of the time when it’s raining in the afternoon, it was cloudy that morning. I also know that 40% of days have cloudy mornings. And I also know that 10% of days have rainy afternoons. And now using this information, I would like to figure out, given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds. And I can do that, in particular, using this fact, the probability of, so if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So using sort of the reverse conditional probability, I can figure that out. Expressed in terms of Bayes rule, this is what that would look like. Probability of rain given clouds is the probability of clouds given rain times the probability of rain divided by the probability of clouds. Here I’m just substituting in for the values of a and b from that equation of Bayes rule from before. And then I can just do the math. I have this information. I know that 80% of the time, if it was raining, then there were clouds in the morning. So 0.8 here. Probability of rain is 0.1, because 10% of days were rainy, and 40% of days were cloudy. I do the math, and I can figure out the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case. And this now is an application of Bayes rule, the idea that using one conditional probability, we can get the reverse conditional probability. And this is often useful when one of the conditional probabilities might be easier for us to know about or easier for us to have data about. And using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, for example, more generally, if we know the probability of some visible effect, some effect that we can see and observe, given some unknown cause that we’re not sure about, well, then we can calculate the probability of that unknown cause given the visible effect. So what might that look like? Well, in the context of medicine, for example, I might know the probability of some medical test result given a disease. Like, I know that if someone has a disease, then x% of the time the medical test result will show up as this, for instance. And using that information, then I can calculate, all right, what is the probability that given I know the medical test result, what is the likelihood that someone has the disease? This is the piece of information that is usually easier to know, easier to immediately have access to data for. And this is the information that I actually want to calculate. Or I might want to know, for example, if I know that some probability of counterfeit bills have blurry text around the edges, because counterfeit printers aren’t nearly as good at printing text precisely. So I have some information about, given that something is a counterfeit bill, like x% of counterfeit bills have blurry text, for example. And using that information, then I can calculate some piece of information that I might want to know, like, given that I know there’s blurry text on a bill, what is the probability that that bill is counterfeit? So given one conditional probability, I can calculate the other conditional probability as well. And so now we’ve taken a look at a couple of different types of probability. And we’ve looked at unconditional probability, where I just look at what is the probability of this event occurring, given no additional evidence that I might have access to. And we’ve also looked at conditional probability, where I have some sort of evidence, and I would like to, using that evidence, be able to calculate some other probability as well. And the other kind of probability that will be important for us to think about is joint probability. And this is when we’re considering the likelihood of multiple different events simultaneously. And so what do we mean by this? For example, I might have probability distributions that look a little something like this. Like, oh, I want to know the probability distribution of clouds in the morning. And that distribution looks like this. 40% of the time, C, which is my random variable here, is equal to it’s cloudy. And 60% of the time, it’s not cloudy. So here is just a simple probability distribution that is effectively telling me that 40% of the time, it’s cloudy. I might also have a probability distribution for rain in the afternoon, where 10% of the time, or with probability 0.1, it is raining in the afternoon. And with probability 0.9, it is not raining in the afternoon. And using just these two pieces of information, I don’t actually have a whole lot of information about how these two variables relate to each other. But I could if I had access to their joint probability, meaning for every combination of these two things, meaning morning cloudy and afternoon rain, morning cloudy and afternoon not rain, morning not cloudy and afternoon rain, and morning not cloudy and afternoon not raining, if I had access to values for each of those four, I’d have more information. So information that’d be organized in a table like this, and this, rather than just a probability distribution, is a joint probability distribution. It tells me the probability distribution of each of the possible combinations of values that these random variables can take on. So if I want to know what is the probability that on any given day it is both cloudy and rainy, well, I would say, all right, we’re looking at cases where it is cloudy and cases where it is raining. And the intersection of those two, that row in that column, is 0.08. So that is the probability that it is both cloudy and rainy using that information. And using this conditional probability table, using this joint probability table, I can begin to draw other pieces of information about things like conditional probability. So I might ask a question like, what is the probability distribution of clouds given that I know that it is raining? Meaning I know for sure that it’s raining. Tell me the probability distribution over whether it’s cloudy or not, given that I know already that it is, in fact, raining. And here I’m using C to stand for that random variable. I’m looking for a distribution, meaning the answer to this is not going to be a single value. It’s going to be two values, a vector of two values, where the first value is probability of clouds, the second value is probability that it is not cloudy, but the sum of those two values is going to be 1. Because when you add up the probabilities of all of the possible worlds, the result that you get must be the number 1. And well, what do we know about how to calculate a conditional probability? Well, we know that the probability of A given B is the probability of A and B divided by the probability of B. So what does this mean? Well, it means that I can calculate the probability of clouds given that it’s raining as the probability of clouds and raining divided by the probability of rain. And this comma here for the probability distribution of clouds and rain, this comma sort of stands in for the word and. You’ll sort of see in the logical operator and and the comma used interchangeably. This means the probability distribution over the clouds and knowing the fact that it is raining divided by the probability of rain. And the interesting thing to note here and what we’ll often do in order to simplify our mathematics is that dividing by the probability of rain, the probability of rain here is just some numerical constant. It is some number. Dividing by probability of rain is just dividing by some constant, or in other words, multiplying by the inverse of that constant. And it turns out that oftentimes we can just not worry about what the exact value of this is and just know that it is, in fact, a constant value. And we’ll see why in a moment. So instead of expressing this as this joint probability divided by the probability of rain, sometimes we’ll just represent it as alpha times the numerator here, the probability distribution of C, this variable, and that we know that it is raining, for instance. So all we’ve done here is said this value of 1 over the probability of rain, that’s really just a constant we’re going to divide by or equivalently multiply by the inverse of at the end. We’ll just call it alpha for now and deal with it a little bit later. But the key idea here now, and this is an idea that’s going to come up again, is that the conditional distribution of C given rain is proportional to, meaning just some factor multiplied by the joint probability of C and rain being true. And so how do we figure this out? Well, this is going to be the probability that it is cloudy given that it’s raining, which is 0.08, and the probability that it’s not cloudy given that it’s raining, which is 0.02. And so we get alpha times here now is that probability distribution. 0.08 is clouds and rain. 0.02 is not cloudy and rain. But of course, 0.08 and 0.02 don’t sum up to the number 1. And we know that in a probability distribution, if you consider all of the possible values, they must sum up to a probability of 1. And so we know that we just need to figure out some constant to normalize, so to speak, these values, something we can multiply or divide by to get it so that all these probabilities sum up to 1, and it turns out that if we multiply both numbers by 10, then we can get that result of 0.8 and 0.2. The proportions are still equivalent, but now 0.8 plus 0.2, those sum up to the number 1. So take a look at this and see if you can understand step by step how it is we’re getting from one point to another. The key idea here is that by using the joint probabilities, these probabilities that it is both cloudy and rainy and that it is not cloudy and rainy, I can take that information and figure out the conditional probability given that it’s raining. What is the chance that it’s cloudy versus not cloudy? Just by multiplying by some normalization constant, so to speak. And this is what a computer can begin to use to be able to interact with these various different types of probabilities. And it turns out there are a number of other probability rules that are going to be useful to us as we begin to explore how we can actually use this information to encode into our computers some more complex analysis that we might want to do about probability and distributions and random variables that we might be interacting with. So here are a couple of those important probability rules. One of the simplest rules is just this negation rule. What is the probability of not event A? So A is an event that has some probability, and I would like to know what is the probability that A does not occur. And it turns out it’s just 1 minus P of A, which makes sense. Because if those are the two possible cases, either A happens or A doesn’t happen, then when you add up those two cases, you must get 1, which means that P of not A must just be 1 minus P of A. Because P of A and P of not A must sum up to the number 1. They must include all of the possible cases. We’ve seen an expression for calculating the probability of A and B. We might also reasonably want to calculate the probability of A or B. What is the probability that one thing happens or another thing happens? So for example, I might want to calculate what is the probability that if I roll two dice, a red die and a blue die, what is the likelihood that A is a 6 or B is a 6, like one or the other? And what you might imagine you could do, and the wrong way to approach it, would be just to say, all right, well, A comes up as a 6 with the red die comes up as a 6 with probability 1 over 6. The same for the blue die, it’s also 1 over 6. Add them together, and you get 2 over 6, otherwise known as 1 third. But this suffers from a problem of over counting, that we’ve double counted the case, where both A and B, both the red die and the blue die, both come up as a 6-roll. And I’ve counted that instance twice. So to resolve this, the actual expression for calculating the probability of A or B uses what we call the inclusion-exclusion formula. So I take the probability of A, add it to the probability of B. That’s all same as before. But then I need to exclude the cases that I’ve double counted. So I subtract from that the probability of A and B. And that gets me the result for A or B. I consider all the cases where A is true and all the cases where B is true. And if you imagine this is like a Venn diagram of cases where A is true, cases where B is true, I just need to subtract out the middle to get rid of the cases that I have overcounted by double counting them inside of both of these individual expressions. One other rule that’s going to be quite helpful is a rule called marginalization. So marginalization is answering the question of how do I figure out the probability of A using some other variable that I might have access to, like B? Even if I don’t know additional information about it, I know that B, some event, can have two possible states, either B happens or B doesn’t happen, assuming it’s a Boolean, true or false. And well, what that means is that for me to be able to calculate the probability of A, there are only two cases. Either A happens and B happens, or A happens and B doesn’t happen. And those are two disjoint, meaning they can’t both happen together. Either B happens or B doesn’t happen. They’re disjoint or separate cases. And so I can figure out the probability of A just by adding up those two cases. The probability that A is true is the probability that A and B is true, plus the probability that A is true and B isn’t true. So by marginalizing, I’ve looked at the two possible cases that might take place, either B happens or B doesn’t happen. And in either of those cases, I look at what’s the probability that A happens. And if I add those together, well, then I get the probability that A happens as a whole. So take a look at that rule. It doesn’t matter what B is or how it’s related to A. So long as I know these joint distributions, I can figure out the overall probability of A. And this can be a useful way if I have a joint distribution, like the joint distribution of A and B, to just figure out some unconditional probability, like the probability of A. And we’ll see examples of this soon as well. Now, sometimes these might not just be random, might not just be variables that are events that are like they happened or they didn’t happen, like B is here. They might be some broader probability distribution where there are multiple possible values. And so here, in order to use this marginalization rule, I need to sum up not just over B and not B, but for all of the possible values that the other random variable could take on. And so here, we’ll see a version of this rule for random variables. And it’s going to include that summation notation to indicate that I’m summing up, adding up a whole bunch of individual values. So here’s the rule. Looks a lot more complicated, but it’s actually the equivalent exactly the same rule. What I’m saying here is that if I have two random variables, one called x and one called y, well, the probability that x is equal to some value x sub i, this is just some value that this variable takes on. How do I figure it out? Well, I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj. So the exact same rule, the only difference here is now I’m summing up over all of the possible values that y can take on, saying let’s add up all of those possible cases and look at this joint distribution, this joint probability, that x takes on the value I care about, given all of the possible values for y. And if I add all those up, then I can get this unconditional probability of what x is equal to, whether or not x is equal to some value x sub i. So let’s take a look at this rule, because it does look a little bit complicated. Let’s try and put a concrete example to it. Here again is that same joint distribution from before. I have cloud, not cloudy, rainy, not rainy. And maybe I want to access some variable. I want to know what is the probability that it is cloudy. Well, marginalization says that if I have this joint distribution and I want to know what is the probability that it is cloudy, well, I need to consider the other variable, the variable that’s not here, the idea that it’s rainy. And I consider the two cases, either it’s raining or it’s not raining. And I just sum up the values for each of those possibilities. In other words, the probability that it is cloudy is equal to the sum of the probability that it’s cloudy and it’s rainy and the probability that it’s cloudy and it is not raining. And so these now are values that I have access to. These are values that are just inside of this joint probability table. What is the probability that it is both cloudy and rainy? Well, it’s just the intersection of these two here, which is 0.08. And the probability that it’s cloudy and not raining is, all right, here’s cloudy, here’s not raining. It’s 0.32. So it’s 0.08 plus 0.32, which just gives us equal to 0.4. That is the unconditional probability that it is, in fact, cloudy. And so marginalization gives us a way to go from these joint distributions to just some individual probability that I might care about. And you’ll see a little bit later why it is that we care about that and why that’s actually useful to us as we begin doing some of these calculations. Last rule we’ll take a look at before transitioning to something a little bit different is this rule of conditioning, very similar to the marginalization rule. But it says that, again, if I have two events, a and b, but instead of having access to their joint probabilities, I have access to their conditional probabilities, how they relate to each other. Well, again, if I want to know the probability that a happens, and I know that there’s some other variable b, either b happens or b doesn’t happen, and so I can say that the probability of a is the probability of a given b times the probability of b, meaning b happened. And given that I know b happened, what’s the likelihood that a happened? And then I consider the other case, that b didn’t happen. So here’s the probability that b didn’t happen. And here’s the probability that a happens, given that I know that b didn’t happen. And this is really the equivalent rule just using conditional probability instead of joint probability, where I’m saying let’s look at both of these two cases and condition on b. Look at the case where b happens, and look at the case where b doesn’t happen, and look at what probabilities I get as a result. And just as in the case of marginalization, where there was an equivalent rule for random variables that could take on multiple possible values in a domain of possible values, here, too, conditioning has the same equivalent rule. Again, there’s a summation to mean I’m summing over all of the possible values that some random variable y could take on. But if I want to know what is the probability that x takes on this value, then I’m going to sum up over all the values j that y could take on, and say, all right, what’s the chance that y takes on that value yj? And multiply it by the conditional probability that x takes on this value, given that y took on that value yj. So equivalent rule just using conditional probabilities instead of joint probabilities. And using the equation we know about joint probabilities, we can translate between these two. So all right, we’ve seen a whole lot of mathematics, and we’ve just laid the foundation for mathematics. And no need to worry if you haven’t seen probability in too much detail up until this point. These are the foundations of the ideas that are going to come up as we begin to explore how we can now take these ideas from probability and begin to apply them to represent something inside of our computer, something inside of the AI agent we’re trying to design that is able to represent information and probabilities and the likelihoods between various different events. So there are a number of different probabilistic models that we can generate, but the first of the models we’re going to talk about are what are known as Bayesian networks. And a Bayesian network is just going to be some network of random variables, connected random variables that are going to represent the dependence between these random variables. The odds are most random variables in this world are not independent from each other, but there’s some relationship between things that are happening that we care about. If it is rainy today, that might increase the likelihood that my flight or my train gets delayed, for example. There are some dependence between these random variables, and a Bayesian network is going to be able to capture those dependencies. So what is a Bayesian network? What is its actual structure, and how does it work? Well, a Bayesian network is going to be a directed graph. And again, we’ve seen directed graphs before. They are individual nodes with arrows or edges that connect one node to another node pointing in a particular direction. And so this directed graph is going to have nodes as well, where each node in this directed graph is going to represent a random variable, something like the weather, or something like whether my train was on time or delayed. And we’re going to have an arrow from a node x to a node y to mean that x is a parent of y. So that’ll be our notation. If there’s an arrow from x to y, x is going to be considered a parent of y. And the reason that’s important is because each of these nodes is going to have a probability distribution that we’re going to store along with it, which is the distribution of x given some evidence, given the parents of x. So the way to more intuitively think about this is the parents seem to be thought of as sort of causes for some effect that we’re going to observe. And so let’s take a look at an actual example of a Bayesian network and think about the types of logic that might be involved in reasoning about that network. Let’s imagine for a moment that I have an appointment out of town, and I need to take a train in order to get to that appointment. So what are the things I might care about? Well, I care about getting to my appointment on time. Whether I make it to my appointment and I’m able to attend it or I miss the appointment. And you might imagine that that’s influenced by the train, that the train is either on time or it’s delayed, for example. But that train itself is also influenced. Whether the train is on time or not depends maybe on the rain. Is there no rain? Is it light rain? Is there heavy rain? And it might also be influenced by other variables too. It might be influenced as well by whether or not there’s maintenance on the train track, for example. If there is maintenance on the train track, that probably increases the likelihood that my train is delayed. And so we can represent all of these ideas using a Bayesian network that looks a little something like this. Here I have four nodes representing four random variables that I would like to keep track of. I have one random variable called rain that can take on three possible values in its domain, either none or light or heavy, for no rain, light rain, or heavy rain. I have a variable called maintenance for whether or not there is maintenance on the train track, which it has two possible values, just either yes or no. Either there is maintenance or there’s no maintenance happening on the track. Then I have a random variable for the train indicating whether or not the train was on time or not. That random variable has two possible values in its domain. The train is either on time or the train is delayed. And then finally, I have a random variable for whether I make it to my appointment. For my appointment down here, I have a random variable called appointment that itself has two possible values, attend and miss. And so here are the possible values. Here are my four nodes, each of which represents a random variable, each of which has a domain of possible values that it can take on. And the arrows, the edges pointing from one node to another, encode some notion of dependence inside of this graph, that whether I make it to my appointment or not is dependent upon whether the train is on time or delayed. And whether the train is on time or delayed is dependent on two things given by the two arrows pointing at this node. It is dependent on whether or not there was maintenance on the train track. And it is also dependent upon whether or not it was raining or whether it is raining. And just to make things a little complicated, let’s say as well that whether or not there is maintenance on the track, this too might be influenced by the rain. That if there’s heavier rain, well, maybe it’s less likely that it’s going to be maintenance on the train track that day because they’re more likely to want to do maintenance on the track on days when it’s not raining, for example. And so these nodes might have different relationships between them. But the idea is that we can come up with a probability distribution for any of these nodes based only upon its parents. And so let’s look node by node at what this probability distribution might actually look like. And we’ll go ahead and begin with this root node, this rain node here, which is at the top, and has no arrows pointing into it, which means its probability distribution is not going to be a conditional distribution. It’s not based on anything. I just have some probability distribution over the possible values for the rain random variable. And that distribution might look a little something like this. None, light and heavy, each have a possible value. Here I’m saying the likelihood of no rain is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. So here is a probability distribution for this root node in this Bayesian network. And let’s now consider the next node in the network, maintenance. Track maintenance is yes or no. And the general idea of what this distribution is going to encode, at least in this story, is the idea that the heavier the rain is, the less likely it is that there’s going to be maintenance on the track. Because the people that are doing maintenance on the track probably want to wait until a day when it’s not as rainy in order to do the track maintenance, for example. And so what might that probability distribution look like? Well, this now is going to be a conditional probability distribution, that here are the three possible values for the rain random variable, which I’m here just going to abbreviate to R, either no rain, light rain, or heavy rain. And for each of those possible values, either there is yes track maintenance or no track maintenance. And those have probabilities associated with them. That I see here that if it is not raining, then there is a probability of 0.4 that there’s track maintenance and a probability of 0.6 that there isn’t. But if there’s heavy rain, then here the chance that there is track maintenance is 0.1 and the chance that there is not track maintenance is 0.9. Each of these rows is going to sum up to 1. Because each of these represent different values of whether or not it’s raining, the three possible values that that random variable can take on. And each is associated with its own probability distribution that is ultimately all going to add up to the number 1. So that there is our distribution for this random variable called maintenance, about whether or not there is maintenance on the train track. And now let’s consider the next variable. Here we have a node inside of our Bayesian network called train that has two possible values, on time and delayed. And this node is going to be dependent upon the two nodes that are pointing towards it, that whether or not the train is on time or delayed depends on whether or not there is track maintenance. And it depends on whether or not there is rain, that heavier rain probably means more likely that my train is delayed. And if there is track maintenance, that also probably means it’s more likely that my train is delayed as well. And so you could construct a larger probability distribution, a conditional probability distribution, that instead of conditioning on just one variable, as was the case here, is now conditioning on two variables, conditioning both on rain represented by r and on maintenance represented by yes. Again, each of these rows has two values that sum up to the number 1, one for whether the train is on time, one for whether the train is delayed. And here I can say something like, all right, if I know there was light rain and track maintenance, well, OK, that would be r is light and m is yes. Well, then there is a probability of 0.6 that my train is on time, and a probability of 0.4 the train is delayed. And you can imagine gathering this data just by looking at real world data, looking at data about, all right, if I knew that it was light rain and there was track maintenance, how often was a train delayed or not delayed? And you could begin to construct this thing. The interesting thing is intelligently, being able to try to figure out how might you go about ordering these things, what things might influence other nodes inside of this Bayesian network. And the last thing I care about is whether or not I make it to my appointment. So did I attend or miss the appointment? And ultimately, whether I attend or miss the appointment, it is influenced by track maintenance, because it’s indirectly this idea that, all right, if there is track maintenance, well, then my train might more likely be delayed. And if my train is more likely to be delayed, then I’m more likely to miss my appointment. But what we encode in this Bayesian network are just what we might consider to be more direct relationships. So the train has a direct influence on the appointment. And given that I know whether the train is on time or delayed, knowing whether there’s track maintenance isn’t going to give me any additional information that I didn’t already have. That if I know train, these other nodes that are up above isn’t really going to influence the result. And so here we might represent it using another conditional probability distribution that looks a little something like this. The train can take on two possible values. Either my train is on time or my train is delayed. And for each of those two possible values, I have a distribution for what are the odds that I’m able to attend the meeting and what are the odds that I missed the meeting. And obviously, if my train is on time, I’m much more likely to be able to attend the meeting than if my train is delayed, in which case I’m more likely to miss that meeting. So all of these nodes put all together here represent this Bayesian network, this network of random variables whose values I ultimately care about, and that have some sort of relationship between them, some sort of dependence where these arrows from one node to another indicate some dependence, that I can calculate the probability of some node given the parents that happen to exist there. So now that we’ve been able to describe the structure of this Bayesian network and the relationships between each of these nodes by associating each of the nodes in the network with a probability distribution, whether that’s an unconditional probability distribution in the case of this root node here, like rain, and a conditional probability distribution in the case of all of the other nodes whose probabilities are dependent upon the values of their parents, we can begin to do some computation and calculation using the information inside of that table. So let’s imagine, for example, that I just wanted to compute something simple like the probability of light rain. How would I get the probability of light rain? Well, light rain, rain here is a root node. And so if I wanted to calculate that probability, I could just look at the probability distribution for rain and extract from it the probability of light rains, just a single value that I already have access to. But we could also imagine wanting to compute more complex joint probabilities, like the probability that there is light rain and also no track maintenance. This is a joint probability of two values, light rain and no track maintenance. And the way I might do that is first by starting by saying, all right, well, let me get the probability of light rain. But now I also want the probability of no track maintenance. But of course, this node is dependent upon the value of rain. So what I really want is the probability of no track maintenance, given that I know that there was light rain. And so the expression for calculating this idea that the probability of light rain and no track maintenance is really just the probability of light rain and the probability that there is no track maintenance, given that I know that there already is light rain. So I take the unconditional probability of light rain, multiply it by the conditional probability of no track maintenance, given that I know there is light rain. And you can continue to do this again and again for every variable that you want to add into this joint probability that I might want to calculate. If I wanted to know the probability of light rain and no track maintenance and a delayed train, well, that’s going to be the probability of light rain, multiplied by the probability of no track maintenance, given light rain, multiplied by the probability of a delayed train, given light rain and no track maintenance. Because whether the train is on time or delayed is dependent upon both of these other two variables. And so I have two pieces of evidence that go into the calculation of that conditional probability. And each of these three values is just a value that I can look up by looking at one of these individual probability distributions that is encoded into my Bayesian network. And if I wanted a joint probability over all four of the variables, something like the probability of light rain and no track maintenance and a delayed train and I miss my appointment, well, that’s going to be multiplying four different values, one from each of these individual nodes. It’s going to be the probability of light rain, then of no track maintenance given light rain, then of a delayed train, given light rain and no track maintenance. And then finally, for this node here, for whether I make it to my appointment or not, it’s not dependent upon these two variables, given that I know whether or not the train is on time. I only need to care about the conditional probability that I miss my train, or that I miss my appointment, given that the train happens to be delayed. And so that’s represented here by four probabilities, each of which is located inside of one of these probability distributions for each of the nodes, all multiplied together. And so I can take a variable like that and figure out what the joint probability is by multiplying a whole bunch of these individual probabilities from the Bayesian network. But of course, just as with last time, where what I really wanted to do was to be able to get new pieces of information, here, too, this is what we’re going to want to do with our Bayesian network. In the context of knowledge, we talked about the problem of inference. Given things that I know to be true, can I draw conclusions, make deductions about other facts about the world that I also know to be true? And what we’re going to do now is apply the same sort of idea to probability. Using information about which I have some knowledge, whether some evidence or some probabilities, can I figure out not other variables for certain, but can I figure out the probabilities of other variables taking on particular values? And so here, we introduce the problem of inference in a probabilistic setting, in a case where variables might not necessarily be true for sure, but they might be random variables that take on different values with some probability. So how do we formally define what exactly this inference problem actually is? Well, the inference problem has a couple of parts to it. We have some query, some variable x that we want to compute the distribution for. Maybe I want the probability that I miss my train, or I want the probability that there is track maintenance, something that I want information about. And then I have some evidence variables. Maybe it’s just one piece of evidence. Maybe it’s multiple pieces of evidence. But I’ve observed certain variables for some sort of event. So for example, I might have observed that it is raining. This is evidence that I have. I know that there is light rain, or I know that there is heavy rain. And that is evidence I have. And using that evidence, I want to know what is the probability that my train is delayed, for example. And that is a query that I might want to ask based on this evidence. So I have a query, some variable. Evidence, which are some other variables that I have observed inside of my Bayesian network. And of course, that does leave some hidden variables. Why? These are variables that are not evidence variables and not query variables. So you might imagine in the case where I know whether or not it’s raining, and I want to know whether my train is going to be delayed or not, the hidden variable, the thing I don’t have access to, is something like, is there maintenance on the track? Or am I going to make or not make my appointment, for example? These are variables that I don’t have access to. They’re hidden because they’re not things I observed, and they’re also not the query, the thing that I’m asking. And so ultimately, what we want to calculate is I want to know the probability distribution of x given e, the event that I observed. So given that I observed some event, I observed that it is raining, I would like to know what is the distribution over the possible values of the train random variable. Is it on time? Is it delayed? What’s the likelihood it’s going to be there? And it turns out we can do this calculation just using a lot of the probability rules that we’ve already seen in action. And ultimately, we’re going to take a look at the math at a little bit of a high level, at an abstract level. But ultimately, we can allow computers and programming libraries that already exist to begin to do some of this math for us. But it’s good to get a general sense for what’s actually happening when this inference process takes place. Let’s imagine, for example, that I want to compute the probability distribution of the appointment random variable given some evidence, given that I know that there was light rain and no track maintenance. So there’s my evidence, these two variables that I observe the values of. I observe the value of rain. I know there’s light rain. And I know that there is no track maintenance going on today. And what I care about knowing, my query, is this random variable appointment. I want to know the distribution of this random variable appointment, like what is the chance that I’m able to attend my appointment? What is the chance that I miss my appointment given this evidence? And the hidden variable, the information that I don’t have access to, is this variable train. This is information that is not part of the evidence that I see, not something that I observe. But it is also not the query that I’m asking for. And so what might this inference procedure look like? Well, if you recall back from when we were defining conditional probability and doing math with conditional probabilities, we know that a conditional probability is proportional to the joint probability. And we remembered this by recalling that the probability of A given B is just some constant factor alpha multiplied by the probability of A and B. That constant factor alpha turns out to be like dividing over the probability of B. But the important thing is that it’s just some constant multiplied by the joint distribution, the probability that all of these individual things happen. So in this case, I can take the probability of the appointment random variable given light rain and no track maintenance and say that is just going to be proportional, some constant alpha, multiplied by the joint probability, the probability of a particular value for the appointment random variable and light rain and no track maintenance. Well, all right, how do I calculate this, probability of appointment and light rain and no track maintenance, when what I really care about is knowing I need all four of these values to be able to calculate a joint distribution across everything because in a particular appointment depends upon the value of train? Well, in order to do that, here I can begin to use that marginalization trick, that there are only two ways I can get any configuration of an appointment, light rain, and no track maintenance. Either this particular setting of variables happens and the train is on time, or this particular setting of variables happens and the train is delayed. Those are two possible cases that I would want to consider. And if I add those two cases up, well, then I get the result just by adding up all of the possibilities for the hidden variable or variables that there are multiple. But since there’s only one hidden variable here, train, all I need to do is iterate over all the possible values for that hidden variable train and add up their probabilities. So this probability expression here becomes probability distribution over appointment, light, no rain, and train is on time, and the probability distribution over the appointment, light rain, no track maintenance, and that the train is delayed, for example. So I take both of the possible values for train, go ahead and add them up. These are just joint probabilities that we saw earlier, how to calculate just by going parent, parent, parent, parent, and calculating those probabilities and multiplying them together. And then you’ll need to normalize them at the end, speaking at a high level, to make sure that everything adds up to the number 1. So the formula for how you do this in a process known as inference by enumeration looks a little bit complicated, but ultimately it looks like this. And let’s now try to distill what it is that all of these symbols actually mean. Let’s start here. What I care about knowing is the probability of x, my query variable, given some sort of evidence. What do I know about conditional probabilities? Well, a conditional probability is proportional to the joint probability. So it is some alpha, some normalizing constant, multiplied by this joint probability of x and evidence. And how do I calculate that? Well, to do that, I’m going to marginalize over all of the hidden variables, all the variables that I don’t directly observe the values for. I’m basically going to iterate over all of the possibilities that it could happen and just sum them all up. And so I can translate this into a sum over all y, which ranges over all the possible hidden variables and the values that they could take on, and adds up all of those possible individual probabilities. And that is going to allow me to do this process of inference by enumeration. Now, ultimately, it’s pretty annoying if we as humans have to do all this math for ourselves. But turns out this is where computers and AI can be particularly helpful, that we can program a computer to understand a Bayesian network, to be able to understand these inference procedures, and to be able to do these calculations. And using the information you’ve seen here, you could implement a Bayesian network from scratch yourself. But turns out there are a lot of libraries, especially written in Python, that allow us to make it easier to do this sort of probabilistic inference, to be able to take a Bayesian network and do these sorts of calculations, so that you don’t need to know and understand all of the underlying math, though it’s helpful to have a general sense for how it works. But you just need to be able to describe the structure of the network and make queries in order to be able to produce the result. And so let’s take a look at an example of that right now. It turns out that there are a lot of possible libraries that exist in Python for doing this sort of inference. It doesn’t matter too much which specific library you use. They all behave in fairly similar ways. But the library I’m going to use here is one known as pomegranate. And here inside of model.py, I have defined a Bayesian network, just using the structure and the syntax that the pomegranate library expects. And what I’m effectively doing is just, in Python, creating nodes to represent each of the nodes of the Bayesian network that you saw me describe a moment ago. So here on line four, after I’ve imported pomegranate, I’m defining a variable called rain that is going to represent a node inside of my Bayesian network. It’s going to be a node that follows this distribution, where there are three possible values, none for no rain, light for light rain, heavy for heavy rain. And these are the probabilities of each of those taking place. 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. Then after that, we go to the next variable, the variable for track maintenance, for example, which is dependent upon that rain variable. And this, instead of being an unconditional distribution, is a conditional distribution, as indicated by a conditional probability table here. And the idea is that I’m following this is conditional on the distribution of rain. So if there is no rain, then the chance that there is, yes, track maintenance is 0.4. If there’s no rain, the chance that there is no track maintenance is 0.6. Likewise, for light rain, I have a distribution. For heavy rain, I have a distribution as well. But I’m effectively encoding the same information you saw represented graphically a moment ago. But I’m telling this Python program that the maintenance node obeys this particular conditional probability distribution. And we do the same thing for the other random variables as well. Train was a node inside my distribution that was a conditional probability table with two parents. It was dependent not only on rain, but also on track maintenance. And so here I’m saying something like, given that there is no rain and, yes, track maintenance, the probability that my train is on time is 0.8. And the probability that it’s delayed is 0.2. And likewise, I can do the same thing for all of the other possible values of the parents of the train node inside of my Bayesian network by saying, for all of those possible values, here is the distribution that the train node should follow. Then I do the same thing for an appointment based on the distribution of the variable train. Then at the end, what I do is actually construct this network by describing what the states of the network are and by adding edges between the dependent nodes. So I create a new Bayesian network, add states to it, one for rain, one for maintenance, one for the train, one for the appointment. And then I add edges connecting the related pieces. Rain has an arrow to maintenance because rain influences track maintenance. Rain also influences the train. Maintenance also influences the train. And train influences whether I make it to my appointment and bake just finalizes the model and does some additional computation. So the specific syntax of this is not really the important part. Pomegranate just happens to be one of several different libraries that can all be used for similar purposes. And you could describe and define a library for yourself that implemented similar things. But the key idea here is that someone can design a library for a general Bayesian network that has nodes that are based upon its parents. And then all a programmer needs to do using one of those libraries is to define what those nodes and what those probability distributions are. And we can begin to do some interesting logic based on it. So let’s try doing that conditional or joint probability calculation that we saw us do by hand before by going into likelihood.py, where here I’m importing the model that I just defined a moment ago. And here I’d just like to calculate model.probability, which calculates the probability for a given observation. And I’d like to calculate the probability of no rain, no track maintenance, my train is on time, and I’m able to attend the meeting. So sort of the optimal scenario that there is no rain and no maintenance on the track, my train is on time, and I’m able to attend the meeting. What is the probability that all of that actually happens? And I can calculate that using the library and just print out its probability. And so I’ll go ahead and run python of likelihood.py. And I see that, OK, the probability is about 0.34. So about a third of the time, everything goes right for me in this case. No rain, no track maintenance, train is on time, and I’m able to attend the meeting. But I could experiment with this, try and calculate other probabilities as well. What’s the probability that everything goes right up until the train, but I still miss my meeting? So no rain, no track maintenance, train is on time, but I miss the appointment. Let’s calculate that probability. And all right, that has a probability of about 0.04. So about 4% of the time, the train will be on time, there won’t be any rain, no track maintenance, and yet I’ll still miss the meeting. And so this is really just an implementation of the calculation of the joint probabilities that we did before. What this library is likely doing is first figuring out the probability of no rain, then figuring out the probability of no track maintenance given no rain, then the probability that my train is on time given both of these values, and then the probability that I miss my appointment given that I know that the train was on time. So this, again, is the calculation of that joint probability. And turns out we can also begin to have our computer solve inference problems as well, to begin to infer, based on information, evidence that we see, what is the likelihood of other variables also being true. So let’s go into inference.py, for example. We’re here, I’m again importing that exact same model from before, importing all the nodes and all the edges and the probability distribution that is encoded there as well. And now there’s a function for doing some sort of prediction. And here, into this model, I pass in the evidence that I observe. So here, I’ve encoded into this Python program the evidence that I have observed. I have observed the fact that the train is delayed. And that is the value for one of the four random variables inside of this Bayesian network. And using that information, I would like to be able to draw inspiration and figure out inferences about the values of the other random variables that are inside of my Bayesian network. I would like to make predictions about everything else. So all of the actual computational logic is happening in just these three lines, where I’m making this call to this prediction. Down below, I’m just iterating over all of the states and all the predictions and just printing them out so that we can visually see what the results are. But let’s find out, given the train is delayed, what can I predict about the values of the other random variables? Let’s go ahead and run python inference.py. I run that, and all right, here is the result that I get. Given the fact that I know that the train is delayed, this is evidence that I have observed. Well, given that there is a 45% chance or a 46% chance that there was no rain, a 31% chance there was light rain, a 23% chance there was heavy rain, I can see a probability distribution of a track maintenance and a probability distribution over whether I’m able to attend or miss my appointment. Now, we know that whether I attend or miss the appointment, that is only dependent upon the train being delayed or not delayed. It shouldn’t depend on anything else. So let’s imagine, for example, that I knew that there was heavy rain. That shouldn’t affect the distribution for making the appointment. And indeed, if I go up here and add some evidence, say that I know that the value of rain is heavy. That is evidence that I now have access to. I now have two pieces of evidence. I know that the rain is heavy, and I know that my train is delayed. I can calculate the probability by running this inference procedure again and seeing the result. I know that the rain is heavy. I know my train is delayed. The probability distribution for track maintenance changed. Given that I know that there’s heavy rain, now it’s more likely that there is no track maintenance, 88%, as opposed to 64% from here before. And now, what is the probability that I make the appointment? Well, that’s the same as before. It’s still going to be attend the appointment with probability 0.6, missed the appointment with probability 0.4, because it was only dependent upon whether or not my train was on time or delayed. And so this here is implementing that idea of that inference algorithm to be able to figure out, based on the evidence that I have, what can we infer about the values of the other variables that exist as well. So inference by enumeration is one way of doing this inference procedure, just looping over all of the values the hidden variables could take on and figuring out what the probability is. Now, it turns out this is not particularly efficient. And there are definitely optimizations you can make by avoiding repeated work. If you’re calculating the same sort of probability multiple times, there are ways of optimizing the program to avoid having to recalculate the same probabilities again and again. But even then, as the number of variables get large, as the number of possible values of variables could take on, get large, we’re going to start to have to do a lot of computation, a lot of calculation, to be able to do this inference. And at that point, it might start to get unreasonable, in terms of the amount of time that it would take to be able to do this sort of exact inference. And it’s for that reason that oftentimes, when it comes towards probability and things we’re not entirely sure about, we don’t always care about doing exact inference and knowing exactly what the probability is. But if we can approximate the inference procedure, do some sort of approximate inference, that that can be pretty good as well. That if I don’t know the exact probability, but I have a general sense for the probability that I can get increasingly accurate with more time, that that’s probably pretty good, especially if I can get that to happen even faster. So how could I do approximate inference inside of a Bayesian network? Well, one method is through a procedure known as sampling. In the process of sampling, I’m going to take a sample of all of the variables inside of this Bayesian network here. And how am I going to sample? Well, I’m going to sample one of the values from each of these nodes according to their probability distribution. So how might I take a sample of all these nodes? Well, I’ll start at the root. I’ll start with rain. Here’s the distribution for rain. And I’ll go ahead and, using a random number generator or something like it, randomly pick one of these three values. I’ll pick none with probability 0.7, light with probability 0.2, and heavy with probability 0.1. So I’ll randomly just pick one of them according to that distribution. And maybe in this case, I pick none, for example. Then I do the same thing for the other variable. Maintenance also has a probability distribution. And I’m going to sample. Now, there are three probability distributions here. But I’m only going to sample from this first row here, because I’ve observed already in my sample that the value of rain is none. So given that rain is none, I’m going to sample from this distribution to say, all right, what should the value of maintenance be? And in this case, maintenance is going to be, let’s just say yes, which happens 40% of the time in the event that there is no rain, for example. And we’ll sample all of the rest of the nodes in this way as well, that I want to sample from the train distribution. And I’ll sample from this first row here, where there is no rain, but there is track maintenance. And I’ll sample 80% of the time. I’ll say the train is on time. 20% of the time, I’ll say the train is delayed. And finally, we’ll do the same thing for whether I make it to my appointment or not. Did I attend or miss the appointment? We’ll sample based on this distribution and maybe say that in this case, I attend the appointment, which happens 90% of the time when the train is actually on time. So by going through these nodes, I can very quickly just do some sampling and get a sample of the possible values that could come up from going through this entire Bayesian network according to those probability distributions. And where this becomes powerful is if I do this not once, but I do this thousands or tens of thousands of times and generate a whole bunch of samples all using this distribution. I get different samples. Maybe some of them are the same. But I get a value for each of the possible variables that could come up. And so then if I’m ever faced with a question, a question like, what is the probability that the train is on time, you could do an exact inference procedure. This is no different than the inference problem we had before where I could just marginalize, look at all the possible other values of the variables, and do the computation of inference by enumeration to find out this probability exactly. But I could also, if I don’t care about the exact probability, just sample it, approximate it to get close. And this is a powerful tool in AI where we don’t need to be right 100% of the time or we don’t need to be exactly right. If we just need to be right with some probability, we can often do so more effectively, more efficiently. And so if here now are all of those possible samples, I’ll highlight the ones where the train is on time. I’m ignoring the ones where the train is delayed. And in this case, there’s like six out of eight of the samples have the train is arriving on time. And so maybe in this case, I can say that in six out of eight cases, that’s the likelihood that the train is on time. And with eight samples, that might not be a great prediction. But if I had thousands upon thousands of samples, then this could be a much better inference procedure to be able to do these sorts of calculations. So this is a direct sampling method to just do a bunch of samples and then figure out what the probability of some event is. Now, this from before was an unconditional probability. What is the probability that the train is on time? And I did that by looking at all the samples and figuring out, right, here are the ones where the train is on time. But sometimes what I want to calculate is not an unconditional probability, but rather a conditional probability, something like what is the probability that there is light rain, given that the train is on time, something to that effect. And to do that kind of calculation, well, what I might do is here are all the samples that I have. And I want to calculate a probability distribution, given that I know that the train is on time. So to be able to do that, I can kind of look at the two cases where the train was delayed and ignore or reject them, sort of exclude them from the possible samples that I’m considering. And now I want to look at these remaining cases where the train is on time. Here are the cases where there is light rain. And I say, OK, these are two out of the six possible cases. That can give me an approximation for the probability of light rain, given the fact that I know the train was on time. And I did that in almost exactly the same way, just by adding an additional step, by saying that, all right, when I take each sample, let me reject all of the samples that don’t match my evidence and only consider the samples that do match what it is that I have in my evidence that I want to make some sort of calculation about. And it turns out, using the libraries that we’ve had for Bayesian networks, we can begin to implement this same sort of idea, like implement rejection sampling, which is what this method is called, to be able to figure out some probability, not via direct inference, but instead by sampling. So what I have here is a program called sample.py. Imports the exact same model. And what I define first is a program to generate a sample. And the way I generate a sample is just by looping over all of the states. The states need to be in some sort of order to make sure I’m looping in the correct order. But effectively, if it is a conditional distribution, I’m going to sample based on the parents. And otherwise, I’m just going to directly sample the variable, like rain, which has no parents. It’s just an unconditional distribution and keep track of all those parent samples and return the final sample. The exact syntax of this, again, not particularly important. It just happens to be part of the implementation details of this particular library. The interesting logic is down below. Now that I have the ability to generate a sample, if I want to know the distribution of the appointment random variable, given that the train is delayed, well, then I can begin to do calculations like this. Let me take 10,000 samples and assemble all my results in this list called data. I’ll go ahead and loop n times, in this case, 10,000 times. I’ll generate a sample. And I want to know the distribution of appointment, given that the train is delayed. So according to rejection sampling, I’m only going to consider samples where the train is delayed. If the train is not delayed, I’m not going to consider those values at all. So I’m going to say, all right, if I take the sample, look at the value of the train random variable, if the train is delayed, well, let me go ahead and add to my data that I’m collecting the value of the appointment random variable that it took on in this particular sample. So I’m only considering the samples where the train is delayed. And for each of those samples, considering what the value of appointment is, and then at the end, I’m using a Python class called counter, which quickly counts up all the values inside of a data set. So I can take this list of data and figure out how many times was my appointment made and how many times was my appointment missed. And so this here, with just a couple lines of code, is an implementation of rejection sampling. And I can run it by going ahead and running Python sample.py. And when I do that, here is the result I get. This is the result of the counter. 1,251 times, I was able to attend the meeting. And 856 times, I was able to miss the meeting. And you can imagine, by doing more and more samples, I’ll be able to get a better and better, more accurate result. And this is a randomized process. It’s going to be an approximation of the probability. If I run it a different time, you’ll notice the numbers are similar, 12, 72, and 905. But they’re not identical because there’s some randomization, some likelihood that things might be higher or lower. And so this is why we generally want to try and use more samples so that we can have a greater amount of confidence in our result, be more sure about the result that we’re getting of whether or not it accurately reflects or represents the actual underlying probabilities that are inherent inside of this distribution. And so this, then, was an instance of rejection sampling. And it turns out there are a number of other sampling methods that you could use to begin to try to sample. One problem that rejection sampling has is that if the evidence you’re looking for is a fairly unlikely event, well, you’re going to be rejecting a lot of samples. Like if I’m looking for the probability of x given some evidence e, if e is very unlikely to occur, like occurs maybe one every 1,000 times, then I’m only going to be considering 1 out of every 1,000 samples that I do, which is a pretty inefficient method for trying to do this sort of calculation. I’m throwing away a lot of samples. And it takes computational effort to be able to generate those samples. So I’d like to not have to do something like that. So there are other sampling methods that can try and address this. One such sampling method is called likelihood weighting. In likelihood weighting, we follow a slightly different procedure. And the goal is to avoid needing to throw out samples that didn’t match the evidence. And so what we’ll do is we’ll start by fixing the values for the evidence variables. Rather than sample everything, we’re going to fix the values of the evidence variables and not sample those. Then we’re going to sample all the other non-evidence variables in the same way, just using the Bayesian network looking at the probability distributions, sampling all the non-evidence variables. But then what we need to do is weight each sample by its likelihood. If our evidence is really unlikely, we want to make sure that we’ve taken into account how likely was the evidence to actually show up in the sample. If I have a sample where the evidence was much more likely to show up than another sample, then I want to weight the more likely one higher. So we’re going to weight each sample by its likelihood, where likelihood is just defined as the probability of all the evidence. Given all the evidence we have, what is the probability that it would happen in that particular sample? So before, all of our samples were weighted equally. They all had a weight of 1 when we were calculating the overall average. In this case, we’re going to weight each sample, multiply each sample by its likelihood in order to get the more accurate distribution. So what would this look like? Well, if I ask the same question, what is the probability of light rain, given that the train is on time, when I do the sampling procedure and start by trying to sample, I’m going to start by fixing the evidence variable. I’m already going to have in my sample the train is on time. That way, I don’t have to throw out anything. I’m only sampling things where I know the value of the variables that are my evidence are what I expect them to be. So I’ll go ahead and sample from rain. And maybe this time, I sample light rain instead of no rain. Then I’ll sample from track maintenance and say, maybe, yes, there’s track maintenance. Then for train, well, I’ve already fixed it in place. Train was an evidence variable. So I’m not going to bother sampling again. I’ll just go ahead and move on. I’ll move on to appointment and go ahead and sample from appointment as well. So now I’ve generated a sample. I’ve generated a sample by fixing this evidence variable and sampling the other three. And the last step is now weighting the sample. How much weight should it have? And the weight is based on how probable is it that the train was actually on time, this evidence actually happened, given the values of these other variables, light rain and the fact that, yes, there was track maintenance. Well, to do that, I can just go back to the train variable and say, all right, if there was light rain and track maintenance, the likelihood of my evidence, the likelihood that my train was on time, is 0.6. And so this particular sample would have a weight of 0.6. And I could repeat the sampling procedure again and again. Each time every sample would be given a weight according to the probability of the evidence that I see associated with it. And there are other sampling methods that exist as well, but all of them are designed to try and get it the same idea, to approximate the inference procedure of figuring out the value of a variable. So we’ve now dealt with probability as it pertains to particular variables that have these discrete values. But what we haven’t really considered is how values might change over time. That we’ve considered something like a variable for rain, where rain can take on values of none or light rain or heavy rain. But in practice, usually when we consider values for variables like rain, we like to consider it for over time, how do the values of these variables change? What do we do with when we’re dealing with uncertainty over a period of time, which can come up in the context of weather, for example, if I have sunny days and I have rainy days. And I’d like to know not just what is the probability that it’s raining now, but what is the probability that it rains tomorrow, or the day after that, or the day after that. And so to do this, we’re going to introduce a slightly different kind of model. But here, we’re going to have a random variable, not just one for the weather, but for every possible time step. And you can define time step however you like. A simple way is just to use days as your time step. And so we can define a variable called x sub t, which is going to be the weather at time t. So x sub 0 might be the weather on day 0. x sub 1 might be the weather on day 1, so on and so forth. x sub 2 is the weather on day 2. But as you can imagine, if we start to do this over longer and longer periods of time, there’s an incredible amount of data that might go into this. If you’re keeping track of data about the weather for a year, now suddenly you might be trying to predict the weather tomorrow, given 365 days of previous pieces of evidence. And that’s a lot of evidence to have to deal with and manipulate and calculate. Probably nobody knows what the exact conditional probability distribution is for all of those combinations of variables. And so when we’re trying to do this inference inside of a computer, when we’re trying to reasonably do this sort of analysis, it’s helpful to make some simplifying assumptions, some assumptions about the problem that we can just assume are true, to make our lives a little bit easier. Even if they’re not totally accurate assumptions, if they’re close to accurate or approximate, they’re usually pretty good. And the assumption we’re going to make is called the Markov assumption, which is the assumption that the current state depends only on a finite fixed number of previous states. So the current day’s weather depends not on all the previous day’s weather for the rest of all of history, but the current day’s weather I can predict just based on yesterday’s weather, or just based on the last two days weather, or the last three days weather. But oftentimes, we’re going to deal with just the one previous state that helps to predict this current state. And by putting a whole bunch of these random variables together, using this Markov assumption, we can create what’s called a Markov chain, where a Markov chain is just some sequence of random variables where each of the variables distribution follows that Markov assumption. And so we’ll do an example of this where the Markov assumption is, I can predict the weather. Is it sunny or rainy? And we’ll just consider those two possibilities for now, even though there are other types of weather. But I can predict each day’s weather just on the prior day’s weather, using today’s weather, I can come up with a probability distribution for tomorrow’s weather. And here’s what this weather might look like. It’s formatted in terms of a matrix, as you might describe it, as rows and columns of values, where on the left-hand side, I have today’s weather, represented by the variable x sub t. And over here in the columns, I have tomorrow’s weather, represented by the variable x sub t plus 1, t plus 1 day’s weather instead. And what this matrix is saying is, if today is sunny, well, then it’s more likely than not that tomorrow is also sunny. Oftentimes, the weather stays consistent for multiple days in a row. And for example, let’s say that if today is sunny, our model says that tomorrow, with probability 0.8, it will also be sunny. And with probability 0.2, it will be raining. And likewise, if today is raining, then it’s more likely than not that tomorrow is also raining. With probability 0.7, it’ll be raining. With probability 0.3, it will be sunny. So this matrix, this description of how it is we transition from one state to the next state is what we’re going to call the transition model. And using the transition model, you can begin to construct this Markov chain by just predicting, given today’s weather, what’s the likelihood of tomorrow’s weather happening. And you can imagine doing a similar sampling procedure, where you take this information, you sample what tomorrow’s weather is going to be. Using that, you sample the next day’s weather. And the result of that is you can form this Markov chain of like x0, time and time, day zero is sunny, the next day is sunny, maybe the next day it changes to raining, then raining, then raining. And the pattern that this Markov chain follows, given the distribution that we had access to, this transition model here, is that when it’s sunny, it tends to stay sunny for a little while. The next couple of days tend to be sunny too. And when it’s raining, it tends to be raining as well. And so you get a Markov chain that looks like this, and you can do analysis on this. You can say, given that today is raining, what is the probability that tomorrow is raining? Or you can begin to ask probability questions like, what is the probability of this sequence of five values, sun, sun, rain, rain, rain, and answer those sorts of questions too. And it turns out there are, again, many Python libraries for interacting with models like this of probabilities that have distributions and random variables that are based on previous variables according to this Markov assumption. And pomegranate2 has ways of dealing with these sorts of variables. So I’ll go ahead and go into the chain directory, where I have some information about Markov chains. And here, I’ve defined a file called model.py, where I’ve defined in a very similar syntax. And again, the exact syntax doesn’t matter so much as the idea that I’m encoding this information into a Python program so that the program has access to these distributions. I’ve here defined some starting distribution. So every Markov model begins at some point in time, and I need to give it some starting distribution. And so we’ll just say, you know at the start, you can pick 50-50 between sunny and rainy. We’ll say it’s sunny 50% of the time, rainy 50% of the time. And then down below, I’ve here defined the transition model, how it is that I transition from one day to the next. And here, I’ve encoded that exact same matrix from before, that if it was sunny today, then with probability 0.8, it will be sunny tomorrow. And it’ll be rainy tomorrow with probability 0.2. And I likewise have another distribution for if it was raining today instead. And so that alone defines the Markov model. You can begin to answer questions using that model. But one thing I’ll just do is sample from the Markov chain. It turns out there is a method built into this Markov chain library that allows me to sample 50 states from the chain, basically just simulating like 50 instances of weather. And so let me go ahead and run this. Python model.py. And when I run it, what I get is that it’s going to sample from this Markov chain 50 states, 50 days worth of weather that it’s just going to randomly sample. And you can imagine sampling many times to be able to get more data, to be able to do more analysis. But here, for example, it’s sunny two days in a row, rainy a whole bunch of days in a row before it changes back to sun. And so you get this model that follows the distribution that we originally described, that follows the distribution of sunny days tend to lead to more sunny days. Rainy days tend to lead to more rainy days. And that then is a Markov model. And Markov models rely on us knowing the values of these individual states. I know that today is sunny or that today is raining. And using that information, I can draw some sort of inference about what tomorrow is going to be like. But in practice, this often isn’t the case. It often isn’t the case that I know for certain what the exact state of the world is. Oftentimes, the state of the world is exactly unknown. But I’m able to somehow sense some information about that state, that a robot or an AI doesn’t have exact knowledge about the world around it. But it has some sort of sensor, whether that sensor is a camera or sensors that detect distance or just a microphone that is sensing audio, for example. It is sensing data. And using that data, that data is somehow related to the state of the world, even if it doesn’t actually know, our AI doesn’t know, what the underlying true state of the world actually is. And for that, we need to get into the world of sensor models, the way of describing how it is that we translate what the hidden state, the underlying true state of the world, is with what the observation, what it is that the AI knows or the AI has access to, actually is. And so for example, a hidden state might be a robot’s position. If a robot is exploring new uncharted territory, the robot likely doesn’t know exactly where it is. But it does have an observation. It has robot sensor data, where it can sense how far away are possible obstacles around it. And using that information, using the observed information that it has, it can infer something about the hidden state. Because what the true hidden state is influences those observations. Whatever the robot’s true position is affects or has some effect upon what the sensor data of the robot is able to collect is, even if the robot doesn’t actually know for certain what its true position is. Likewise, if you think about a voice recognition or a speech recognition program that listens to you and is able to respond to you, something like Alexa or what Apple and Google are doing with their voice recognition as well, that you might imagine that the hidden state, the underlying state, is what words are actually spoken. The true nature of the world contains you saying a particular sequence of words, but your phone or your smart home device doesn’t know for sure exactly what words you said. The only observation that the AI has access to is some audio waveforms. And those audio waveforms are, of course, dependent upon this hidden state. And you can infer, based on those audio waveforms, what the words spoken likely were. But you might not know with 100% certainty what that hidden state actually is. And it might be a task to try and predict, given this observation, given these audio waveforms, can you figure out what the actual words spoken are. And likewise, you might imagine on a website, true user engagement. Might be information you don’t directly have access to. But you can observe data, like website or app analytics, about how often was this button clicked or how often are people interacting with a page in a particular way. And you can use that to infer things about your users as well. So this type of problem comes up all the time when we’re dealing with AI and trying to infer things about the world. That often AI doesn’t really know the hidden true state of the world. All the AI has access to is some observation that is related to the hidden true state. But it’s not direct. There might be some noise there. The audio waveform might have some additional noise that might be difficult to parse. The sensor data might not be exactly correct. There’s some noise that might not allow you to conclude with certainty what the hidden state is, but can allow you to infer what it might be. And so the simple example we’ll take a look at here is imagining the hidden state as the weather, whether it’s sunny or rainy or not. And imagine you are programming an AI inside of a building that maybe has access to just a camera to inside the building. And all you have access to is an observation as to whether or not employees are bringing an umbrella into the building or not. You can detect whether it’s an umbrella or not. And so you might have an observation as to whether or not an umbrella is brought into the building or not. And using that information, you want to predict whether it’s sunny or rainy, even if you don’t know what the underlying weather is. So the underlying weather might be sunny or rainy. And if it’s raining, obviously people are more likely to bring an umbrella. And so whether or not people bring an umbrella, your observation, tells you something about the hidden state. And of course, this is a bit of a contrived example, but the idea here is to think about this more broadly in terms of more generally, any time you observe something, it having to do with some underlying hidden state. And so to try and model this type of idea where we have these hidden states and observations, rather than just use a Markov model, which has state, state, state, state, each of which is connected by that transition matrix that we described before, we’re going to use what we call a hidden Markov model. Very similar to a Markov model, but this is going to allow us to model a system that has hidden states that we don’t directly observe, along with some observed event that we do actually see. And so in addition to that transition model that we still need of saying, given the underlying state of the world, if it’s sunny or rainy, what’s the probability of tomorrow’s weather? We also need another model that, given some state, is going to give us an observation of green, yes, someone brings an umbrella into the office, or red, no, nobody brings umbrellas into the office. And so the observation might be that if it’s sunny, then odds are nobody is going to bring an umbrella to the office. But maybe some people are just being cautious, and they do bring an umbrella to the office anyways. And if it’s raining, then with much higher probability, then people are going to bring umbrellas into the office. But maybe if the rain was unexpected, people didn’t bring an umbrella. And so it might have some other probability as well. And so using the observations, you can begin to predict with reasonable likelihood what the underlying state is, even if you don’t actually get to observe the underlying state, if you don’t get to see what the hidden state is actually equal to. This here we’ll often call the sensor model. It’s also often called the emission probabilities, because the state, the underlying state, emits some sort of emission that you then observe. And so that can be another way of describing that same idea. And the sensor Markov assumption that we’re going to use is this assumption that the evidence variable, the thing we observe, the emission that gets produced, depends only on the corresponding state, meaning it can predict whether or not people will bring umbrellas or not entirely dependent just on whether it is sunny or rainy today. Of course, again, this assumption might not hold in practice, that in practice, it might depend whether or not people bring umbrellas, might depend not just on today’s weather, but also on yesterday’s weather and the day before. But for simplification purposes, it can be helpful to apply this sort of assumption just to allow us to be able to reason about these probabilities a little more easily. And if we’re able to approximate it, we can still often get a very good answer. And so what these hidden Markov models end up looking like is a little something like this, where now, rather than just have one chain of states, like sun, sun, rain, rain, rain, we instead have this upper level, which is the underlying state of the world. Is it sunny or is it rainy? And those are connected by that transition matrix we described before. But each of these states produces an emission, produces an observation that I see, that on this day, it was sunny and people didn’t bring umbrellas. And on this day, it was sunny, but people did bring umbrellas. And on this day, it was raining and people did bring umbrellas, and so on and so forth. And so each of these underlying states represented by x sub t for x sub 1, 0, 1, 2, so on and so forth, produces some sort of observation or emission, which is what the e stands for, e sub 0, e sub 1, e sub 2, so on and so forth. And so this, too, is a way of trying to represent this idea. And what you want to think about is that these underlying states are the true nature of the world, the robot’s position as it moves over time, and that produces some sort of sensor data that might be observed, or what people are actually saying and using the emission data of what audio waveforms do you detect in order to process that data and try and figure it out. And there are a number of possible tasks that you might want to do given this kind of information. And one of the simplest is trying to infer something about the future or the past or about these sort of hidden states that might exist. And so the tasks that you’ll often see, and we’re not going to go into the mathematics of these tasks, but they’re all based on the same idea of conditional probabilities and using the probability distributions we have to draw these sorts of conclusions. One task is called filtering, which is given observations from the start until now, calculate the distribution for the current state, meaning given information about from the beginning of time until now, on which days do people bring an umbrella or not bring an umbrella, can I calculate the probability of the current state that today, is it sunny or is it raining? Another task that might be possible is prediction, which is looking towards the future. Given observations about people bringing umbrellas from the beginning of when we started counting time until now, can I figure out the distribution that tomorrow is it sunny or is it raining? And you can also go backwards as well by a smoothing, where I can say given observations from start until now, calculate the distributions for some past state. Like I know that today people brought umbrellas and tomorrow people brought umbrellas. And so given two days worth of data of people bringing umbrellas, what’s the probability that yesterday it was raining? And that I know that people brought umbrellas today, that might inform that decision as well. It might influence those probabilities. And there’s also a most likely explanation task, in addition to other tasks that might exist as well, which is combining some of these given observations from the start up until now, figuring out the most likely sequence of states. And this is what we’re going to take a look at now, this idea that if I have all these observations, umbrella, no umbrella, umbrella, no umbrella, can I calculate the most likely states of sun, rain, sun, rain, and whatnot that actually represented the true weather that would produce these observations? And this is quite common when you’re trying to do something like voice recognition, for example, that you have these emissions of the audio waveforms, and you would like to calculate based on all of the observations that you have, what is the most likely sequence of actual words, or syllables, or sounds that the user actually made when they were speaking to this particular device, or other tasks that might come up in that context as well. And so we can try this out by going ahead and going into the HMM directory, HMM for Hidden Markov Model. And here, what I’ve done is I’ve defined a model where this model first defines my possible state, sun, and rain, along with their emission probabilities, the observation model, or the emission model, where here, given that I know that it’s sunny, the probability that I see people bring an umbrella is 0.2, the probability of no umbrella is 0.8. And likewise, if it’s raining, then people are more likely to bring an umbrella. Umbrella has probability 0.9, no umbrella has probability 0.1. So the actual underlying hidden states, those states are sun and rain, but the things that I observe, the observations that I can see, are either umbrella or no umbrella as the things that I observe as a result. So this then, I also need to add to it a transition matrix, same as before, saying that if today is sunny, then tomorrow is more likely to be sunny. And if today is rainy, then tomorrow is more likely to be raining. As of before, I give it some starting probabilities, saying at first, 50-50 chance for whether it’s sunny or rainy. And then I can create the model based on that information. Again, the exact syntax of this is not so important, so much as it is the data that I am now encoding into a program, such that now I can begin to do some inference. So I can give my program, for example, a list of observations, umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, no umbrella, no umbrella. And I would like to calculate, I would like to figure out the most likely explanation for these observations. What is likely is whether rain, rain, is this rain, or is it more likely that this was actually sunny, and then it switched back to it being rainy? And that’s an interesting question. We might not be sure, because it might just be that it just so happened on this rainy day, people decided not to bring an umbrella. Or it could be that it switched from rainy to sunny back to rainy, which doesn’t seem too likely, but it certainly could happen. And using the data we give to the hidden Markov model, our model can begin to predict these answers, can begin to figure it out. So we’re going to go ahead and just predict these observations. And then for each of those predictions, go ahead and print out what the prediction is. And this library just so happens to have a function called predict that does this prediction process for me. So I’ll run python sequence.py. And the result I get is this. This is the prediction based on the observations of what all of those states are likely to be. And it’s likely to be rain and rain. In this case, it thinks that what most likely happened is that it was sunny for a day and then went back to being rainy. But in different situations, if it was rainy for longer maybe, or if the probabilities were slightly different, you might imagine that it’s more likely that it was rainy all the way through. And it just so happened on one rainy day, people decided not to bring umbrellas. And so here, too, Python libraries can begin to allow for the sort of inference procedure. And by taking what we know and by putting it in terms of these tasks that already exist, these general tasks that work with hidden Markov models, then any time we can take an idea and formulate it as a hidden Markov model, formulate it as something that has hidden states and observed emissions that result from those states, then we can take advantage of these algorithms that are known to exist for trying to do this sort of inference. So now we’ve seen a couple of ways that AI can begin to deal with uncertainty. We’ve taken a look at probability and how we can use probability to describe numerically things that are likely or more likely or less likely to happen than other events or other variables. And using that information, we can begin to construct these standard types of models, things like Bayesian networks and Markov chains and hidden Markov models that all allow us to be able to describe how particular events relate to other events or how the values of particular variables relate to other variables, not for certain, but with some sort of probability distribution. And by formulating things in terms of these models that already exist, we can take advantage of Python libraries that implement these sort of models already and allow us just to be able to use them to produce some sort of resulting effect. So all of this then allows our AI to begin to deal with these sort of uncertain problems so that our AI doesn’t need to know things for certain but can infer based on information it doesn’t know. Next time, we’ll take a look at additional types of problems that we can solve by taking advantage of AI-related algorithms, even beyond the world of the types of problems we’ve already explored. We’ll see you next time. OK. Welcome back, everyone, to an introduction to artificial intelligence with Python. And now, so far, we’ve taken a look at a couple of different types of problems. We’ve seen classical search problems where we’re trying to get from an initial state to a goal by figuring out some optimal path. We’ve taken a look at adversarial search where we have a game-playing agent that is trying to make the best move. We’ve seen knowledge-based problems where we’re trying to use logic and inference to be able to figure out and draw some additional conclusions. And we’ve seen some probabilistic models as well where we might not have certain information about the world, but we want to use the knowledge about probabilities that we do have to be able to draw some conclusions. Today, we’re going to turn our attention to another category of problems generally known as optimization problems, where optimization is really all about choosing the best option from a set of possible options. And we’ve already seen optimization in some contexts, like game-playing, where we’re trying to create an AI that chooses the best move out of a set of possible moves. But what we’ll take a look at today is a category of types of problems and algorithms to solve them that can be used in order to deal with a broader range of potential optimization problems. And the first of the algorithms that we’ll take a look at is known as a local search. And local search differs from search algorithms we’ve seen before in the sense that the search algorithms we’ve looked at so far, which are things like breadth-first search or A-star search, for example, generally maintain a whole bunch of different paths that we’re simultaneously exploring, and we’re looking at a bunch of different paths at once trying to find our way to the solution. On the other hand, in local search, this is going to be a search algorithm that’s really just going to maintain a single node, looking at a single state. And we’ll generally run this algorithm by maintaining that single node and then moving ourselves to one of the neighboring nodes throughout this search process. And this is generally useful in context not like these problems, which we’ve seen before, like a maze-solving situation where we’re trying to find our way from the initial state to the goal by following some path. But local search is most applicable when we really don’t care about the path at all, and all we care about is what the solution is. And in the case of solving a maze, the solution was always obvious. You could point to the solution. You know exactly what the goal is, and the real question is, what is the path to get there? But local search is going to come up in cases where figuring out exactly what the solution is, exactly what the goal looks like, is actually the heart of the challenge. And to give an example of one of these kinds of problems, we’ll consider a scenario where we have two types of buildings, for example. We have houses and hospitals. And our goal might be in a world that’s formatted as this grid, where we have a whole bunch of houses, a house here, house here, two houses over there, maybe we want to try and find a way to place two hospitals on this map. So maybe a hospital here and a hospital there. And the problem now is we want to place two hospitals on the map, but we want to do so with some sort of objective. And our objective in this case is to try and minimize the distance of any of the houses from a hospital. So you might imagine, all right, what’s the distance from each of the houses to their nearest hospital? There are a number of ways we could calculate that distance. But one way is using a heuristic we’ve looked at before, which is the Manhattan distance, this idea of how many rows and columns would you have to move inside of this grid layout in order to get to a hospital, for example. And it turns out, if you take each of these four houses and figure out, all right, how close are they to their nearest hospital, you get something like this, where this house is three away from a hospital, this house is six away, and these two houses are each four away. And if you add all those numbers up together, you get a total cost of 17, for example. So for this particular configuration of hospitals, a hospital here and a hospital there, that state, we might say, has a cost of 17. And the goal of this problem now that we would like to apply a search algorithm to figure out is, can you solve this problem to find a way to minimize that cost? Minimize the total amount if you sum up all of the distances from all the houses to the nearest hospital. How can we minimize that final value? And if we think about this problem a little bit more abstractly, abstracting away from this specific problem and thinking more generally about problems like it, you can often formulate these problems by thinking about them as a state-space landscape, as we’ll soon call it. Here in this diagram of a state-space landscape, each of these vertical bars represents a particular state that our world could be in. So for example, each of these vertical bars represents a particular configuration of two hospitals. And the height of this vertical bar is generally going to represent some function of that state, some value of that state. So maybe in this case, the height of the vertical bar represents what is the cost of this particular configuration of hospitals in terms of what is the sum total of all the distances from all of the houses to their nearest hospital. And generally speaking, when we have a state-space landscape, we want to do one of two things. We might be trying to maximize the value of this function, trying to find a global maximum, so to speak, of this state-space landscape, a single state whose value is higher than all of the other states that we could possibly choose from. And generally in this case, when we’re trying to find a global maximum, we’ll call the function that we’re trying to optimize some objective function, some function that measures for any given state how good is that state, such that we can take any state, pass it into the objective function, and get a value for how good that state is. And ultimately, what our goal is is to find one of these states that has the highest possible value for that objective function. An equivalent but reversed problem is the problem of finding a global minimum, some state that has a value after you pass it into this function that is lower than all of the other possible values that we might choose from. And generally speaking, when we’re trying to find a global minimum, we call the function that we’re calculating a cost function. Generally, each state has some sort of cost, whether that cost is a monetary cost, or a time cost, or in the case of the houses and hospitals, we’ve been looking at just now, a distance cost in terms of how far away each of the houses is from a hospital. And we’re trying to minimize the cost, find the state that has the lowest possible value of that cost. So these are the general types of ideas we might be trying to go for within a state space landscape, trying to find a global maximum, or trying to find a global minimum. And how exactly do we do that? We’ll recall that in local search, we generally operate this algorithm by maintaining just a single state, just some current state represented inside of some node, maybe inside of a data structure, where we’re keeping track of where we are currently. And then ultimately, what we’re going to do is from that state, move to one of its neighbor states. So in this case, represented in this one-dimensional space by just the state immediately to the left or to the right of it. But for any different problem, you might define what it means for there to be a neighbor of a particular state. In the case of a hospital, for example, that we were just looking at, a neighbor might be moving one hospital one space to the left or to the right or up or down. Some state that is close to our current state, but slightly different, and as a result, might have a slightly different value in terms of its objective function or in terms of its cost function. So this is going to be our general strategy in local search, to be able to take a state, maintaining some current node, and move where we’re looking at in the state space landscape in order to try to find a global maximum or a global minimum somehow. And perhaps the simplest of algorithms that we could use to implement this idea of local search is an algorithm known as hill climbing. And the basic idea of hill climbing is, let’s say I’m trying to maximize the value of my state. I’m trying to figure out where the global maximum is. I’m going to start at a state. And generally, what hill climbing is going to do is it’s going to consider the neighbors of that state, that from this state, all right, I could go left or I could go right, and this neighbor happens to be higher and this neighbor happens to be lower. And in hill climbing, if I’m trying to maximize the value, I’ll generally pick the highest one I can between the state to the left and right of me. This one is higher. So I’ll go ahead and move myself to consider that state instead. And then I’ll repeat this process, continually looking at all of my neighbors and picking the highest neighbor, doing the same thing, looking at my neighbors, picking the highest of my neighbors, until I get to a point like right here, where I consider both of my neighbors and both of my neighbors have a lower value than I do. This current state has a value that is higher than any of its neighbors. And at that point, the algorithm terminates. And I can say, all right, here I have now found the solution. And the same thing works in exactly the opposite way for trying to find a global minimum. But the algorithm is fundamentally the same. If I’m trying to find a global minimum and say my current state starts here, I’ll continually look at my neighbors, pick the lowest value that I possibly can, until I eventually, hopefully, find that global minimum, a point at which when I look at both of my neighbors, they each have a higher value. And I’m trying to minimize the total score or cost or value that I get as a result of calculating some sort of cost function. So we can formulate this graphical idea in terms of pseudocode. And the pseudocode for hill climbing might look like this. We define some function called hill climb that takes as input the problem that we’re trying to solve. And generally, we’re going to start in some sort of initial state. So I’ll start with a variable called current that is keeping track of my initial state, like an initial configuration of hospitals. And maybe some problems lend themselves to an initial state, some place where you begin. In other cases, maybe not, in which case we might just randomly generate some initial state, just by choosing two locations for hospitals at random, for example, and figuring out from there how we might be able to improve. But that initial state, we’re going to store inside of current. And now, here comes our loop, some repetitive process we’re going to do again and again until the algorithm terminates. And what we’re going to do is first say, let’s figure out all of the neighbors of the current state. From my state, what are all of the neighboring states for some definition of what it means to be a neighbor? And I’ll go ahead and choose the highest value of all of those neighbors and save it inside of this variable called neighbor. So keep track of the highest-valued neighbor. This is in the case where I’m trying to maximize the value. In the case where I’m trying to minimize the value, you might imagine here, you’ll pick the neighbor with the lowest possible value. But these ideas are really fundamentally interchangeable. And it’s possible, in some cases, there might be multiple neighbors that each have an equally high value or an equally low value in the minimizing case. And in that case, we can just choose randomly from among them. Choose one of them and save it inside of this variable neighbor. And then the key question to ask is, is this neighbor better than my current state? And if the neighbor, the best neighbor that I was able to find, is not better than my current state, well, then the algorithm is over. And I’ll just go ahead and return the current state. If none of my neighbors are better, then I may as well stay where I am, is the general logic of the hill climbing algorithm. But otherwise, if the neighbor is better, then I may as well move to that neighbor. So you might imagine setting current equal to neighbor, where the general idea is if I’m at a current state and I see a neighbor that is better than me, then I’ll go ahead and move there. And then I’ll repeat the process, continually moving to a better neighbor until I reach a point at which none of my neighbors are better than I am. And at that point, we’d say the algorithm can just terminate there. So let’s take a look at a real example of this with these houses and hospitals. So we’ve seen now that if we put the hospitals in these two locations, that has a total cost of 17. And now we need to define, if we’re going to implement this hill climbing algorithm, what it means to take this particular configuration of hospitals, this particular state, and get a neighbor of that state. And a simple definition of neighbor might be just, let’s pick one of the hospitals and move it by one square, the left or right or up or down, for example. And that would mean we have six possible neighbors from this particular configuration. We could take this hospital and move it to any of these three possible squares, or we take this hospital and move it to any of those three possible squares. And each of those would generate a neighbor. And what I might do is say, all right, here’s the locations and the distances between each of the houses and their nearest hospital. Let me consider all of the neighbors and see if any of them can do better than a cost of 17. And it turns out there are a couple of ways that we could do that. And it doesn’t matter if we randomly choose among all the ways that are the best. But one such possible way is by taking a look at this hospital here and considering the directions in which it might move. If we hold this hospital constant, if we take this hospital and move it one square up, for example, that doesn’t really help us. It gets closer to the house up here, but it gets further away from the house down here. And it doesn’t really change anything for the two houses along the left-hand side. But if we take this hospital on the right and move it one square down, it’s the opposite problem. It gets further away from the house up above, and it gets closer to the house down below. The real idea, the goal should be to be able to take this hospital and move it one square to the left. By moving it one square to the left, we move it closer to both of these houses on the right without changing anything about the houses on the left. For them, this hospital is still the closer one, so they aren’t affected. So we’re able to improve the situation by picking a neighbor that results in a decrease in our total cost. And so we might do that. Move ourselves from this current state to a neighbor by just taking that hospital and moving it. And at this point, there’s not a whole lot that can be done with this hospital. But there’s still other optimizations we can make, other neighbors we can move to that are going to have a better value. If we consider this hospital, for example, we might imagine that right now it’s a bit far up, that both of these houses are a little bit lower. So we might be able to do better by taking this hospital and moving it one square down, moving it down so that now instead of a cost of 15, we’re down to a cost of 13 for this particular configuration. And we can do even better by taking the hospital and moving it one square to the left. Now instead of a cost of 13, we have a cost of 11, because this house is one away from the hospital. This one is four away. This one is three away. And this one is also three away. So we’ve been able to do much better than that initial cost that we had using the initial configuration. Just by taking every state and asking ourselves the question, can we do better by just making small incremental changes, moving to a neighbor, moving to a neighbor, and moving to a neighbor after that? And now at this point, we can potentially see that at this point, the algorithm is going to terminate. There’s actually no neighbor we can move to that is going to improve the situation, get us a cost that is less than 11. Because if we take this hospital and move it upper to the right, well, that’s going to make it further away. If we take it and move it down, that doesn’t really change the situation. It gets further away from this house but closer to that house. And likewise, the same story was true for this hospital. Any neighbor we move it to, up, left, down, or right, is either going to make it further away from the houses and increase the cost, or it’s going to have no effect on the cost whatsoever. And so the question we might now ask is, is this the best we could do? Is this the best placement of the hospitals we could possibly have? And it turns out the answer is no, because there’s a better way that we could place these hospitals. And in particular, there are a number of ways you could do this. But one of the ways is by taking this hospital here and moving it to this square, for example, moving it diagonally by one square, which was not part of our definition of neighbor. We could only move left, right, up, or down. But this is, in fact, better. It has a total cost of 9. It is now closer to both of these houses. And as a result, the total cost is less. But we weren’t able to find it, because in order to get there, we had to go through a state that actually wasn’t any better than the current state that we had been on previously. And so this appears to be a limitation, or a concern you might have as you go about trying to implement a hill climbing algorithm, is that it might not always give you the optimal solution. If we’re trying to maximize the value of any particular state, we’re trying to find the global maximum, a concern might be that we could get stuck at one of the local maxima, highlighted here in blue, where a local maxima is any state whose value is higher than any of its neighbors. If we ever find ourselves at one of these two states when we’re trying to maximize the value of the state, we’re not going to make any changes. We’re not going to move left or right. We’re not going to move left here, because those states are worse. But yet, we haven’t found the global optimum. We haven’t done as best as we could do. And likewise, in the case of the hospitals, what we’re ultimately trying to do is find a global minimum, find a value that is lower than all of the others. But we have the potential to get stuck at one of the local minima, any of these states whose value is lower than all of its neighbors, but still not as low as the local minima. And so the takeaway here is that it’s not always going to be the case that when we run this naive hill climbing algorithm, that we’re always going to find the optimal solution. There are things that could go wrong. If we started here, for example, and tried to maximize our value as much as possible, we might move to the highest possible neighbor, move to the highest possible neighbor, move to the highest possible neighbor, and stop, and never realize that there’s actually a better state way over there that we could have gone to instead. And other problems you might imagine just by taking a look at this state space landscape are these various different types of plateaus, something like this flat local maximum here, where all six of these states each have the exact same value. And so in the case of the algorithm we showed before, none of the neighbors are better, so we might just get stuck at this flat local maximum. And even if you allowed yourself to move to one of the neighbors, it wouldn’t be clear which neighbor you would ultimately move to, and you could get stuck here as well. And there’s another one over here. This one is called a shoulder. It’s not really a local maximum, because there’s still places where we can go higher, not a local minimum, because we can go lower. So we can still make progress, but it’s still this flat area, where if you have a local search algorithm, there’s potential to get lost here, unable to make some upward or downward progress, depending on whether we’re trying to maximize or minimize it, and therefore another potential for us to be able to find a solution that might not actually be the optimal solution. And so because of this potential, the potential that hill climbing has to not always find us the optimal result, it turns out there are a number of different varieties and variations on the hill climbing algorithm that help to solve the problem better depending on the context, and depending on the specific type of problem, some of these variants might be more applicable than others. What we’ve taken a look at so far is a version of hill climbing generally called steepest ascent hill climbing, where the idea of steepest ascent hill climbing is we are going to choose the highest valued neighbor, in the case where we’re trying to maximize or the lowest valued neighbor in cases where we’re trying to minimize. But generally speaking, if I have five neighbors and they’re all better than my current state, I will pick the best one of those five. Now, sometimes that might work pretty well. It’s sort of a greedy approach of trying to take the best operation at any particular time step, but it might not always work. There might be cases where actually I want to choose an option that is slightly better than me, but maybe not the best one because that later on might lead to a better outcome ultimately. So there are other variants that we might consider of this basic hill climbing algorithm. One is known as stochastic hill climbing. And in this case, we choose randomly from all of our higher value neighbors. So if I’m at my current state and there are five neighbors that are all better than I am, rather than choosing the best one, as steep as the set would do, stochastic will just choose randomly from one of them, thinking that if it’s better, then it’s better. And maybe there’s a potential to make forward progress, even if it is not locally the best option I could possibly choose. First choice hill climbing ends up just choosing the very first highest valued neighbor that it follows, behaving on a similar idea, rather than consider all of the neighbors. As soon as we find a neighbor that is better than our current state, we’ll go ahead and move there. There may be some efficiency improvements there and maybe has the potential to find a solution that the other strategies weren’t able to find. And with all of these variants, we still suffer from the same potential risk, this risk that we might end up at a local minimum or a local maximum. And we can reduce that risk by repeating the process multiple times. So one variant of hill climbing is random restart hill climbing, where the general idea is we’ll conduct hill climbing multiple times. If we apply steepest descent hill climbing, for example, we’ll start at some random state, try and figure out how to solve the problem and figure out what is the local maximum or local minimum we get to. And then we’ll just randomly restart and try again, choose a new starting configuration, try and figure out what the local maximum or minimum is, and do this some number of times. And then after we’ve done it some number of times, we can pick the best one out of all of the ones that we’ve taken a look at. So there’s another option we have access to as well. And then, although I said that generally local search will usually just keep track of a single node and then move to one of its neighbors, there are variants of hill climbing that are known as local beam searches, where rather than keep track of just one current best state, we’re keeping track of k highest valued neighbors, such that rather than starting at one random initial configuration, I might start with 3 or 4 or 5, randomly generate all the neighbors, and then pick the 3 or 4 or 5 best of all of the neighbors that I find, and continually repeat this process, with the idea being that now I have more options that I’m considering, more ways that I could potentially navigate myself to the optimal solution that might exist for a particular problem. So let’s now take a look at some actual code that can implement some of these kinds of ideas, something like steepest ascent hill climbing, for example, for trying to solve this hospital problem. So I’m going to go ahead and go into my hospitals directory, where I’ve actually set up the basic framework for solving this type of problem. I’ll go ahead and go into hospitals.py, and we’ll take a look at the code we’ve created here. I’ve defined a class that is going to represent the state space. So the space has a height, and a width, and also some number of hospitals. So you can configure how big is your map, how many hospitals should go here. We have a function for adding a new house to the state space, and then some functions that are going to get me all of the available spaces for if I want to randomly place hospitals in particular locations. And here now is the hill climbing algorithm. So what are we going to do in the hill climbing algorithm? Well, we’re going to start by randomly initializing where the hospitals are going to go. We don’t know where the hospitals should actually be, so let’s just randomly place them. So here I’m running a loop for each of the hospitals that I have. I’m going to go ahead and add a new hospital at some random location. So I basically get all of the available spaces, and I randomly choose one of them as where I would like to add this particular hospital. I have some logging output and generating some images, which we’ll take a look at a little bit later. But here is the key idea. So I’m going to just keep repeating this algorithm. I could specify a maximum of how many times I want it to run, or I could just run it up until it hits a local maximum or local minimum. And now we’ll basically consider all of the hospitals that could potentially move. So consider each of the two hospitals or more hospitals if they’re more than that. And consider all of the places where that hospital could move to, some neighbor of that hospital that we can move the neighbor to. And then see, is this going to be better than where we were currently? So if it is going to be better, then we’ll go ahead and update our best neighbor and keep track of this new best neighbor that we found. And then afterwards, we can ask ourselves the question, if best neighbor cost is greater than or equal to the cost of the current set of hospitals, meaning if the cost of our best neighbor is greater than the current cost, meaning our best neighbor is worse than our current state, well, then we shouldn’t make any changes at all. And we should just go ahead and return the current set of hospitals. But otherwise, we can update our hospitals in order to change them to one of the best neighbors. And if there are multiple that are all equivalent, I’m here using random.choice to say go ahead and choose one randomly. So this is really just a Python implementation of that same idea that we were just talking about, this idea of taking a current state, some current set of hospitals, generating all of the neighbors, looking at all of the ways we could take one hospital and move it one square to the left or right or up or down, and then figuring out, based on all of that information, which is the best neighbor or the set of all the best neighbors, and then choosing from one of those. And each time, we go ahead and generate an image in order to do that. And so now what we’re doing is if we look down at the bottom, I’m going to randomly generate a space with height 10 and width 20. And I’ll say go ahead and put three hospitals somewhere in the space. I’ll randomly generate 15 houses that I just go ahead and add in random locations. And now I’m going to run this hill climbing algorithm in order to try and figure out where we should place those hospitals. So we’ll go ahead and run this program by running Python hospitals. And we see that we started. Our initial state had a cost of 72, but we were able to continually find neighbors that were able to decrease that cost, decrease to 69, 66, 63, so on and so forth, all the way down to 53, as the best neighbor we were able to ultimately find. And we can take a look at what that looked like by just opening up these files. So here, for example, was the initial configuration. We randomly selected a location for each of these 15 different houses and then randomly selected locations for one, two, three hospitals that were just located somewhere inside of the state space. And if you add up all the distances from each of the houses to their nearest hospital, you get a total cost of about 72. And so now the question is, what neighbors can we move to that improve the situation? And it looks like the first one the algorithm found was by taking this house that was over there on the right and just moving it to the left. And that probably makes sense because if you look at the houses in that general area, really these five houses look like they’re probably the ones that are going to be closest to this hospital over here. Moving it to the left decreases the total distance, at least to most of these houses, though it does increase that distance for one of them. And so we’re able to make these improvements to the situation by continually finding ways that we can move these hospitals around until we eventually settle at this particular state that has a cost of 53, where we figured out a position for each of the hospitals. And now none of the neighbors that we could move to are actually going to improve the situation. We can take this hospital and this hospital and that hospital and look at each of the neighbors. And none of those are going to be better than this particular configuration. And again, that’s not to say that this is the best we could do. There might be some other configuration of hospitals that is a global minimum. And this might just be a local minimum that is the best of all of its neighbors, but maybe not the best in the entire possible state space. And you could search through the entire state space by considering all of the possible configurations for hospitals. But ultimately, that’s going to be very time intensive, especially as our state space gets bigger and there might be more and more possible states. It’s going to take quite a long time to look through all of them. And so being able to use these sort of local search algorithms can often be quite good for trying to find the best solution we can do. And especially if we don’t care about doing the best possible and we just care about doing pretty good and finding a pretty good placement of those hospitals, then these methods can be particularly powerful. But of course, we can try and mitigate some of this concern by instead of using hill climbing to use random restart, this idea of rather than just hill climb one time, we can hill climb multiple times and say, try hill climbing a whole bunch of times on the exact same map and figure out what is the best one that we’ve been able to find. And so I’ve here implemented a function for random restart that restarts some maximum number of times. And what we’re going to do is repeat that number of times this process of just go ahead and run the hill climbing algorithm, figure out what the cost is of getting from all the houses to the hospitals, and then figure out is this better than we’ve done so far. So I can try this exact same idea where instead of running hill climbing, I’ll go ahead and run random restart. And I’ll randomly restart maybe 20 times, for example. And we’ll go ahead and now I’ll remove all the images and then rerun the program. And now we started by finding a original state. When we initially ran hill climbing, the best cost we were able to find was 56. Each of these iterations is a different iteration of the hill climbing algorithm. We’re running hill climbing not one time, but 20 times here, each time going until we find a local minimum in this case. And we look and see each time did we do better than we did the best time we’ve done so far. So we went from 56 to 46. This one was greater, so we ignored it. This one was 41, which was less, so we went ahead and kept that one. And for all of the remaining 16 times that we tried to implement hill climbing and we tried to run the hill climbing algorithm, we couldn’t do any better than that 41. Again, maybe there is a way to do better that we just didn’t find, but it looks like that way ended up being a pretty good solution to the problem. That was attempt number three, starting from counting at zero. So we can take a look at that, open up number three. And this was the state that happened to have a cost of 41, that after running the hill climbing algorithm on some particular random initial configuration of hospitals, this is what we found was the local minimum in terms of trying to minimize the cost. And it looks like we did pretty well. This hospital is pretty close to this region. This one is pretty close to these houses here. This hospital looks about as good as we can do for trying to capture those houses over on that side. And so these sorts of algorithms can be quite useful for trying to solve these problems. But the real problem with many of these different types of hill climbing, steepest of sense, stochastic, first choice, and so forth, is that they never make a move that makes our situation worse. They’re always going to take ourselves in our current state, look at the neighbors, and consider can we do better than our current state and move to one of those neighbors. Which of those neighbors we choose might vary among these various different types of algorithms, but we never go from a current position to a position that is worse than our current position. And ultimately, that’s what we’re going to need to do if we want to be able to find a global maximum or a global minimum. Because sometimes if we get stuck, we want to find some way of dislodging ourselves from our local maximum or local minimum in order to find the global maximum or the global minimum or increase the probability that we do find it. And so the most popular technique for trying to approach the problem from that angle is a technique known as simulated annealing, simulated because it’s modeling after a real physical process of annealing, where you can think about this in terms of physics, a physical situation where you have some system of particles. And you might imagine that when you heat up a particular physical system, there’s a lot of energy there. Things are moving around quite randomly. But over time, as the system cools down, it eventually settles into some final position. And that’s going to be the general idea of simulated annealing. We’re going to simulate that process of some high temperature system where things are moving around randomly quite frequently, but over time decreasing that temperature until we eventually settle at our ultimate solution. And the idea is going to be if we have some state space landscape that looks like this and we begin at its initial state here, if we’re looking for a global maximum and we’re trying to maximize the value of the state, our traditional hill climbing algorithms would just take the state and look at the two neighbor ones and always pick the one that is going to increase the value of the state. But if we want some chance of being able to find the global maximum, we can’t always make good moves. We have to sometimes make bad moves and allow ourselves to make a move in a direction that actually seems for now to make our situation worse such that later we can find our way up to that global maximum in terms of trying to solve that problem. Of course, once we get up to this global maximum, once we’ve done a whole lot of the searching, then we probably don’t want to be moving to states that are worse than our current state. And so this is where this metaphor for annealing starts to come in, where we want to start making more random moves and over time start to make fewer of those random moves based on a particular temperature schedule. So the basic outline looks something like this. Early on in simulated annealing, we have a higher temperature state. And what we mean by a higher temperature state is that we are more likely to accept neighbors that are worse than our current state. We might look at our neighbors. And if one of our neighbors is worse than the current state, especially if it’s not all that much worse, if it’s pretty close but just slightly worse, then we might be more likely to accept that and go ahead and move to that neighbor anyways. But later on as we run simulated annealing, we’re going to decrease that temperature. And at a lower temperature, we’re going to be less likely to accept neighbors that are worse than our current state. Now to formalize this and put a little bit of pseudocode to it, here is what that algorithm might look like. We have a function called simulated annealing that takes as input the problem we’re trying to solve and also potentially some maximum number of times we might want to run the simulated annealing process, how many different neighbors we’re going to try and look for. And that value is going to vary based on the problem you’re trying to solve. We’ll, again, start with some current state that will be equal to the initial state of the problem. But now we need to repeat this process over and over for max number of times. Repeat some process some number of times where we’re first going to calculate a temperature. And this temperature function takes the current time t starting at 1 going all the way up to max and then gives us some temperature that we can use in our computation, where the idea is that this temperature is going to be higher early on and it’s going to be lower later on. So there are a number of ways this temperature function could often work. One of the simplest ways is just to say it is like the proportion of time that we still have remaining. Out of max units of time, how much time do we have remaining? You start off with a lot of that time remaining. And as time goes on, the temperature is going to decrease because you have less and less of that remaining time still available to you. So we calculate a temperature for the current time. And then we pick a random neighbor of the current state. No longer are we going to be picking the best neighbor that we possibly can or just one of the better neighbors that we can. We’re going to pick a random neighbor. It might be better. It might be worse. But we’re going to calculate that. We’re going to calculate delta E, E for energy in this case, which is just how much better is the neighbor than the current state. So if delta E is positive, that means the neighbor is better than our current state. If delta E is negative, that means the neighbor is worse than our current state. And so we can then have a condition that looks like this. If delta E is greater than 0, that means the neighbor state is better than our current state. And if ever that situation arises, we’ll just go ahead and update current to be that neighbor. Same as before, move where we are currently to be the neighbor because the neighbor is better than our current state. We’ll go ahead and accept that. But now the difference is that whereas before, we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that is actually going to make our situation worse because sometimes we’re going to need to dislodge ourselves from a local minimum or local maximum to increase the probability that we’re able to find the global minimum or the global maximum a little bit later. And so how do we do that? How do we decide to sometimes accept some state that might actually be worse? Well, we’re going to accept a worse state with some probability. And that probability needs to be based on a couple of factors. It needs to be based in part on the temperature, where if the temperature is higher, we’re more likely to move to a worse neighbor. And if the temperature is lower, we’re less likely to move to a worse neighbor. But it also, to some degree, should be based on delta E. If the neighbor is much worse than the current state, we probably want to be less likely to choose that than if the neighbor is just a little bit worse than the current state. So again, there are a couple of ways you could calculate this. But it turns out one of the most popular is just to calculate E to the power of delta E over T, where E is just a constant. Delta E over T are based on delta E and T here. We calculate that value. And that’ll be some value between 0 and 1. And that is the probability with which we should just say, all right, let’s go ahead and move to that neighbor. And it turns out that if you do the math for this value, when delta E is such that the neighbor is not that much worse than the current state, that’s going to be more likely that we’re going to go ahead and move to that state. And likewise, when the temperature is lower, we’re going to be less likely to move to that neighboring state as well. So now this is the big picture for simulated annealing, this process of taking the problem and going ahead and generating random neighbors will always move to a neighbor if it’s better than our current state. But even if the neighbor is worse than our current state, we’ll sometimes move there depending on how much worse it is and also based on the temperature. And as a result, the hope, the goal of this whole process is that as we begin to try and find our way to the global maximum or the global minimum, we can dislodge ourselves if we ever get stuck at a local maximum or local minimum in order to eventually make our way to exploring the part of the state space that is going to be the best. And then as the temperature decreases, eventually we settle there without moving around too much from what we’ve found to be the globally best thing that we can do thus far. So at the very end, we just return whatever the current state happens to be. And that is the conclusion of this algorithm. We’ve been able to figure out what the solution is. And these types of algorithms have a lot of different applications. Any time you can take a problem and formulate it as something where you can explore a particular configuration and then ask, are any of the neighbors better than this current configuration and have some way of measuring that, then there is an applicable case for these hill climbing, simulated annealing types of algorithms. So sometimes it can be for facility location type problems, like for when you’re trying to plan a city and figure out where the hospitals should be. But there are definitely other applications as well. And one of the most famous problems in computer science is the traveling salesman problem. Traveling salesman problem generally is formulated like this. I have a whole bunch of cities here indicated by these dots. And what I’d like to do is find some route that takes me through all of the cities and ends up back where I started. So some route that starts here, goes through all these cities, and ends up back where I originally started. And what I might like to do is minimize the total distance that I have to travel or the total cost of taking this entire path. And you can imagine this is a problem that’s very applicable in situations like when delivery companies are trying to deliver things to a whole bunch of different houses, they want to figure out, how do I get from the warehouse to all these various different houses and get back again, all using as minimal time and distance and energy as possible. So you might want to try to solve these sorts of problems. But it turns out that solving this particular kind of problem is very computationally difficult. It is a very computationally expensive task to be able to figure it out. This falls under the category of what are known as NP-complete problems, problems that there is no known efficient way to try and solve these sorts of problems. And so what we ultimately have to do is come up with some approximation, some ways of trying to find a good solution, even if we’re not going to find the globally best solution that we possibly can, at least not in a feasible or tractable amount of time. And so what we could do is take the traveling salesman problem and try to formulate it using local search and ask a question like, all right, I can pick some state, some configuration, some route between all of these nodes. And I can measure the cost of that state, figure out what the distance is. And I might now want to try to minimize that cost as much as possible. And then the only question now is, what does it mean to have a neighbor of this state? What does it mean to take this particular route and have some neighboring route that is close to it but slightly different and such that it might have a different total distance? And there are a number of different definitions for what a neighbor of a traveling salesman configuration might look like. But one way is just to say, a neighbor is what happens if we pick two of these edges between nodes and switch them effectively. So for example, I might pick these two edges here, these two that just happened across this node goes here, this node goes there, and go ahead and switch them. And what that process will generally look like is removing both of these edges from the graph, taking this node, and connecting it to the node it wasn’t connected to. So connecting it up here instead. We’ll need to take these arrows that were originally going this way and reverse them, so move them going the other way, and then just fill in that last remaining blank, add an arrow that goes in that direction instead. So by taking two edges and just switching them, I have been able to consider one possible neighbor of this particular configuration. And it looks like this neighbor is actually better. It looks like this probably travels a shorter distance in order to get through all the cities through this route than the current state did. And so you could imagine implementing this idea inside of a hill climbing or simulated annealing algorithm, where we repeat this process to try and take a state of this traveling salesman problem, look at all the neighbors, and then move to the neighbors if they’re better, or maybe even move to the neighbors if they’re worse, until we eventually settle upon some best solution that we’ve been able to find. And it turns out that these types of approximation algorithms, even if they don’t always find the very best solution, can often do pretty well at trying to find solutions that are helpful too. So that then was a look at local search, a particular category of algorithms that can be used for solving a particular type of problem, where we don’t really care about the path to the solution. I didn’t care about the steps I took to decide where the hospitals should go. I just cared about the solution itself. I just care about where the hospitals should be, or what the route through the traveling salesman journey really ought to be. Another type of algorithm that might come up are known as these categories of linear programming types of problems. And linear programming often comes up in the context where we’re trying to optimize for some mathematical function. But oftentimes, linear programming will come up when we might have real numbered values. So it’s not just discrete fixed values that we might have, but any decimal values that we might want to be able to calculate. And so linear programming is a family of types of problems where we might have a situation that looks like this, where the goal of linear programming is to minimize a cost function. And you can invert the numbers and say try and maximize it, but often we’ll frame it as trying to minimize a cost function that has some number of variables, x1, x2, x3, all the way up to xn, just some number of variables that are involved, things that I want to know the values to. And this cost function might have coefficients in front of those variables. And this is what we would call a linear equation, where we just have all of these variables that might be multiplied by a coefficient and then add it together. We’re not going to square anything or cube anything, because that’ll give us different types of equations. With linear programming, we’re just dealing with linear equations in addition to linear constraints, where a constraint is going to look something like if we sum up this particular equation that is just some linear combination of all of these variables, it is less than or equal to some bound b. And we might have a whole number of these various different constraints that we might place onto our linear programming exercise. And likewise, just as we can have constraints that are saying this linear equation is less than or equal to some bound b, it might also be equal to something. That if you want some sum of some combination of variables to be equal to a value, you can specify that. And we can also maybe specify that each variable has lower and upper bounds, that it needs to be a positive number, for example, or it needs to be a number that is less than 50, for example. And there are a number of other choices that we can make there for defining what the bounds of a variable are. But it turns out that if you can take a problem and formulate it in these terms, formulate the problem as your goal is to minimize a cost function, and you’re minimizing that cost function subject to particular constraints, subjects to equations that are of the form like this of some sequence of variables is less than a bound or is equal to some particular value, then there are a number of algorithms that already exist for solving these sorts of problems. So let’s go ahead and take a look at an example. Here’s an example of a problem that might come up in the world of linear programming. Often, this is going to come up when we’re trying to optimize for something. And we want to be able to do some calculations, and we have constraints on what we’re trying to optimize. And so it might be something like this. In the context of a factory, we have two machines, x1 and x2. x1 costs $50 an hour to run. x2 costs $80 an hour to run. And our goal, what we’re trying to do, our objective, is to minimize the total cost. So that’s what we’d like to do. But we need to do so subject to certain constraints. So there might be a labor constraint that x1 requires five units of labor per hour, x2 requires two units of labor per hour, and we have a total of 20 units of labor that we have to spend. So this is a constraint. We have no more than 20 units of labor that we can spend, and we have to spend it across x1 and x2, each of which requires a different amount of labor. And we might also have a constraint like this that tells us x1 is going to produce 10 units of output per hour, x2 is going to produce 12 units of output per hour, and the company needs 90 units of output. So we have some goal, something we need to achieve. We need to achieve 90 units of output, but there are some constraints that x1 can only produce 10 units of output per hour, x2 produces 12 units of output per hour. These types of problems come up quite frequently, and you can start to notice patterns in these types of problems, problems where I am trying to optimize for some goal, minimizing cost, maximizing output, maximizing profits, or something like that. And there are constraints that are placed on that process. And so now we just need to formulate this problem in terms of linear equations. So let’s start with this first point. Two machines, x1 and x2, x costs $50 an hour, x2 costs $80 an hour. Here we can come up with an objective function that might look like this. This is our cost function, rather. 50 times x1 plus 80 times x2, where x1 is going to be a variable representing how many hours do we run machine x1 for, x2 is going to be a variable representing how many hours are we running machine x2 for. And what we’re trying to minimize is this cost function, which is just how much it costs to run each of these machines per hour summed up. This is an example of a linear equation, just some combination of these variables plus coefficients that are placed in front of them. And I would like to minimize that total value. But I need to do so subject to these constraints. x1 requires 50 units of labor per hour, x2 requires 2, and we have a total of 20 units of labor to spend. And so that gives us a constraint of this form. 5 times x1 plus 2 times x2 is less than or equal to 20. 20 is the total number of units of labor we have to spend. And that’s spent across x1 and x2, each of which requires a different number of units of labor per hour, for example. And finally, we have this constraint here. x1 produces 10 units of output per hour, x2 produces 12, and we need 90 units of output. And so this might look something like this. That 10×1 plus 12×2, this is amount of output per hour, it needs to be at least 90. We can do better or great, but it needs to be at least 90. And if you recall from my formulation before, I said that generally speaking in linear programming, we deal with equals constraints or less than or equal to constraints. So we have a greater than or equal to sign here. That’s not a problem. Whenever we have a greater than or equal to sign, we can just multiply the equation by negative 1, and that’ll flip it around to a less than or equals negative 90, for example, instead of a greater than or equal to 90. And that’s going to be an equivalent expression that we can use to represent this problem. So now that we have this cost function and these constraints that it’s subject to, it turns out there are a number of algorithms that can be used in order to solve these types of problems. And these problems go a little bit more into geometry and linear algebra than we’re really going to get into. But the most popular of these types of algorithms are simplex, which was one of the first algorithms discovered for trying to solve linear programs. And later on, a class of interior point algorithms can be used to solve this type of problem as well. The key is not to understand exactly how these algorithms work, but to realize that these algorithms exist for efficiently finding solutions any time we have a problem of this particular form. And so we can take a look, for example, at the production directory here, where here I have a file called production.py, where here I’m using scipy, which was the library for a lot of science-related functions within Python. And I can go ahead and just run this optimization function in order to run a linear program. .linprog here is going to try and solve this linear program for me, where I provide to this expression, to this function call, all of the data about my linear program. So it needs to be in a particular format, which might be a little confusing at first. But this first argument to scipy.optimize.linprogramming is the cost function, which is in this case just an array or a list that has 50 and 80, because my original cost function was 50 times x1 plus 80 times x2. So I just tell Python, 50 and 80, those are the coefficients that I am now trying to optimize for. And then I provide all of the constraints. So the constraints, and I wrote them up above in comments, is the constraint 1 is 5×1 plus 2×2 is less than or equal to 20. And constraint 2 is negative 10×1 plus negative 12×2 is less than or equal to negative 90. And so scipy expects these constraints to be in a particular format. It first expects me to provide all of the coefficients for the upper bound equations, ub just for upper bound, where the coefficients of the first equation are 5 and 2, because we have 5×1 and 2×2. And the coefficients for the second equation are negative 10 and negative 12, because I have negative 10×1 plus negative 12×2. And then here, we provide it as a separate argument, just to keep things separate, what the actual bound is. What is the upper bound for each of these constraints? Well, for the first constraint, the upper bound is 20. That was constraint number 1. And then for constraint number 2, the upper bound is 90. So a bit of a cryptic way of representing it. It’s not quite as simple as just writing the mathematical equations. What really is being expected here are all of the coefficients and all of the numbers that are in these equations by first providing the coefficients for the cost function, then providing all the coefficients for the inequality constraints, and then providing all of the upper bounds for those inequality constraints. And once all of that information is there, then we can run any of these interior point algorithms or the simplex algorithm. Even if you don’t understand how it works, you can just run the function and figure out what the result should be. And here, I said if the result is a success, we were able to solve this problem. Go ahead and print out what the value of x1 and x2 should be. Otherwise, go ahead and print out no solution. And so if I run this program by running python production.py, it takes a second to calculate. But then we see here is what the optimal solution should be. x1 should run for 1.5 hours. x2 should run for 6.25 hours. And we were able to do this by just formulating the problem as a linear equation that we were trying to optimize, some cost that we were trying to minimize, and then some constraints that were placed on that. And many, many problems fall into this category of problems that you can solve if you can just figure out how to use equations and use these constraints to represent that general idea. And that’s a theme that’s going to come up a couple of times today, where we want to be able to take some problem and reduce it down to some problem we know how to solve in order to begin to find a solution and to use existing methods that we can use in order to find a solution more effectively or more efficiently. And it turns out that these types of problems, where we have constraints, show up in other ways too. And there’s an entire class of problems that’s more generally just known as constraint satisfaction problems. And we’re going to now take a look at how you might formulate a constraint satisfaction problem and how you might go about solving a constraint satisfaction problem. But the basic idea of a constraint satisfaction problem is we have some number of variables that need to take on some values. And we need to figure out what values each of those variables should take on. But those variables are subject to particular constraints that are going to limit what values those variables can actually take on. So let’s take a look at a real world example, for example. Let’s look at exam scheduling, that I have four students here, students 1, 2, 3, and 4. Each of them is taking some number of different classes. Classes here are going to be represented by letters. So student 1 is enrolled in courses A, B, and C. Student 2 is enrolled in courses B, D, and E, so on and so forth. And now, say university, for example, is trying to schedule exams for all of these courses. But there are only three exam slots on Monday, Tuesday, and Wednesday. And we have to schedule an exam for each of these courses. But the constraint now, the constraint we have to deal with with the scheduling, is that we don’t want anyone to have to take two exams on the same day. We would like to try and minimize that or eliminate it if at all possible. So how do we begin to represent this idea? How do we structure this in a way that a computer with an AI algorithm can begin to try and solve the problem? Well, let’s in particular just look at these classes that we might take and represent each of the courses as some node inside of a graph. And what we’ll do is we’ll create an edge between two nodes in this graph if there is a constraint between those two nodes. So what does this mean? Well, we can start with student 1, who’s enrolled in courses A, B, and C. What that means is that A and B can’t have an exam at the same time. A and C can’t have an exam at the same time. And B and C also can’t have an exam at the same time. And I can represent that in this graph by just drawing edges. One edge between A and B, one between B and C, and then one between C and A. And that encodes now the idea that between those nodes, there is a constraint. And in particular, the constraint happens to be that these two can’t be equal to each other, though there are other types of constraints that are possible, depending on the type of problem that you’re trying to solve. And then we can do the same thing for each of the other students. So for student 2, who’s enrolled in courses B, D, and E, well, that means B, D, and E, those all need to have edges that connect each other as well. Student 3 is enrolled in courses C, E, and F. So we’ll go ahead and take C, E, and F and connect those by drawing edges between them too. And then finally, student 4 is enrolled in courses E, F, and G. And we can represent that by drawing edges between E, F, and G, although E and F already had an edge between them. We don’t need another one, because this constraint is just encoding the idea that course E and course F cannot have an exam on the same day. So this then is what we might call the constraint graph. There’s some graphical representation of all of my variables, so to speak, and the constraints between those possible variables. Where in this particular case, each of the constraints represents an inequality constraint, that an edge between B and D means whatever value the variable B takes on cannot be the value that the variable D takes on as well. So what then actually is a constraint satisfaction problem? Well, a constraint satisfaction problem is just some set of variables, x1 all the way through xn, some set of domains for each of those variables. So every variable needs to take on some values. Maybe every variable has the same domain, but maybe each variable has a slightly different domain. And then there’s a set of constraints, and we’ll just call a set C, that is some constraints that are placed upon these variables, like x1 is not equal to x2. But there could be other forms too, like maybe x1 equals x2 plus 1 if these variables are taking on numerical values in their domain, for example. The types of constraints are going to vary based on the types of problems. And constraint satisfaction shows up all over the place as well, in any situation where we have variables that are subject to particular constraints. So one popular game is Sudoku, for example, this 9 by 9 grid where you need to fill in numbers in each of these cells, but you want to make sure there’s never a duplicate number in any row, or in any column, or in any grid of 3 by 3 cells, for example. So what might this look like as a constraint satisfaction problem? Well, my variables are all of the empty squares in the puzzle. So represented here is just like an x comma y coordinate, for example, as all of the squares where I need to plug in a value, where I don’t know what value it should take on. The domain is just going to be all of the numbers from 1 through 9, any value that I could fill in to one of these cells. So that is going to be the domain for each of these variables. And then the constraints are going to be of the form, like this cell can’t be equal to this cell, can’t be equal to this cell, can’t be, and all of these need to be different, for example, and same for all of the rows, and the columns, and the 3 by 3 squares as well. So those constraints are going to enforce what values are actually allowed. And we can formulate the same idea in the case of this exam scheduling problem, where the variables we have are the different courses, a up through g. The domain for each of these variables is going to be Monday, Tuesday, and Wednesday. Those are the possible values each of the variables can take on, that in this case just represent when is the exam for that class. And then the constraints are of this form, a is not equal to b, a is not equal to c, meaning a and b can’t have an exam on the same day, a and c can’t have an exam on the same day. Or more formally, these two variables cannot take on the same value within their domain. So that then is this formulation of a constraint satisfaction problem that we can begin to use to try and solve this problem. And constraints can come in a number of different forms. There are hard constraints, which are constraints that must be satisfied for a correct solution. So something like in the Sudoku puzzle, you cannot have this cell and this cell that are in the same row take on the same value. That is a hard constraint. But problems can also have soft constraints, where these are constraints that express some notion of preference, that maybe a and b can’t have an exam on the same day, but maybe someone has a preference that a’s exam is earlier than b’s exam. It doesn’t need to be the case with some expression that some solution is better than another solution. And in that case, you might formulate the problem as trying to optimize for maximizing people’s preferences. You want people’s preferences to be satisfied as much as possible. In this case, though, we’ll mostly just deal with hard constraints, constraints that must be met in order to have a correct solution to the problem. So we want to figure out some assignment of these variables to their particular values that is ultimately going to give us a solution to the problem by allowing us to assign some day to each of the classes such that we don’t have any conflicts between classes. So it turns out that we can classify the constraints in a constraint satisfaction problem into a number of different categories. The first of those categories are perhaps the simplest of the types of constraints, which are known as unary constraints, where unary constraint is a constraint that just involves a single variable. For example, a unary constraint might be something like, a does not equal Monday, meaning Course A cannot have its exam on Monday. If for some reason the instructor for the course isn’t available on Monday, you might have a constraint in your problem that looks like this, something that just has a single variable a in it, and maybe says a is not equal to Monday, or a is equal to something, or in the case of numbers greater than or less than something, a constraint that just has one variable, we consider to be a unary constraint. And this is in contrast to something like a binary constraint, which is a constraint that involves two variables, for example. So this would be a constraint like the ones we were looking at before. Something like a does not equal b is an example of a binary constraint, because it is a constraint that has two variables involved in it, a and b. And we represented that using some arc or some edge that connects variable a to variable b. And using this knowledge of, OK, what is a unary constraint? What is a binary constraint? There are different types of things we can say about a particular constraint satisfaction problem. And one thing we can say is we can try and make the problem node consistent. So what does node consistency mean? Node consistency means that we have all of the values in a variable’s domain satisfying that variable’s unary constraints. So for each of the variables inside of our constraint satisfaction problem, if all of the values satisfy the unary constraints for that particular variable, we can say that the entire problem is node consistent, or we can even say that a particular variable is node consistent if we just want to make one node consistent within itself. So what does that actually look like? Let’s look at now a simplified example, where instead of having a whole bunch of different classes, we just have two classes, a and b, each of which has an exam on either Monday or Tuesday or Wednesday. So this is the domain for the variable a, and this is the domain for the variable b. And now let’s imagine we have these constraints, a not equal to Monday, b not equal to Tuesday, b not equal to Monday, a not equal to b. So those are the constraints that we have on this particular problem. And what we can now try to do is enforce node consistency. And node consistency just means we make sure that all of the values for any variable’s domain satisfy its unary constraints. And so we could start by trying to make node a node consistent. Is it consistent? Does every value inside of a’s domain satisfy its unary constraints? Well, initially, we’ll see that Monday does not satisfy a’s unary constraints, because we have a constraint, a unary constraint here, that a is not equal to Monday. But Monday is still in a’s domain. And so this is something that is not node consistent, because we have Monday in the domain. But this is not a valid value for this particular node. And so how do we make this node consistent? Well, to make the node consistent, what we’ll do is we’ll just go ahead and remove Monday from a’s domain. Now a can only be on Tuesday or Wednesday, because we had this constraint that said a is not equal to Monday. And at this point now, a is node consistent. For each of the values that a can take on, Tuesday and Wednesday, there is no constraint that is a unary constraint that conflicts with that idea. There is no constraint that says that a can’t be Tuesday. There is no unary constraint that says that a cannot be on Wednesday. And so now we can turn our attention to b. b also has a domain, Monday, Tuesday, and Wednesday. And we can begin to see whether those variables satisfy the unary constraints as well. Well, here is a unary constraint, b is not equal to Tuesday. And that does not appear to be satisfied by this domain of Monday, Tuesday, and Wednesday, because Tuesday, this possible value that the variable b could take on is not consistent with this unary constraint, that b is not equal to Tuesday. So to solve that problem, we’ll go ahead and remove Tuesday from b’s domain. Now b’s domain only contains Monday and Wednesday. But as it turns out, there’s yet another unary constraint that we placed on the variable b, which is here. b is not equal to Monday. And that means that this value, Monday, inside of b’s domain, is not consistent with b’s unary constraints, because we have a constraint that says the b cannot be Monday. And so we can remove Monday from b’s domain. And now we’ve made it through all of the unary constraints. We’ve not yet considered this constraint, which is a binary constraint. But we’ve considered all of the unary constraints, all of the constraints that involve just a single variable. And we’ve made sure that every node is consistent with those unary constraints. So we can say that now we have enforced node consistency, that for each of these possible nodes, we can pick any of these values in the domain. And there won’t be a unary constraint that is violated as a result of it. So node consistency is fairly easy to enforce. We just take each node, make sure the values in the domain satisfy the unary constraints. Where things get a little bit more interesting is when we consider different types of consistency, something like arc consistency, for example. And arc consistency refers to when all of the values in a variable’s domain satisfy the variable’s binary constraints. So when we’re looking at trying to make a arc consistent, we’re no longer just considering the unary constraints that involve a. We’re trying to consider all of the binary constraints that involve a as well. So any edge that connects a to another variable inside of that constraint graph that we were taking a look at before. Put a little bit more formally, arc consistency. And arc really is just another word for an edge that connects two of these nodes inside of our constraint graph. We can define arc consistency a little more precisely like this. In order to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x’s domain to make sure that every choice for x, every choice in x’s domain, has a possible choice for y. So put another way, if I have a variable x and I want to make x an arc consistent, then I’m going to look at all of the possible values that x can take on and make sure that for all of those possible values, there is still some choice that I can make for y, if there’s some arc between x and y, to make sure that y has a possible option that I can choose as well. So let’s look at an example of that going back to this example from before. We enforced node consistency already by saying that a can only be on Tuesday or Wednesday because we knew that a could not be on Monday. And we also said that b’s only domain only consists of Wednesday because we know that b does not equal Tuesday and also b does not equal Monday. So now let’s begin to consider arc consistency. Let’s try and make a arc consistent with b. And what that means is to make a arc consistent with respect to b means that for any choice we make in a’s domain, there is some choice we can make in b’s domain that is going to be consistent. And we can try that. For a, we can choose Tuesday as a possible value for a. If I choose Tuesday for a, is there a value for b that satisfies the binary constraint? Well, yes, b Wednesday would satisfy this constraint that a does not equal b because Tuesday does not equal Wednesday. However, if we chose Wednesday for a, well, then there is no choice in b’s domain that satisfies this binary constraint. There is no way I can choose something for b that satisfies a does not equal b because I know b must be Wednesday. And so if ever I run into a situation like this where I see that here is a possible value for a such that there is no choice of value for b that satisfies the binary constraint, well, then this is not arc consistent. And to make it arc consistent, I would need to take Wednesday and remove it from a’s domain. Because Wednesday was not going to be a possible choice I can make for a because it wasn’t consistent with this binary constraint for b. There was no way I could choose Wednesday for a and still have an available solution by choosing something for b as well. So here now, I’ve been able to enforce arc consistency. And in doing so, I’ve actually solved this entire problem, that given these constraints where a and b can have exams on either Monday or Tuesday or Wednesday, the only solution, as it would appear, is that a’s exam must be on Tuesday and b’s exam must be on Wednesday. And that is the only option available to me. So if we want to apply our consistency to a larger graph, not just looking at one particular pair of our consistency, there are ways we can do that too. And we can begin to formalize what the pseudocode would look like for trying to write an algorithm that enforces arc consistency. And we’ll start by defining a function called revise. Revise is going to take as input a CSP, otherwise known as a constraint satisfaction problem, and also two variables, x and y. And what revise is going to do is it is going to make x arc consistent with respect to y, meaning remove anything from x’s domain that doesn’t allow for a possible option for y. How does this work? Well, we’ll go ahead and first keep track of whether or not we’ve made a revision. Revise is ultimately going to return true or false. It’ll return true in the event that we did make a revision to x’s domain. It’ll return false if we didn’t make any change to x’s domain. And we’ll see in a moment why that’s going to be helpful. But we start by saying revised equals false. We haven’t made any changes. Then we’ll say, all right, let’s go ahead and loop over all of the possible values in x’s domain. So loop over x’s domain for each little x in x’s domain. I want to make sure that for each of those choices, I have some available choice in y that satisfies the binary constraints that are defined inside of my CSP, inside of my constraint satisfaction problem. So if ever it’s the case that there is no value y in y’s domain that satisfies the constraint for x and y, well, if that’s the case, that means that this value x shouldn’t be in x’s domain. So we’ll go ahead and delete x from x’s domain. And I’ll set revised equal to true because I did change x’s domain. I changed x’s domain by removing little x. And I removed little x because it wasn’t art consistent. There was no way I could choose a value for y that would satisfy this xy constraint. So in this case, we’ll go ahead and set revised equal true. And we’ll do this again and again for every value in x’s domain. Sometimes it might be fine. In other cases, it might not allow for a possible choice for y, in which case we need to remove this value from x’s domain. And at the end, we just return revised to indicate whether or not we actually made a change. So this function, then, this revised function is effectively an implementation of what you saw me do graphically a moment ago. And it makes one variable, x, arc consistent with another variable, in this case, y. But generally speaking, when we want to enforce our consistency, we’ll often want to enforce our consistency not just for a single arc, but for the entire constraint satisfaction problem. And it turns out there’s an algorithm to do that as well. And that algorithm is known as AC3. AC3 takes a constraint satisfaction problem. And it enforces our consistency across the entire problem. How does it do that? Well, it’s going to basically maintain a queue or basically just a line of all of the arcs that it needs to make consistent. And over time, we might remove things from that queue as we begin dealing with our consistency. And we might need to add things to that queue as well if there are more things we need to make arc consistent. So we’ll go ahead and start with a queue that contains all of the arcs in the constraint satisfaction problem, all of the edges that connect two nodes that have some sort of binary constraint between them. And now, as long as the queue is non-empty, there is work to be done. The queue is all of the things that we need to make arc consistent. So as long as the queue is non-empty, there’s still things we have to do. What do we have to do? Well, we’ll start by de-queuing from the queue, remove something from the queue. And strictly speaking, it doesn’t need to be a queue, but a queue is a traditional way of doing this. We’ll de-queue from the queue, and that’ll give us an arc, x and y, these two variables where I would like to make x arc consistent with y. So how do we make x arc consistent with y? Well, we can go ahead and just use that revise function that we talked about a moment ago. We called the revise function, passing as input the constraint satisfaction problem, and also these variables x and y, because I want to make x arc consistent with y. In other words, remove any values from x’s domain that don’t leave an available option for y. And recall, what does revised return? Well, it returns true if we actually made a change, if we removed something from x’s domain, because there wasn’t an available option for y, for example. And it returns false if we didn’t make any change to x’s domain at all. And it turns out if revised returns false, if we didn’t make any changes, well, then there’s not a whole lot more work to be done here for this arc. We can just move ahead to the next arc that’s in the queue. But if we did make a change, if we did reduce x’s domain by removing values from x’s domain, well, then what we might realize is that this creates potential problems later on, that it might mean that some arc that was arc consistent with x, that node might no longer be arc consistent with x, because while there used to be an option that we could choose for x, now there might not be, because now we might have removed something from x that was necessary for some other arc to be arc consistent. And so if ever we did revise x’s domain, we’re going to need to add some things to the queue, some additional arcs that we might want to check. How do we do that? Well, first thing we want to check is to make sure that x’s domain is not 0. If x’s domain is 0, that means there are no available options for x at all. And that means that there’s no way you can solve the constraint satisfaction problem. If we’ve removed everything from x’s domain, we’ll go ahead and just return false here to indicate there’s no way to solve the problem, because there’s nothing left in x’s domain. But otherwise, if there are things left in x’s domain, but fewer things than before, well, then what we’ll do is we’ll loop over each variable z that is in all of x’s neighbors, except for y, y we already handled. But we’ll consider all of x’s other’s neighbors and ask ourselves, all right, will that arc from each of those z’s to x, that arc might no longer be arc consistent, because while for each z, there might have been a possible option we could choose for x to correspond with each of z’s possible values, now there might not be, because we removed some elements from x’s domain. And so what we’ll do here is we’ll go ahead and enqueue, adding something to the queue, this arc zx for all of those neighbors z. So we need to add back some arcs to the queue in order to continue to enforce arc consistency. At the very end, if we make it through all this process, then we can return true. But this now is AC3, this algorithm for enforcing arc consistency on a constraint satisfaction problem. And the big idea is really just keep track of all of the arcs that we might need to make arc consistent, make it arc consistent by calling the revise function. And if we did revise it, then there are some new arcs that might need to be added to the queue in order to make sure that everything is still arc consistent, even after we’ve removed some of the elements from a particular variable’s domain. So what then would happen if we tried to enforce arc consistency on a graph like this, on a graph where each of these variables has a domain of Monday, Tuesday, and Wednesday? Well, it turns out that by enforcing arc consistency on this graph, well, it can solve some types of problems. Nothing actually changes here. For any particular arc, just considering two variables, there’s always a way for me to just, for any of the choices I make for one of them, make a choice for the other one, because there are three options, and I just need the two to be different from each other. So this is actually quite easy to just take an arc and just declare that it is arc consistent, because if I pick Monday for D, then I just pick something that isn’t Monday for B. In arc consistency, we only consider consistency between a binary constraint between two nodes, and we’re not really considering all of the rest of the nodes yet. So just using AC3, the enforcement of arc consistency, that can sometimes have the effect of reducing domains to make it easier to find solutions, but it will not always actually solve the problem. We might still need to somehow search to try and find a solution. And we can use classical traditional search algorithms to try to do so. You’ll recall that a search problem generally consists of these parts. We have some initial state, some actions, a transition model that takes me from one state to another state, a goal test to tell me have I satisfied my objective correctly, and then some path cost function, because in the case of like maze solving, I was trying to get to my goal as quickly as possible. So you could formulate a CSP, or a constraint satisfaction problem, as one of these types of search problems. The initial state will just be an empty assignment, where an assignment is just a way for me to assign any particular variable to any particular value. So if an empty assignment is no variables that are assigned to any values yet, then the action I can take is adding some new variable equals value pair to that assignment, saying for this assignment, let me add a new value for this variable. And the transition model just defines what happens when you take that action. You get a new assignment that has that variable equal to that value inside of it. The goal test is just checking to make sure all the variables have been assigned and making sure all the constraints have been satisfied. And the path cost function is sort of irrelevant. I don’t really care about what the path really is. I just care about finding some assignment that actually satisfies all of the constraints. So really, all the paths have the same cost. I don’t really care about the path to the goal. I just care about the solution itself, much as we’ve talked about now before. The problem here, though, is that if we just implement this naive search algorithm just by implementing like breadth-first search or depth-first search, this is going to be very, very inefficient. And there are ways we can take advantage of efficiencies in the structure of a constraint satisfaction problem itself. And one of the key ideas is that we can really just order these variables. And it doesn’t matter what order we assign variables in. The assignment a equals 2 and then b equals 8 is identical to the assignment of b equals 8 and then a equals 2. Switching the order doesn’t really change anything about the fundamental nature of that assignment. And so there are some ways that we can try and revise this idea of a search algorithm to apply it specifically for a problem like a constraint satisfaction problem. And it turns out the search algorithm we’ll generally use when talking about constraint satisfaction problems is something known as backtracking search. And the big idea of backtracking search is we’ll go ahead and make assignments from variables to values. And if ever we get stuck, we arrive at a place where there is no way we can make any forward progress while still preserving the constraints that we need to enforce, we’ll go ahead and backtrack and try something else instead. So the very basic sketch of what backtracking search looks like is it looks like this. Function called backtrack that takes as input an assignment and a constraint satisfaction problem. So initially, we don’t have any assigned variables. So when we begin backtracking search, this assignment is just going to be the empty assignment with no variables inside of it. But we’ll see later this is going to be a recursive function. So backtrack takes as input the assignment and the problem. If the assignment is complete, meaning all of the variables have been assigned, we just return that assignment. That, of course, won’t be true initially, because we start with an empty assignment. But over time, we might add things to that assignment. So if ever the assignment actually is complete, then we’re done. Then just go ahead and return that assignment. But otherwise, there is some work to be done. So what we’ll need to do is select an unassigned variable for this particular problem. So we need to take the problem, look at the variables that have already been assigned, and pick a variable that has not yet been assigned. And I’ll go ahead and take that variable. And then I need to consider all of the values in that variable’s domain. So we’ll go ahead and call this domain values function. We’ll talk a little more about that later, that takes a variable and just gives me back an ordered list of all of the values in its domain. So I’ve taken a random unselected variable. I’m going to loop over all of the possible values. And the idea is, let me just try all of these values as possible values for the variable. So if the value is consistent with the assignment so far, it doesn’t violate any of the constraints, well then let’s go ahead and add variable equals value to the assignment because it’s so far consistent. And now let’s recursively call backtrack to try and make the rest of the assignments also consistent. So I’ll go ahead and call backtrack on this new assignment that I’ve added the variable equals value to. And now I recursively call backtrack and see what the result is. And if the result isn’t a failure, well then let me just return that result. And otherwise, what else could happen? Well, if it turns out the result was a failure, well then that means this value was probably a bad choice for this particular variable because when I assigned this variable equal to that value, eventually down the road I ran into a situation where I violated constraints. There was nothing more I could do. So now I’ll remove variable equals value from the assignment, effectively backtracking to say, all right, that value didn’t work. Let’s try another value instead. And then at the very end, if we were never able to return a complete assignment, we’ll just go ahead and return failure because that means that none of the values worked for this particular variable. This now is the idea for backtracking search, to take each of the variables, try values for them, and recursively try backtracking search, see if we can make progress. And if ever we run into a dead end, we run into a situation where there is no possible value we can choose that satisfies the constraints, we return failure. And that propagates up, and eventually we make a different choice by going back and trying something else instead. So let’s put this algorithm into practice. Let’s actually try and use backtracking search to solve this problem now, where I need to figure out how to assign each of these courses to an exam slot on Monday or Tuesday or Wednesday in such a way that it satisfies these constraints, that each of these edges mean those two classes cannot have an exam on the same day. So I can start by just starting at a node. It doesn’t really matter which I start with, but in this case, I’ll just start with A. And I’ll ask the question, all right, let me loop over the values in the domain. And maybe in this case, I’ll just start with Monday and say, all right, let’s go ahead and assign A to Monday. We’ll just go and order Monday, Tuesday, Wednesday. And now let’s consider node B. So I’ve made an assignment to A, so I recursively call backtrack with this new part of the assignment. And now I’m looking to pick another unassigned variable like B. And I’ll say, all right, maybe I’ll start with Monday, because that’s the very first value in B’s domain. And I ask, all right, does Monday violate any constraints? And it turns out, yes, it does. It violates this constraint here between A and B, because A and B are now both on Monday, and that doesn’t work, because B can’t be on the same day as A. So that doesn’t work. So we might instead try Tuesday, try the next value in B’s domain. And is that consistent with the assignment so far? Well, yeah, B, Tuesday, A, Monday, that is consistent so far, because they’re not on the same day. So that’s good. Now we can recursively call backtrack. Try again. Pick another unassigned variable, something like D, and say, all right, let’s go through its possible values. Is Monday consistent with this assignment? Well, yes, it is. B and D are on different days, Monday versus Tuesday. And A and B are also on different days, Monday versus Tuesday. So that’s fine so far, too. We’ll go ahead and try again. Maybe we’ll go to this variable here, E. Say, can we make that consistent? Let’s go through the possible values. We’ve recursively called backtrack. We might start with Monday and say, all right, that’s not consistent, because D and E now have exams on the same day. So we might try Tuesday instead, going to the next one. Ask, is that consistent? Well, no, it’s not, because B and E, those have exams on the same day. And so we try, all right, is Wednesday consistent? And in turn, it’s like, all right, yes, it is. Wednesday is consistent, because D and E now have exams on different days. B and E now have exams on different days. All seems to be well so far. I recursively call backtrack, select another unassigned variable, we’ll say maybe choose C this time, and say, all right, let’s try the values that C could take on. Let’s start with Monday. And it turns out that’s not consistent, because now A and C both have exams on the same day. So I try Tuesday and say, that’s not consistent either, because B and C now have exams on the same day. And then I say, all right, let’s go ahead and try Wednesday. But that’s not consistent either, because C and E each have exams on the same day too. So now we’ve gone through all the possible values for C, Monday, Tuesday, and Wednesday. And none of them are consistent. There is no way we can have a consistent assignment. Backtrack, in this case, will return a failure. And so then we’d say, all right, we have to backtrack back to here. Well, now for E, we’ve tried all of Monday, Tuesday, and Wednesday. And none of those work, because Wednesday, which seemed to work, turned out to be a failure. So that means there’s no possible way we can assign E. So that’s a failure too. We have to go back up to D, which means that Monday assignment to D, that must be wrong. We must try something else. So we can try, all right, what if instead of Monday, we try Tuesday? Tuesday, it turns out, is not consistent, because B and D now have an exam on the same day. But Wednesday, as it turns out, works. And now we can begin to mix and forward progress again. We go back to E and say, all right, which of these values works? Monday turns out to work by not violating any constraints. Then we go up to C now. Monday doesn’t work, because it violates a constraint. Violates two, actually. Tuesday doesn’t work, because it violates a constraint as well. But Wednesday does work. Then we can go to the next variable, F, and say, all right, does Monday work? We’ll know. It violates a constraint. But Tuesday does work. And then finally, we can look at the last variable, G, recursively calling backtrack one more time. Monday is inconsistent. That violates a constraint. Tuesday also violates a constraint. But Wednesday, that doesn’t violate a constraint. And so now at this point, we recursively call backtrack one last time. We now have a satisfactory assignment of all of the variables. And at this point, we can say that we are now done. We have now been able to successfully assign a variable or a value to each one of these variables in such a way that we’re not violating any constraints. We’re going to go ahead and have classes A and E have their exams on Monday. Classes B and F can have their exams on Tuesday. And classes C, D, and G can have their exams on Wednesday. And there’s no violated constraints that might come up there. So that then was a graphical look at how this might work. Let’s now take a look at some code we could use to actually try and solve this problem as well. So here I’ll go ahead and go into the scheduling directory. We’re here now. We’ll start by looking at schedule0.py. We’re here. I define a list of variables, A, B, C, D, E, F, G. Those are all different classes. Then underneath that, I define my list of constraints. So constraint A and B. That is a constraint because they can’t be on the same day. Likewise, A and C, B and C, so on and so forth, enforcing those exact same constraints. And here then is what the backtracking function might look like. First, if the assignment is complete, if I’ve made an assignment of every variable to a value, go ahead and just return that assignment. Then we’ll select an unassigned variable from that assignment. Then for each of the possible values in the domain, Monday, Tuesday, Wednesday, let’s go ahead and create a new assignment that assigns the variable to that value. I’ll call this consistent function, which I’ll show you in a moment, that just checks to make sure this new assignment is consistent. But if it is consistent, we’ll go ahead and call backtrack to go ahead and continue trying to run backtracking search. And as long as the result is not none, meaning it wasn’t a failure, we can go ahead and return that result. But if we make it through all the values and nothing works, then it is a failure. There’s no solution. We go ahead and return none here. What do these functions do? Select unassigned variable is just going to choose a variable not yet assigned. So it’s going to loop over all the variables. And if it’s not already assigned, we’ll go ahead and just return that variable. And what does the consistent function do? Well, the consistent function goes through all the constraints. And if we have a situation where we’ve assigned both of those values to variables, but they are the same, well, then that is a violation of the constraint, in which case we’ll return false. But if nothing is inconsistent, then the assignment is consistent and will return true. And then all the program does is it calls backtrack on an empty assignment, an empty dictionary that has no variable assigned and no values yet, save that inside a solution, and then print out that solution. So by running this now, I can run Python schedule0.py. And what I get as a result of that is an assignment of all these variables to values. And it turns out we assign a to Monday as we would expect, b to Tuesday, c to Wednesday, exactly the same type of thing we were talking about before, an assignment of each of these variables to values that doesn’t violate any constraints. And I had to do a fair amount of work in order to implement this idea myself. I had to write the backtrack function that went ahead and went through this process of recursively trying to do this backtracking search. But it turns out the constraint satisfaction problems are so popular that there exist many libraries that already implement this type of idea. Again, as with before, the specific library is not as important as the fact that libraries do exist. This is just one example of a Python constraint library, where now, rather than having to do all the work from scratch inside of schedule1.py, I’m just taking advantage of a library that implements a lot of these ideas already. So here, I create a new problem, add variables to it with particular domains. I add a whole bunch of these individual constraints, where I call addConstraint and pass in a function describing what the constraint is. And the constraint basically says the function that takes two variables, x and y, and makes sure that x is not equal to y, enforcing the idea that these two classes cannot have exams on the same day. And then, for any constraint satisfaction problem, I can call getSolutions to get all the solutions to that problem. And then, for each of those solutions, print out what that solution happens to be. And if I run python schedule1.py, and now see, there are actually a number of different solutions that can be used to solve the problem. There are, in fact, six different solutions, assignments of variables to values that will give me a satisfactory answer to this constraint satisfaction problem. So this then was an implementation of a very basic backtracking search method, where really we just went through each of the variables, picked one that wasn’t assigned, tried the possible values the variable could take on. And then, if it worked, if it didn’t violate any constraints, then we kept trying other variables. And if ever we hit a dead end, we had to backtrack. But ultimately, we might be able to be a little bit more intelligent about how we do this in order to improve the efficiency of how we solve these sorts of problems. And one thing we might imagine trying to do is going back to this idea of inference, using the knowledge we know to be able to draw conclusions in order to make the rest of the problem solving process a little bit easier. And let’s now go back to where we got stuck in this problem the first time. When we were solving this constraint satisfaction problem, we dealt with B. And then we went on to D. And we went ahead and just assigned D to Monday, because that seemed to work with the assignment so far. It didn’t violate any constraints. But it turned out that later on that choice turned out to be a bad one, that that choice wasn’t consistent with the rest of the values that we could take on here. And the question is, is there anything we could do to avoid getting into a situation like this, avoid trying to go down a path that’s ultimately not going to lead anywhere by taking advantage of knowledge that we have initially? And it turns out we do have that kind of knowledge. We can look at just the structure of this graph so far. And we can say that right now C’s domain, for example, contains values Monday, Tuesday, and Wednesday. And based on those values, we can say that this graph is not arc consistent. Recall that arc consistency is all about making sure that for every possible value for a particular node, that there is some other value that we are able to choose. And as we can see here, Monday and Tuesday are not going to be possible values that we can choose for C. They’re not going to be consistent with a node like B, for example, because B is equal to Tuesday, which means that C cannot be Tuesday. And because A is equal to Monday, C also cannot be Monday. So using that information, by making C arc consistent with A and B, we could remove Monday and Tuesday from C’s domain and just leave C with Wednesday, for example. And if we continued to try and enforce arc consistency, we’d see there are some other conclusions we can draw as well. We see that B’s only option is Tuesday and C’s only option is Wednesday. And so if we want to make E arc consistent, well, E can’t be Tuesday, because that wouldn’t be arc consistent with B. And E can’t be Wednesday, because that wouldn’t be arc consistent with C. So we can go ahead and say E and just set that equal to Monday, for example. And then we can begin to do this process again and again, that in order to make D arc consistent with B and E, then D would have to be Wednesday. That’s the only possible option. And likewise, we can make the same judgments for F and G as well. And it turns out that without having to do any additional search, just by enforcing arc consistency, we were able to actually figure out what the assignment of all the variables should be without needing to backtrack at all. And the way we did that is by interleaving this search process and the inference step, by this step of trying to enforce arc consistency. And the algorithm to do this is often called just the maintaining arc consistency algorithm, which just enforces arc consistency every time we make a new assignment of a value to an existing variable. So sometimes we can enforce our consistency using that AC3 algorithm at the very beginning of the problem before we even begin searching in order to limit the domain of the variables in order to make it easier to search. But we can also take advantage of the interleaving of enforcing our consistency with search such that every time in the search process we make a new assignment, we go ahead and enforce arc consistency as well to make sure that we’re just eliminating possible values from domains whenever possible. And how do we do this? Well, this is really equivalent to just every time we make a new assignment to a variable x. We’ll go ahead and call our AC3 algorithm, this algorithm that enforces arc consistency on a constraint satisfaction problem. And we go ahead and call that, starting it with a Q, not of all of the arcs, which we did originally, but just of all of the arcs that we want to make arc consistent with x, this thing that we have just made an assignment to. So all arcs yx, where y is a neighbor of x, something that shares a constraint with x, for example. And by maintaining arc consistency in the backtracking search process, we can ultimately make our search process a little bit more efficient. And so this is the revised version of this backtrack function. Same as before, the changes here are highlighted in yellow. Every time we add a new variable equals value to our assignment, we’ll go ahead and run this inference procedure, which might do a number of different things. But one thing it could do is call the maintaining arc consistency algorithm to make sure we’re able to enforce arc consistency on the problem. And we might be able to draw new inferences as a result of that process. Get new guarantees of this variable needs to be equal to that value, for example. That might happen one time. It might happen many times. And so long as those inferences are not a failure, as long as they don’t lead to a situation where there is no possible way to make forward progress, well, then we can go ahead and add those inferences, those new knowledge, that new pieces of knowledge I know about what variables should be assigned to what values, I can add those to the assignment in order to more quickly make forward progress by taking advantage of information that I can just deduce, information I know based on the rest of the structure of the constraint satisfaction problem. And the only other change I’ll need to make now is if it turns out this value doesn’t work, well, then down here, I’ll go ahead and need to remove not only variable equals value, but also any of those inferences that I made, remove that from the assignment as well. So here, then, we’re often able to solve the problem by backtracking less than we might originally have needed to, just by taking advantage of the fact that every time we make a new assignment of one variable to one value, that might reduce the domains of other variables as well. And we can use that information to begin to more quickly draw conclusions in order to try and solve the problem more efficiently as well. And it turns out there are other heuristics we can use to try and improve the efficiency of our search process as well. And it really boils down to a couple of these functions that I’ve talked about, but we haven’t really talked about how they’re working. And one of them is this function here, select unassigned variable, where we’re selecting some variable in the constraint satisfaction problem that has not yet been assigned. So far, I’ve sort of just been selecting variables randomly, just like picking one variable and one unassigned variable in order to decide, all right, this is the variable that we’re going to assign next, and then going from there. But it turns out that by being a little bit intelligent, by following certain heuristics, we might be able to make the search process much more efficient just by choosing very carefully which variable we should explore next. So some of those heuristics include the minimum remaining values, or MRV heuristic, which generally says that if I have a choice between which variable I should select, I should select the variable with the smallest domain, the variable that has the fewest number of remaining values left. With the idea being, if there are only two remaining values left, well, I may as well prune one of them very quickly in order to get to the other, because one of those two has got to be the solution, if a solution does exist. Sometimes minimum remaining values might not give a conclusive result if all the nodes have the same number of remaining values, for example. And in that case, another heuristic that can be helpful to look at is the degree heuristic. The degree of a node is the number of nodes that are attached to that node, the number of nodes that are constrained by that particular node. And if you imagine which variable should I choose, should I choose a variable that has a high degree that is connected to a lot of different things, or a variable with a low degree that is not connected to a lot of different things, well, it can often make sense to choose the variable that has the highest degree that is connected to the most other nodes as the thing you would search first. Why is that the case? Well, it’s because by choosing a variable with a high degree, that is immediately going to constrain the rest of the variables more, and it’s more likely to be able to eliminate large sections of the state space that you don’t need to search through at all. So what could this actually look like? Let’s go back to this search problem here. In this particular case, I’ve made an assignment here. I’ve made an assignment here. And the question is, what should I look at next? And according to the minimum remaining values heuristic, what I should choose is the variable that has the fewest remaining possible values. And in this case, that’s this node here, node C, that only has one variable left in this domain, which in this case is Wednesday, which is a very reasonable choice of a next assignment to make, because I know it’s the only option, for example. I know that the only possible option for C is Wednesday, so I may as well make that assignment and then potentially explore the rest of the space after that. But meanwhile, at the very start of the problem, when I didn’t have any knowledge of what nodes should have what values yet, I still had to pick what node should be the first one that I try and assign a value to. And I arbitrarily just chose the one at the top, node A originally. But we can be more intelligent about that. We can look at this particular graph. All of them have domains of the same size, domain of size 3. So minimum remaining values doesn’t really help us there. But we might notice that node E has the highest degree. It is connected to the most things. And so perhaps it makes sense to begin our search, rather than starting at node A at the very top, start with the node with the highest degree. Start by searching from node E, because from there, that’s going to much more easily allow us to enforce the constraints that are nearby, eliminating large portions of the search space that I might not need to search through. And in fact, by starting with E, we can immediately then assign other variables. And following that, we can actually assign the rest of the variables without needing to do any backtracking at all, even if I’m not using this inference procedure. Just by starting with a node that has a high degree, that is going to very quickly restrict the possible values that other nodes can take on. So that then is how we can go about selecting an unassigned variable in a particular order. Rather than randomly picking a variable, if we’re a little bit intelligent about how we choose it, we can make our search process much, much more efficient by making sure we don’t have to search through portions of the search space that ultimately aren’t going to matter. The other variable we haven’t really talked about, the other function here, is this domain values function. This domain values function that takes a variable and gives me back a sequence of all of the values inside of that variable’s domain. The naive way to approach it is what we did before, which is just go in order, go Monday, then Tuesday, then Wednesday. But the problem is that going in that order might not be the most efficient order to search in, that sometimes it might be more efficient to choose values that are likely to be solutions first and then go to other values. Now, how do you assess whether a value is likelier to lead to a solution or less likely to lead to a solution? Well, one thing you can take a look at is how many constraints get added, how many things get removed from domains as you make this new assignment of a variable to this particular value. And the heuristic we can use here is the least constraining value heuristic, which is the idea that we should return variables in order based on the number of choices that are ruled out for neighboring values. And I want to start with the least constraining value, the value that rules out the fewest possible options. And the idea there is that if all I care about doing is finding a solution, if I start with a value that rules out a lot of other choices, I’m ruling out a lot of possibilities that maybe is going to make it less likely that this particular choice leads to a solution. Whereas on the other hand, if I have a variable and I start by choosing a value that doesn’t rule out very much, well, then I still have a lot of space where there might be a solution that I could ultimately find. And this might seem a little bit counterintuitive and a little bit at odds with what we were talking about before, where I said, when you’re picking a variable, you should pick the variable that is going to have the fewest possible values remaining. But here, I want to pick the value for the variable that is the least constraining. But the general idea is that when I am picking a variable, I would like to prune large portions of the search space by just choosing a variable that is going to allow me to quickly eliminate possible options. Whereas here, within a particular variable, as I’m considering values that that variable could take on, I would like to just find a solution. And so what I want to do is ultimately choose a value that still leaves open the possibility of me finding a solution to be as likely as possible. By not ruling out many options, I leave open the possibility that I can still find a solution without needing to go back later and backtrack. So an example of that might be in this particular situation here, if I’m trying to choose a variable for a value for node C here, that C is equal to either Tuesday or Wednesday. We know it can’t be Monday because it conflicts with this domain here, where we already know that A is Monday, so C must be Tuesday or Wednesday. And the question is, should I try Tuesday first, or should I try Wednesday first? And if I try Tuesday, what gets ruled out? Well, one option gets ruled out here, a second option gets ruled out here, and a third option gets ruled out here. So choosing Tuesday would rule out three possible options. And what about choosing Wednesday? Well, choosing Wednesday would rule out one option here, and it would rule out one option there. And so I have two choices. I can choose Tuesday that rules out three options, or Wednesday that rules out two options. And according to the least constraining value heuristic, what I should probably do is go ahead and choose Wednesday, the one that rules out the fewest number of possible options, leaving open as many chances as possible for me to eventually find the solution inside of the state space. And ultimately, if you continue this process, we will find the solution, an assignment of variables, two values, that allows us to give each of these exams, each of these classes, an exam date that doesn’t conflict with anyone that happens to be enrolled in two classes at the same time. So the big takeaway now with all of this is that there are a number of different ways we can formulate a problem. The ways we’ve looked at today are we can formulate a problem as a local search problem, a problem where we’re looking at a current node and moving to a neighbor based on whether that neighbor is better or worse than the current node that we are looking at. We looked at formulating problems as linear programs, where just by putting things in terms of equations and constraints, we’re able to solve problems a little bit more efficiently. And we saw formulating a problem as a constraint satisfaction problem, creating this graph of all of the constraints that connect two variables that have some constraint between them, and using that information to be able to figure out what the solution should be. And so the takeaway of all of this now is that if we have some problem in artificial intelligence that we would like to use AI to be able to solve them, whether that’s trying to figure out where hospitals should be or trying to solve the traveling salesman problem, trying to optimize productions and costs and whatnot, or trying to figure out how to satisfy certain constraints, whether that’s in a Sudoku puzzle, or whether that’s in trying to figure out how to schedule exams for a university, or any number of a wide variety of types of problems, if we can formulate that problem as one of these sorts of problems, then we can use these known algorithms, these algorithms for enforcing art consistency and backtracking search, these hill climbing and simulated annealing algorithms, these simplex algorithms and interior point algorithms that can be used to solve linear programs, that we can use those techniques to begin to solve a whole wide variety of problems all in this world of optimization inside of artificial intelligence. This was an introduction to artificial intelligence with Python for today. We will see you next time. [” All right. Welcome back, everyone, to an introduction to artificial intelligence with Python. Now, so far in this class, we’ve used AI to solve a number of different problems, giving AI instructions for how to search for a solution, or how to satisfy certain constraints in order to find its way from some input point to some output point in order to solve some sort of problem. Today, we’re going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea where we are not going to give the computer explicit instructions for how to perform a task, but rather we are going to give the computer access to information in the form of data, or patterns that it can learn from, and let the computer try and figure out what those patterns are, try and understand that data to be able to perform a task on its own. Now, machine learning comes in a number of different forms, and it’s a very wide field. So today, we’ll explore some of the foundational algorithms and ideas that are behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. And supervised learning is a particular type of task. It refers to the task where we give the computer access to a data set, where that data set consists of input-output pairs. And what we would like the computer to do is we would like our AI to be able to figure out some function that maps inputs to outputs. So we have a whole bunch of data that generally consists of some kind of input, some evidence, some information that the computer will have access to. And we would like the computer, based on that input information, to predict what some output is going to be. And we’ll give it some data so that the computer can train its model on and begin to understand how it is that this information works and how it is that the inputs and outputs relate to each other. But ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to get those outputs. There are a couple of different tasks within supervised learning. The one we’ll focus on and start with is known as classification. And classification is the problem where, if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, where you can decide what those categories are, and it’s the job of the computer to predict what those categories are going to be. So that might be, for example, I give you information about a bank note, like a US dollar, and I’m asking you to predict for me, does it belong to the category of authentic bank notes, or does it belong to the category of counterfeit bank notes? You need to categorize the input, and we want to train the computer to figure out some function to be able to do that calculation. Another example might be the case of weather, someone we’ve talked about a little bit so far in this class, where we would like to predict on a given day, is it going to rain on that day? Is it going to be cloudy on that day? And before we’ve seen how we could do this, if we really give the computer all the exact probabilities for if these are the conditions, what’s the probability of rain? Oftentimes, we don’t have access to that information, though. But what we do have access to is a whole bunch of data. So if we wanted to be able to predict something like, is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining and ask the computer to look for patterns in that data. So what might that data look like? Well, we could structure that data in a table like this. This might be what our table looks like, where for any particular day, going back, we have information about that day’s humidity, that day’s air pressure, and then importantly, we have a label, something where the human has said that on this particular day, it was raining or it was not raining. So you could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points, said that on this day, when these were the values for the humidity and pressure, that day was a rainy day and this day was a not rainy day. And what we would like the computer to be able to do then is to be able to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day? Does that day look more like it’s going to be a day that rains or does it look more like a day when it’s not going to rain? Put a little bit more mathematically, you can think of this as a function that takes two inputs, the inputs being the data points that our computer will have access to, things like humidity and pressure. So we could write a function f that takes as input both humidity and pressure. And then the output is going to be what category we would ascribe to these particular input points, what label we would associate with that input. So we’ve seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict, is it going to rain or is it not going to rain? And that’s information that we just gathered from the world. We measured on various different days what the humidity and pressure were. We observed whether or not we saw rain or no rain on that particular day. And this function f is what we would like to approximate. Now, the computer and we humans don’t really know exactly how this function f works. It’s probably quite a complex function. So what we’re going to do instead is attempt to estimate it. We would like to come up with a hypothesis function. h, which is going to try to approximate what f does. We want to come up with some function h that will also take the same inputs and will also produce an output, rain or no rain. And ideally, we’d like these two functions to agree as much as possible. So the goal then of the supervised learning classification tasks is going to be to figure out, what does that function h look like? How can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point? So where could you begin doing this? Well, a reasonable thing to do, especially in this situation, I have two numerical values, is I could try to plot this on a graph that has two axes, an x-axis and a y-axis. And in this case, we’re just going to be using two numerical values as input. But these same types of ideas scale as you add more and more inputs as well. We’ll be plotting things in two dimensions. But as we soon see, you could add more inputs and just imagine things in multiple dimensions. And while we humans have trouble conceptualizing anything really beyond three dimensions, at least visually, a computer has no problem with trying to imagine things in many, many more dimensions, that for a computer, each dimension is just some separate number that it is keeping track of. So it wouldn’t be unreasonable for a computer to think in 10 dimensions or 100 dimensions to be able to try to solve a problem. But for now, we’ve got two inputs. So we’ll graph things along two axes, an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let’s take all of the days that were raining and just try to plot them on this graph and see where they fall on this graph. And here might be all of the rainy days, where each rainy day is one of these blue dots here that corresponds to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not rainy. So take all the not rainy days, figure out what their values were for each of these two inputs, and go ahead and plot them on this graph as well. And I’ve here plotted them in red. So blue here stands for a rainy day. Red here stands for a not rainy day. And this then is the input that my computer has access to all of this input. And what I would like the computer to be able to do is to train a model such that if I’m ever presented with a new input that doesn’t have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day? And if you’re just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category, or does it look like it belongs to the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it’s close to other blue dots. And that’s not a very formal notion, but it’s a notion that we’ll formalize in just a moment. That because it seems to be close to this blue dot here, nothing else is closer to it, then we might say that it should be categorized as blue. It should fall into that category of, I think that day is going to be a rainy day based on that input. Might not be totally accurate, but it’s a pretty good guess. And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It’s an algorithm for solving these classification-type problems. And in nearest neighbor classification, it’s going to perform this algorithm. What it will do is, given an input, it will choose the class of the nearest data point to that input. By class, we just here mean category, like rain or no rain, counterfeit or not counterfeit. And we choose the category or the class based on the nearest data point. So given all that data, we just looked at, is the nearest data point a blue point or is it a red point? And depending on the answer to that question, we were able to make some sort of judgment. We were able to say something like, we think it’s going to be blue or we think it’s going to be red. So likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data is red. So we would go ahead and classify this as a red point, not raining. Things get a little bit trickier, though, when you look at a point like this white point over here and you ask the same sort of question. Should it belong to the category of blue points, the rainy days? Or should it belong to the category of red points, the not rainy days? Now, nearest neighbor classification would say the way you solve this problem is look at which point is nearest to that point. You look at this nearest point and say it’s red. It’s a not rainy day. And therefore, according to nearest neighbor classification, I would say that this unlabeled point, well, that should also be red. It should also be classified as a not rainy day. But your intuition might think that that’s a reasonable judgment to make, that it’s the closest thing is a not rainy day. So may as well guess that it’s a not rainy day. But it’s probably also reasonable to look at the bigger picture of things to say, yes, it is true that the nearest point to it was a red point. But it’s surrounded by a whole bunch of other blue points. So looking at the bigger picture, there’s potentially an argument to be made that this point should actually be blue. And with only this data, we actually don’t know for sure. We are given some input, something we’re trying to predict. And we don’t necessarily know what the output is going to be. So in this case, which one is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors can sometimes give us a better result. And so there’s a variant on the nearest neighbor classification algorithm that is known as the K nearest neighbor classification algorithm, where K is some parameter, some number that we choose, for how many neighbors are we going to look at. So one nearest neighbor classification is what we saw before. Just pick the one nearest neighbor and use that category. But with K nearest neighbor classification, where K might be 3, or 5, or 7, to say look at the 3, or 5, or 7 closest neighbors, closest data points to that point, works a little bit differently. This algorithm, we’ll give it an input. Choose the most common class out of the K nearest data points to that input. So if we look at the five nearest points, and three of them say it’s raining, and two of them say it’s not raining, we’ll go with the three instead of the two, because each one effectively gets one vote towards what they believe the category ought to be. And ultimately, you choose the category that has the most votes as a consequence of that. So K nearest neighbor classification, fairly straightforward one to understand intuitively. You just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work under every situation. And so one of the things we’ll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning, a number of different algorithms that we can apply, all solving the same type of problem, all solving some kind of classification problem where we want to take inputs and organize it into different categories. And no one algorithm is necessarily always going to be better than some other algorithm. They each have their trade-offs. And maybe depending on the data, one type of algorithm is going to be better suited to trying to model that information than some other algorithm. And so this is what a lot of machine learning research ends up being about, that when you’re trying to apply machine learning techniques, you’re often looking not just at one particular algorithm, but trying multiple different algorithms, trying to see what is going to give you the best results for trying to predict some function that maps inputs to outputs. So what then are the drawbacks of K nearest neighbor classification? Well, there are a couple. One might be that in a naive approach, at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now, there are ways of trying to get around that. There are data structures that can help to make it more quickly to be able to find these neighbors. There are also techniques you can use to try and prune some of this data, remove some of the data points so that you’re only left with the relevant data points just to make it a little bit easier. But ultimately, what we might like to do is come up with another way of trying to do this classification. And one way of trying to do the classification was looking at what are the neighboring points. But another way might be to try to look at all of the data and see if we can come up with some decision boundary, some boundary that will separate the rainy days from the not rainy days. And in the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is just find some line, find some separator that divides the rainy days, the blue points over here, from the not rainy days, the red points over there. We’re now trying a different approach in contrast with the nearest neighbor approach, which just looked at local data around the input data point that we cared about. Now what we’re doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now sometimes it’ll actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days. Realistically, though, this is probably cleaner than many data sets will actually be. Oftentimes, data is messier. There are outliers. There’s random noise that happens inside of a particular system. And what we’d like to do is still be able to figure out what a line might look like. So in practice, the data will not always be linearly separable. Or linearly separable refers to some data set where I could draw a line just to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points that are on this side of the line and some not rainy points that are on that side of the line. And there may not be a line that perfectly separates what path of the inputs from the other half, that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job. And we’ll try to formalize a little bit later what we mean when we say something like this line does a pretty good job of trying to make that prediction. But for now, let’s just say we’re looking for a line that does as good of a job as we can at trying to separate one category of things from another category of things. So let’s now try to formalize this a little bit more mathematically. We want to come up with some sort of function, some way we can define this line. And our inputs are things like humidity and pressure in this case. So our inputs we might call x1 is going to represent humidity, and x2 is going to represent pressure. These are inputs that we are going to provide to our machine learning algorithm. And given those inputs, we would like for our model to be able to predict some sort of output. And we are going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine if we didn’t just have two inputs, we had three or four or five inputs or more, we could have this hypothesis function take all of those as input. And we’ll see examples of that a little bit later as well. And now the question is, what does this hypothesis function do? Well, it really just needs to measure, is this data point on one side of the boundary, or is it on the other side of the boundary? And how do we formalize that boundary? Well, the boundary is generally going to be a linear combination of these input variables, at least in this particular case. So what we’re trying to do when we say linear combination is take each of these inputs and multiply them by some number that we’re going to have to figure out. We’ll generally call that number a weight for how important should these variables be in trying to determine the answer. So we’ll weight each of these variables with some weight, and we might add a constant to it just to try and make the function a little bit different. And the result, we just need to compare. Is it greater than 0, or is it less than 0 to say, does it belong on one side of the line or the other side of the line? So what that mathematical expression might look like is this. We would take each of my variables, x1 and x2, multiply them by some weight. I don’t yet know what that weight is, but it’s going to be some number, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare. If we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say, no rain. So the key here, then, is that this expression is how we are going to calculate whether it’s a rainy day or not. We’re going to do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, see if the result is greater than or equal to 0. And using that result of that expression, we’re able to determine whether it’s raining or not raining. This expression here is in this case going to refer to just some line. If you were to plot that graphically, it would just be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but these weights are really what determine the shape of that line, the slope of that line, and what that line actually looks like. So we then would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day’s humidity and pressure, then you end up with a result that is greater than or equal to 0. And we would like it such that if we passed into our hypothesis function a not rainy day’s inputs, then the output that we get should be not raining. So before we get there, let’s try and formalize this a little bit more mathematically just to get a sense for how it is that you’ll often see this if you ever go further into supervised machine learning and explore this idea. One thing is that generally for these categories, we’ll sometimes just use the names of the categories like rain and not rain. Often mathematically, if we’re trying to do comparisons between these things, it’s easier just to deal in the world of numbers. So we could just say 1 and 0, 1 for raining, 0 for not raining. So we do all this math. And if the result is greater than or equal to 0, we’ll go ahead and say our hypothesis function outputs 1, meaning raining. And otherwise, it outputs 0, meaning not raining. And oftentimes, this type of expression will instead express using vector mathematics. And all a vector is, if you’re not familiar with the term, is it refers to a sequence of numerical values. You could represent that in Python using a list of numerical values or a tuple with numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, are all of these individual weights, w0, w1, and w2. So we could construct what we’ll call a weight vector, and we’ll see why this is useful in a moment, called w, generally represented using a boldface w, that is just a sequence of these three weights, weight 0, weight 1, and weight 2. And to be able to calculate, based on those weights, whether we think a day is raining or not raining, we’re going to multiply each of those weights by one of our input variables. That w2, this weight, is going to be multiplied by input variable x2. w1 is going to be multiplied by input variable x1. And w0, well, it’s not being multiplied by anything. But to make sure the vectors are the same length, and we’ll see why that’s useful in just a second, we’ll just go ahead and say w0 is being multiplied by 1. Because you can multiply by something by 1, and you end up getting the exact same number. So in addition to the weight vector w, we’ll also have an input vector that we’ll call x that has three values, 1, again, because we’re just multiplying w0 by 1 eventually, and then x1 and x2. So here, then, we’ve represented two distinct vectors, a vector of weights that we need to somehow learn. The goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers, and it would produce a function that tries to predict rain or not rain, but it probably wouldn’t be very good. What we want to do is come up with a good choice of these weights so that we’re able to do the accurate predictions. And then this input vector represents a particular input to the function, a data point for which we would like to estimate, is that day a rainy day, or is that day a not rainy day? And so that’s going to vary just depending on what input is provided to our function, what it is that we are trying to estimate. And then to do the calculation, we want to calculate this expression here, and it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together, w0 multiply it by 1, w1 multiply it by x1, w2 multiply it by x2, and that’s why these vectors need to be the same length. And then we just add all of the results together. So the dot product of w and x, our weight vector and our input vector, that’s just going to be w0 times 1, or just w0, plus w1 times x1, multiplying these two terms together, plus w2 times x2, multiplying those terms together. So we have our weight vector, which we need to figure out. We need our machine learning algorithm to figure out what the weights should be. We have the input vector representing the data point that we’re trying to predict a category for, predict a label for. And we’re able to do that calculation by taking this dot product, which you’ll often see represented in vector form. But if you haven’t seen vectors before, you can think of it as identical to just this mathematical expression, just doing the multiplication, adding the results together, and then seeing whether the result is greater than or equal to 0 or not. This expression here is identical to the expression that we’re calculating to see whether or not that answer is greater than or equal to 0 in this case. And so for that reason, you’ll often see the hypothesis function written as something like this, a simpler representation where the hypothesis takes as input some input vector x, some humidity and pressure for some day. And we want to predict an output like rain or no rain or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input. If it’s greater than or equal to 0, we’ll go ahead and say the output is 1. Otherwise, the output is going to be 0. And this hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we’ll end up getting a different hypothesis. If we choose the weights randomly, we’re probably not going to get a very good hypothesis function. We’ll get a 1 or a 0. But it’s probably not accurately going to reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of trying to estimate whether we think the output of the function should be a 1 or a 0. And so the question, then, is how to figure out what these weights should be, how to be able to tune those parameters. And there are a number of ways you can do that. One of the most common is known as the perceptron learning rule. And we’ll see more of this later. But the idea of the perceptron learning rule, and we’re not going to get too deep into the mathematics, we’ll mostly just introduce it more conceptually, is to say that given some data point that we would like to learn from, some data point that has an input x and an output y, where y is like 1 for rain or 0 for not rain, then we’re going to update the weights. And we’ll look at the formula in just a moment. But the big picture idea is that we can start with random weights, but then learn from the data. Take the data points one at a time. And for each one of the data points, figure out, all right, what parameters do we need to change inside of the weights in order to better match that input point. And so that is the value of having access to a lot of data in the supervised machine learning algorithm, is that you take each of the data points and maybe look at them multiple times and constantly try and figure out whether you need to shift your weights in order to better create some weight vector that is able to correctly or more accurately try to estimate what the output should be, whether we think it’s going to be raining or whether we think it’s not going to be raining. So what does that weight update look like? Without going into too much of the mathematics, we’re going to update each of the weights to be the result of the original weight plus some additional expression. And to understand this expression, y, well, y is what the actual output is. And hypothesis of x, the input, that’s going to be what we thought the input was. And so I can replace this by saying what the actual value was minus what our estimate was. And based on the difference between the actual value and what our estimate was, we might want to change our hypothesis, change the way that we do that estimation. If the actual value and the estimate were the same thing, meaning we were correctly able to predict what category this data point belonged to, well, then actual value minus estimate, that’s just going to be 0, which means this whole term on the right-hand side goes to be 0, and the weight doesn’t change. Weight i, where i is like weight 1 or weight 2 or weight 0, weight i just stays at weight i. And none of the weights change if we were able to correctly predict what category the input belonged to. But if our hypothesis didn’t correctly predict what category the input belonged to, well, then maybe then we need to make some changes, adjust the weights so that we’re better able to predict this kind of data point in the future. And what is the way we might do that? Well, if the actual value was bigger than the estimate, then, and for now we’ll go ahead and assume that these x’s are positive values, then if the actual value was bigger than the estimate, well, that means we need to increase the weight in order to make it such that the output is bigger, and therefore we’re more likely to get to the right actual value. And so if the actual value is bigger than the estimate, then actual value minus estimate, that’ll be a positive number. And so you imagine we’re just adding some positive number to the weight just to increase it ever so slightly. And likewise, the inverse case is true, that if the actual value was less than the estimate, the actual value was 0, but we estimated 1, meaning it actually was not raining, but we predicted it was going to be raining. Well, then we want to decrease the value of the weight, because then in that case, we want to try and lower the total value of computing that dot product in order to make it less likely that we would predict that it would actually be raining. So no need to get too deep into the mathematics of that, but the general idea is that every time we encounter some data point, we can adjust these weights accordingly to try and make the weights better line up with the actual data that we have access to. And you can repeat this process with data point after data point until eventually, hopefully, your algorithm converges to some set of weights that do a pretty good job of trying to figure out whether a day is going to be rainy or not raining. And just as a final point about this particular equation, this value alpha here is generally what we’ll call the learning rate. It’s just some parameter, some number we choose for how quickly we’re actually going to be updating these weight values. So that if alpha is bigger, then we’re going to update these weight values by a lot. And if alpha is smaller, then we’ll update the weight values by less. And you can choose a value of alpha. Depending on the problem, different values might suit the situation better or worse than others. So after all of that, after we’ve done this training process of take all this data and using this learning rule, look at all the pieces of data and use each piece of data as an indication to us of do the weights stay the same, do we increase the weights, do we decrease the weights, and if so, by how much? What you end up with is effectively a threshold function. And we can look at what the threshold function looks like like this. On the x-axis here, we have the output of that function, taking the weights, taking the dot product of it with the input. And on the y-axis, we have what the output is going to be, 0, which in this case represented not raining, and 1, which in this case represented raining. And the way that our hypothesis function works is it calculates this value. And if it’s greater than 0 or greater than some threshold value, then we declare that it’s a rainy day. And otherwise, we declare that it’s a not rainy day. And this then graphically is what that function looks like, that initially when the value of this dot product is small, it’s not raining, it’s not raining, it’s not raining. But as soon as it crosses that threshold, we suddenly say, OK, now it’s raining, now it’s raining, now it’s raining. And the way to interpret this kind of representation is that anything on this side of the line, that would be the category of data points where we say, yes, it’s raining. Anything that falls on this side of the line are the data points where we would say, it’s not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of trying to do this estimation. But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input. And the output we get is raining or not raining. And there’s no room for anywhere in between. And maybe that’s what you want. Maybe all you want is given some data point, you would like to be able to classify it into one or two or more of these various different categories. But it might also be the case that you care about knowing how strong that prediction is, for example. So if we go back to this instance here, where we have rainy days on this side of the line, not rainy days on that side of the line, you might imagine that let’s look now at these two white data points. This data point here that we would like to predict a label or a category for. And this data point over here that we would also like to predict a label or a category for. It seems likely that you could pretty confidently say that this data point, that should be a rainy day. Seems close to the other rainy days if we’re going by the nearest neighbor strategy. It’s on this side of the line if we’re going by the strategy of just saying, which side of the line does it fall on by figuring out what those weights should be. And if we’re using the line strategy of just which side of the line does it fall on, which side of this decision boundary, well, we’d also say that this point here is also a rainy day because it falls on the side of the line that corresponds to rainy days. But it’s likely that even in this case, we would know that we don’t feel nearly as confident about this data point on the left as compared to this data point on the right. That for this one on the right, we can feel very confident that yes, it’s a rainy day. This one, it’s pretty close to the line if we’re judging just by distance. And so you might be less sure. But our threshold function doesn’t allow for a notion of less sure or more sure about something. It’s what we would call a hard threshold. It’s once you’ve crossed this line, then immediately we say, yes, this is going to be a rainy day. Anywhere before it, we’re going to say it’s not a rainy day. And that may not be helpful in a number of cases. One, this is not a particularly easy function to deal with. As you get deeper into the world of machine learning and are trying to do things like taking derivatives of these curves with this type of function makes things challenging. But the other challenge is that we don’t really have any notion of gradation between things. We don’t have a notion of yes, this is a very strong belief that it’s going to be raining as opposed to it’s probably more likely than not that it’s going to be raining, but maybe not totally sure about that either. So what we can do by taking advantage of a technique known as logistic regression is instead of using this hard threshold type of function, we can use instead a logistic function, something we might call a soft threshold. And that’s going to transform this into looking something a little more like this, something that more nicely curves. And as a result, the possible output values are no longer just 0 and 1, 0 for not raining, 1 for raining. But you can actually get any real numbered value between 0 and 1. But if you’re way over on this side, then you get a value of 0. OK, it’s not going to be raining, and we’re pretty sure about that. And if you’re over on this side, you get a value of 1. And yes, we’re very sure that it’s going to be raining. But in between, you could get some real numbered value, where a value like 0.7 might mean we think it’s going to rain. It’s more probable that it’s going to rain than not based on the data. But we’re not as confident as some of the other data points might be. So one of the advantages of the soft threshold is that it allows us to have an output that could be some real number that potentially reflects some sort of probability, the likelihood that we think that this particular data point belongs to that particular category. And there are some other nice mathematical properties of that as well. So that then is two different approaches to trying to solve this type of classification problem. One is this nearest neighbor type of approach, where you just take a data point and look at the data points that are nearby to try and estimate what category we think it belongs to. And the other approach is the approach of saying, all right, let’s just try and use linear regression, figure out what these weights should be, adjust the weights in order to figure out what line or what decision boundary is going to best separate these two categories. It turns out that another popular approach, a very popular approach if you just have a data set and you want to start trying to do some learning on it, is what we call the support vector machine. And we’re not going to go too much into the mathematics of the support vector machine, but we’ll at least explore it graphically to see what it is that it looks like. And the idea or the motivation behind the support vector machine is the idea that there are actually a lot of different lines that we could draw, a lot of different decision boundaries that we could draw to separate two groups. So for example, I had the red data points over here and the blue data points over here. One possible line I could draw is a line like this, that this line here would separate the red points from the blue points. And it does so perfectly. All the red points are on one side of the line. All the blue points are on the other side of the line. But this should probably make you a little bit nervous. If you come up with a model and the model comes up with a line that looks like this. And the reason why is that you worry about how well it’s going to generalize to other data points that are not necessarily in the data set that we have access to. For example, if there was a point that fell like right here, for example, on the right side of the line, well, then based on that, we might want to guess that it is, in fact, a red point, but it falls on the side of the line where instead we would estimate that it’s a blue point instead. And so based on that, this line is probably not a great choice just because it is so close to these various data points. We might instead prefer like a diagonal line that just goes diagonally through the data set like we’ve seen before. But there too, there’s a lot of diagonal lines that we could draw as well. For example, I could draw this diagonal line here, which also successfully separates all the red points from all of the blue points. From the perspective of something like just trying to figure out some setting of weights that allows us to predict the correct output, this line will predict the correct output for this particular set of data every single time because the red points are on one side, the blue points are on the other. But yet again, you should probably be a little nervous because this line is so close to these red points, even though we’re able to correctly predict on the input data, if there was a point that fell somewhere in this general area, our algorithm, this model, would say that, yeah, we think it’s a blue point, when in actuality, it might belong to the red category instead just because it looks like it’s close to the other red points. What we really want to be able to say, given this data, how can you generalize this as best as possible, is to come up with a line like this that seems like the intuitive line to draw. And the reason why it’s intuitive is because it seems to be as far apart as possible from the red data and the blue data. So that if we generalize a little bit and assume that maybe we have some points that are different from the input but still slightly further away, we can still say that something on this side probably red, something on that side probably blue, and we can make those judgments that way. And that is what support vector machines are designed to do. They’re designed to try and find what we call the maximum margin separator, where the maximum margin separator is just some boundary that maximizes the distance between the groups of points rather than come up with some boundary that’s very close to one set or the other, where in the case before, we wouldn’t have cared. As long as we’re categorizing the input well, that seems all we need to do. The support vector machine will try and find this maximum margin separator, some way of trying to maximize that particular distance. And it does so by finding what we call the support vectors, which are the vectors that are closest to the line, and trying to maximize the distance between the line and those particular points. And it works that way in two dimensions. It also works in higher dimensions, where we’re not looking for some line that separates the two data points, but instead looking for what we generally call a hyperplane, some decision boundary, effectively, that separates one set of data from the other set of data. And this ability of support vector machines to work in higher dimensions actually has a number of other applications as well. But one is that it helpfully deals with cases where data may not be linearly separable. So we talked about linear separability before, this idea that you can take data and just draw a line or some linear combination of the inputs that allows us to perfectly separate the two sets from each other. There are some data sets that are not linearly separable. And some were even two. You would not be able to find a good line at all that would try to do that kind of separation. Something like this, for example. Or if you imagine here are the red points and the blue points around it. If you try to find a line that divides the red points from the blue points, it’s actually going to be difficult, if not impossible, to do that any line you choose, well, if you draw a line here, then you ignore all of these blue points that should actually be blue and not red. Anywhere else you draw a line, there’s going to be a lot of error, a lot of mistakes, a lot of what we’ll soon call loss to that line that you draw, a lot of points that you’re going to categorize incorrectly. What we really want is to be able to find a better decision boundary that may not be just a straight line through this two dimensional space. And what support vector machines can do is they can begin to operate in higher dimensions and be able to find some other decision boundary, like the circle in this case, that actually is able to separate one of these sets of data from the other set of data a lot better. So oftentimes in data sets where the data is not linearly separable, support vector machines by working in higher dimensions can actually figure out a way to solve that kind of problem effectively. So that then, three different approaches to trying to solve these sorts of problems. We’ve seen support vector machines. We’ve seen trying to use linear regression and the perceptron learning rule to be able to figure out how to categorize inputs and outputs. We’ve seen the nearest neighbor approach. No one necessarily better than any other again. It’s going to depend on the data set, the information you have access to. It’s going to depend on what the function looks like that you’re ultimately trying to predict. And this is where a lot of research and experimentation can be involved in trying to figure out how it is to best perform that kind of estimation. But classification is only one of the tasks that you might encounter in supervised machine learning. Because in classification, what we’re trying to predict is some discrete category. We’re trying to predict red or blue, rain or not rain, authentic or counterfeit. But sometimes what we want to predict is a real numbered value. And for that, we have a related problem, not classification, but instead known as regression. And regression is the supervised learning problem where we try and learn a function mapping inputs to outputs same as before. But instead of the outputs being discrete categories, things like rain or not rain, in a regression problem, the output values are generally continuous values, some real number that we would like to predict. This happens all the time as well. You might imagine that a company might take this approach if it’s trying to figure out, for instance, what the effect of its advertising is. How do advertising dollars spent translate into sales for the company’s product, for example? And so they might like to try to predict some function that takes as input the amount of money spent on advertising. And here, we’re just going to use one input. But again, you could scale this up to many more inputs as well if you have a lot of different kinds of data you have access to. And the goal is to learn a function that given this amount of spending on advertising, we’re going to get this amount in sales. And you might judge, based on having access to a whole bunch of data, like for every past month, here is how much we spent on advertising, and here is what sales were. And we would like to predict some sort of hypothesis function that, again, given the amount spent on advertising, we can predict, in this case, some real number, some number estimate of how much sales we expect that company to do in this month or in this quarter or whatever unit of time we’re choosing to measure things in. And so again, the approach to solving this type of problem, we could try using a linear regression type approach where we take this data and we just plot it. On the x-axis, we have advertising dollars spent. On the y-axis, we have sales. And we might just want to try and draw a line that does a pretty good job of trying to estimate this relationship between advertising and sales. And in this case, unlike before, we’re not trying to separate the data points into discrete categories. But instead, in this case, we’re just trying to find a line that approximates this relationship between advertising and sales so that if we want to figure out what the estimated sales are for a particular advertising budget, you just look it up in this line, figure out for this amount of advertising, we would have this amount of sales and just try and make the estimate that way. And so you can try and come up with a line, again, figuring out how to modify the weights using various different techniques to try and make it so that this line fits as well as possible. So with all of these approaches, then, to trying to solve machine learning style problems, the question becomes, how do we evaluate these approaches? How do we evaluate the various different hypotheses that we could come up with? Because each of these algorithms will give us some sort of hypothesis, some function that maps inputs to outputs, and we want to know, how well does that function work? And you can think of evaluating these hypotheses and trying to get a better hypothesis as kind of like an optimization problem. In an optimization problem, as you recall from before, we were either trying to maximize some objective function by trying to find a global maximum, or we were trying to minimize some cost function by trying to find some global minimum. And in the case of evaluating these hypotheses, one thing we might say is that this cost function, the thing we’re trying to minimize, we might be trying to minimize what we would call a loss function. And what a loss function is, is it is a function that is going to estimate for us how poorly our function performs. More formally, it’s like a loss of utility by whenever we predict something that is wrong, that is a loss of utility. That’s going to add to the output of our loss function. And you could come up with any loss function that you want, just some mathematical way of estimating, given each of these data points, given what the actual output is, and given what our projected output is, our estimate, you could calculate some sort of numerical loss for it. But there are a couple of popular loss functions that are worth discussing, just so that you’ve seen them before. When it comes to discrete categories, things like rain or not rain, counterfeit or not counterfeit, one approaches the 0, 1 loss function. And the way that works is for each of the data points, our loss function takes as input what the actual output is, like whether it was actually raining or not raining, and takes our prediction into account. Did we predict, given this data point, that it was raining or not raining? And if the actual value equals the prediction, well, then the 0, 1 loss function will just say the loss is 0. There was no loss of utility, because we were able to predict correctly. And otherwise, if the actual value was not the same thing as what we predicted, well, then in that case, our loss is 1. We lost something, lost some utility, because what we predicted was the output of the function, was not what it actually was. And the goal, then, in a situation like this would be to come up with some hypothesis that minimizes the total empirical loss, the total amount that we’ve lost, if you add up for all these data points what the actual output is and what your hypothesis would have predicted. So in this case, for example, if we go back to classifying days as raining or not raining, and we came up with this decision boundary, how would we evaluate this decision boundary? How much better is it than drawing the line here or drawing the line there? Well, we could take each of the input data points, and each input data point has a label, whether it was raining or whether it was not raining. And we could compare it to the prediction, whether we predicted it would be raining or not raining, and assign it a numerical value as a result. So for example, these points over here, they were all rainy days, and we predicted they would be raining, because they fall on the bottom side of the line. So they have a loss of 0, nothing lost from those situations. And likewise, same is true for some of these points over here, where it was not raining and we predicted it would not be raining either. Where we do have loss are points like this point here and that point there, where we predicted that it would not be raining, but in actuality, it’s a blue point. It was raining. Or likewise here, we predicted that it would be raining, but in actuality, it’s a red point. It was not raining. And so as a result, we miscategorized these data points that we were trying to train on. And as a result, there is some loss here. One loss here, there, here, and there, for a total loss of 4, for example, in this case. And that might be how we would estimate or how we would say that this line is better than a line that goes somewhere else or a line that’s further down, because this line might minimize the loss. So there is no way to do better than just these four points of loss if you’re just drawing a straight line through our space. So the 0, 1 loss function checks. Did we get it right? Did we get it wrong? If we got it right, the loss is 0, nothing lost. If we got it wrong, then our loss function for that data point says 1. And we add up all of those losses across all of our data points to get some sort of empirical loss, how much we have lost across all of these original data points that our algorithm had access to. There are other forms of loss as well that work especially well when we deal with more real valued cases, cases like the mapping between advertising budget and amount that we do in sales, for example. Because in that case, you care not just that you get the number exactly right, but you care how close you were to the actual value. If the actual value is you did like $2,800 in sales and you predicted that you would do $2,900 in sales, maybe that’s pretty good. That’s much better than if you had predicted you’d do $1,000 in sales, for example. And so we would like our loss function to be able to take that into account as well, take into account not just whether the actual value and the expected value are exactly the same, but also take into account how far apart they were. And so for that one approach is what we call L1 loss. L1 loss doesn’t just look at whether actual and predicted are equal to each other, but we take the absolute value of the actual value minus the predicted value. In other words, we just ask how far apart were the actual and predicted values, and we sum that up across all of the data points to be able to get what our answer ultimately is. So what might this actually look like for our data set? Well, if we go back to this representation where we had advertising along the x-axis, sales along the y-axis, our line was our prediction, our estimate for any given amount of advertising, what we predicted sales was going to be. And our L1 loss is just how far apart vertically along the sales axis our prediction was from each of the data points. So we could figure out exactly how far apart our prediction was from each of the data points and figure out as a result of that what our loss is overall for this particular hypothesis just by adding up all of these various different individual losses for each of these data points. And our goal then is to try and minimize that loss, to try and come up with some line that minimizes what the utility loss is by judging how far away our estimate amount of sales is from the actual amount of sales. And turns out there are other loss functions as well. One that’s quite popular is the L2 loss. The L2 loss, instead of just using the absolute value, like how far away the actual value is from the predicted value, it uses the square of actual minus predicted. So how far apart are the actual and predicted value? And it squares that value, effectively penalizing much more harshly anything that is a worse prediction. So you imagine if you have two data points that you predict as being one value away from their actual value, as opposed to one data point that you predict as being two away from its actual value, the L2 loss function will more harshly penalize that one that is two away, because it’s going to square, however, much the differences between the actual value and the predicted value. And depending on the situation, you might want to choose a loss function depending on what you care about minimizing. If you really care about minimizing the error on more outlier cases, then you might want to consider something like this. But if you’ve got a lot of outliers, and you don’t necessarily care about modeling them, then maybe an L1 loss function is preferable. But there are trade-offs here that you need to decide, based on a particular set of data. But what you do run the risk of with any of these loss functions, with anything that we’re trying to do, is a problem known as overfitting. And overfitting is a big problem that you can encounter in machine learning, which happens anytime a model fits too closely with a data set, and as a result, fails to generalize. We would like our model to be able to accurately predict data and inputs and output pairs for the data that we have access to. But the reason we wanted to do so is because we want our model to generalize well to data that we haven’t seen before. I would like to take data from the past year of whether it was raining or not raining, and use that data to generalize it towards the future. Say, in the future, is it going to be raining or not raining? Or if I have a whole bunch of data on what counterfeit and not counterfeit US dollar bills look like in the past when people have encountered them, I’d like to train a computer to be able to, in the future, generalize to other dollar bills that I might see as well. And the problem with overfitting is that if you try and tie yourself too closely to the data set that you’re training your model on, you can end up not generalizing very well. So what does this look like? Well, we might imagine the rainy day and not rainy day example again from here, where the blue points indicate rainy days and the red points indicate not rainy days. And we decided that we felt pretty comfortable with drawing a line like this as the decision boundary between rainy days and not rainy days. So we can pretty comfortably say that points on this side more likely to be rainy days, points on that side more likely to be not rainy days. But the loss, the empirical loss, isn’t zero in this particular case because we didn’t categorize everything perfectly. There was this one outlier, this one day that it wasn’t raining, but yet our model still predicts that it is raining. But that doesn’t necessarily mean our model is bad. It just means the model isn’t 100% accurate. If you really wanted to try and find a hypothesis that resulted in minimizing the loss, you could come up with a different decision boundary. It wouldn’t be a line, but it would look something like this. This decision boundary does separate all of the red points from all of the blue points because the red points fall on this side of this decision boundary, the blue points fall on the other side of the decision boundary. But this, we would probably argue, is not as good of a prediction. Even though it seems to be more accurate based on all of the available training data that we have for training this machine learning model, we might say that it’s probably not going to generalize well. That if there were other data points like here and there, we might still want to consider those to be rainy days because we think this was probably just an outlier. So if the only thing you care about is minimizing the loss on the data you have available to you, you run the risk of overfitting. And this can happen in the classification case. It can also happen in the regression case, that here we predicted what we thought was a pretty good line relating advertising to sales, trying to predict what sales were going to be for a given amount of advertising. But I could come up with a line that does a better job of predicting the training data, and it would be something that looks like this, just connecting all of the various different data points. And now there is no loss at all. Now I’ve perfectly predicted, given any advertising, what sales are. And for all the data available to me, it’s going to be accurate. But it’s probably not going to generalize very well. I have overfit my model on the training data that is available to me. And so in general, we want to avoid overfitting. We’d like strategies to make sure that we haven’t overfit our model to a particular data set. And there are a number of ways that you could try to do this. One way is by examining what it is that we’re optimizing for. In an optimization problem, all we do is we say, there is some cost, and I want to minimize that cost. And so far, we’ve defined that cost function, the cost of a hypothesis, just as being equal to the empirical loss of that hypothesis, like how far away are the actual data points, the outputs, away from what I predicted them to be based on that particular hypothesis. And if all you’re trying to do is minimize cost, meaning minimizing the loss in this case, then the result is going to be that you might overfit, that to minimize cost, you’re going to try and find a way to perfectly match all the input data. And that might happen as a result of overfitting on that particular input data. So in order to address this, you could add something to the cost function. What counts as cost will not just loss, but also some measure of the complexity of the hypothesis. The word the complexity of the hypothesis is something that you would need to define for how complicated does our line look. This is sort of an Occam’s razor-style approach where we want to give preference to a simpler decision boundary, like a straight line, for example, some simpler curve, as opposed to something far more complex that might represent the training data better but might not generalize as well. We’ll generally say that a simpler solution is probably the better solution and probably the one that is more likely to generalize well to other inputs. So we measure what the loss is, but we also measure the complexity. And now that all gets taken into account when we consider the overall cost, that yes, something might have less loss if it better predicts the training data, but if it’s much more complex, it still might not be the best option that we have. And we need to come up with some balance between loss and complexity. And for that reason, you’ll often see this represented as multiplying the complexity by some parameter that we have to choose, parameter lambda in this case, where we’re saying if lambda is a greater value, then we really want to penalize more complex hypotheses. Whereas if lambda is smaller, we’re going to penalize more complex hypotheses a little bit, and it’s up to the machine learning programmer to decide where they want to set that value of lambda for how much do I want to penalize a more complex hypothesis that might fit the data a little better. And again, there’s no one right answer to a lot of these things, but depending on the data set, depending on the data you have available to you and the problem you’re trying to solve, your choice of these parameters may vary, and you may need to experiment a little bit to figure out what the right choice of that is ultimately going to be. This process, then, of considering not only loss, but also some measure of the complexity is known as regularization. Regularization is the process of penalizing a hypothesis that is more complex in order to favor a simpler hypothesis that is more likely to generalize well, more likely to be able to apply to other situations that are dealing with other input points unlike the ones that we’ve necessarily seen before. So oftentimes, you’ll see us add some regularizing term to what we’re trying to minimize in order to avoid this problem of overfitting. Now, another way of making sure we don’t overfit is to run some experiments and to see whether or not we are able to generalize our model that we’ve created to other data sets as well. And it’s for that reason that oftentimes when you’re doing a machine learning experiment, when you’ve got some data and you want to try and come up with some function that predicts, given some input, what the output is going to be, you don’t necessarily want to do your training on all of the data you have available to you that you could employ a method known as holdout cross-validation, where in holdout cross-validation, we split up our data. We split up our data into a training set and a testing set. The training set is the set of data that we’re going to use to train our machine learning model. And the testing set is the set of data that we’re going to use in order to test to see how well our machine learning model actually performed. So the learning happens on the training set. We figure out what the parameters should be. We figure out what the right model is. And then we see, all right, now that we’ve trained the model, we’ll see how well it does at predicting things inside of the testing set, some set of data that we haven’t seen before. And the hope then is that we’re going to be able to predict the testing set pretty well if we’re able to generalize based on the training data that’s available to us. If we’ve overfit the training data, though, and we’re not able to generalize, well, then when we look at the testing set, it’s likely going to be the case that we’re not going to predict things in the testing set nearly as effectively. So this is one method of cross-validation, validating to make sure that the work we have done is actually going to generalize to other data sets as well. And there are other statistical techniques we can use as well. One of the downsides of this just hold out cross-validation is if you say I just split it 50-50, I train using 50% of the data and test using the other 50%, or you could choose other percentages as well, is that there is a fair amount of data that I am now not using to train, that I might be able to get a better model as a result, for example. So one approach is known as k-fold cross-validation. In k-fold cross-validation, rather than just divide things into two sets and run one experiment, we divide things into k different sets. So maybe I divide things up into 10 different sets and then run 10 different experiments. So if I split up my data into 10 different sets of data, then what I’ll do is each time for each of my 10 experiments, I will hold out one of those sets of data, where I’ll say, let me train my model on these nine sets, and then test to see how well it predicts on set number 10. And then pick another set of nine sets to train on, and then test it on the other one that I held out, where each time I train the model on everything minus the one set that I’m holding out, and then test to see how well our model performs on the test that I did hold out. And what you end up getting is 10 different results, 10 different answers for how accurately our model worked. And oftentimes, you could just take the average of those 10 to get an approximation for how well we think our model performs overall. But the key idea is separating the training data from the testing data, because you want to test your model on data that is different from what you trained the model on. Because the training, you want to avoid overfitting. You want to be able to generalize. And the way you test whether you’re able to generalize is by looking at some data that you haven’t seen before and seeing how well we’re actually able to perform. And so if we want to actually implement any of these techniques inside of a programming language like Python, number of ways we could do that. We could write this from scratch on our own, but there are libraries out there that allow us to take advantage of existing implementations of these algorithms, that we can use the same types of algorithms in a lot of different situations. And so there’s a library, very popular one, known as Scikit-learn, which allows us in Python to be able to very quickly get set up with a lot of these different machine learning models. This library has already written an algorithm for nearest neighbor classification, for doing perceptron learning, for doing a bunch of other types of inference and supervised learning that we haven’t yet talked about. But using it, we can begin to try actually testing how these methods work and how accurately they perform. So let’s go ahead and take a look at one approach to trying to solve this type of problem. All right, so I’m first going to pull up banknotes.csv, which is a whole bunch of data provided by UC Irvine, which is information about various different banknotes that people took pictures of various different banknotes and measured various different properties of those banknotes. And in particular, some human categorized each of those banknotes as either a counterfeit banknote or as not counterfeit. And so what you’re looking at here is each row represents one banknote. This is formatted as a CSV spreadsheet, where just comma separated values separating each of these various different fields. We have four different input values for each of these data points, just information, some measurement that was made on the banknote. And what those measurements exactly are aren’t as important as the fact that we do have access to this data. But more importantly, we have access for each of these data points to a label, where 0 indicates something like this was not a counterfeit bill, meaning it was an authentic bill. And a data point labeled 1 means that it is a counterfeit bill, at least according to the human researcher who labeled this particular data. So we have a whole bunch of data representing a whole bunch of different data points, each of which has these various different measurements that were made on that particular bill, and each of which has an output value, 0 or 1, 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. And what we would like to do is use supervised learning to begin to predict or model some sort of function that can take these four values as input and predict what the output would be. We want our learning algorithm to find some sort of pattern that is able to predict based on these measurements, something that you could measure just by taking a photo of a bill, predict whether that bill is authentic or whether that bill is counterfeit. And so how can we do that? Well, I’m first going to open up banknote0.py and see how it is that we do this. I’m first importing a lot of things from Scikit-learn, but importantly, I’m going to set my model equal to the perceptron model, which is one of those models that we talked about before. We’re just going to try and figure out some setting of weights that is able to divide our data into two different groups. Then I’m going to go ahead and read data in for my file from banknotes.csv. And basically, for every row, I’m going to separate that row into the first four values of that row, which is the evidence for that row. And then the label, where if the final column in that row is a 0, the label is authentic. And otherwise, it’s going to be counterfeit. So I’m effectively reading data in from the CSV file, dividing into a whole bunch of rows where each row has some evidence, those four input values that are going to be inputs to my hypothesis function. And then the label, the output, whether it is authentic or counterfeit, that is the thing that I am then trying to predict. So the next step is that I would like to split up my data set into a training set and a testing set, some set of data that I would like to train my machine learning model on, and some set of data that I would like to use to test that model, see how well it performed. So what I’ll do is I’ll go ahead and figure out length of the data, how many data points do I have. I’ll go ahead and take half of them, save that number as a number called holdout. That is how many items I’m going to hold out for my data set to save for the testing phase. I’ll randomly shuffle the data so it’s in some random order. And then I’ll say my testing set will be all of the data up to the holdout. So I’ll take holdout many data items, and that will be my testing set. My training data will be everything else, the information that I’m going to train my model on. And then I’ll say I need to divide my training data into two different sets. I need to divide it into my x values, where x here represents the inputs. So the x values, the x values that I’m going to train on, are basically for every row in my training set, I’m going to get the evidence for that row, those four values, where it’s basically a vector of four numbers, where that is going to be all of the input. And then I need the y values. What are the outputs that I want to learn from, the labels that belong to each of these various different input points? Well, that’s going to be the same thing for each row in the training data. But this time, I take that row and get what its label is, whether it is authentic or counterfeit. So I end up with one list of all of these vectors of my input data, and one list, which follows the same order, but is all of the labels that correspond with each of those vectors. And then to train my model, which in this case is just this perceptron model, I just call model.fit, pass in the training data, and what the labels for those training data are. And scikit-learn will take care of fitting the model, will do the entire algorithm for me. And then when it’s done, I can then test to see how well that model performed. So I can say, let me get all of these input vectors for what I want to test on. So for each row in my testing data set, go ahead and get the evidence. And the y values, those are what the actual values were for each of the rows in the testing data set, what the actual label is. But then I’m going to generate some predictions. I’m going to use this model and try and predict, based on the testing vectors, I want to predict what the output is. And my goal then is to now compare y testing with predictions. I want to see how well my predictions, based on the model, actually reflect what the y values were, what the output is, that were actually labeled. Because I now have this label data, I can assess how well the algorithm worked. And so now I can just compute how well we did. I’m going to, this zip function basically just lets me look through two different lists, one by one at the same time. So for each actual value and for each predicted value, if the actual is the same thing as what I predicted, I’ll go ahead and increment the counter by one. Otherwise, I’ll increment my incorrect counter by one. And so at the end, I can print out, here are the results, here’s how many I got right, here’s how many I got wrong, and here was my overall accuracy, for example. So I can go ahead and run this. I can run python banknote0.py. And it’s going to train on half the data set and then test on half the data set. And here are the results for my perceptron model. In this case, it correctly was able to classify 679 bills as correctly either authentic or counterfeit and incorrectly classified seven of them for an overall accuracy of close to 99% accurate. So on this particular data set, using this perceptron model, we were able to predict very well what the output was going to be. And we can try different models, too, that scikit-learn makes it very easy just to swap out one model for another model. So instead of the perceptron model, I can use the support vector machine using the SVC, otherwise known as a support vector classifier, using a support vector machine to classify things into two different groups. And now see, all right, how well does this perform? And all right, this time, we were able to correctly predict 682 and incorrectly predicted four for accuracy of 99.4%. And we could even try the k-neighbors classifier as the model instead. And this takes a parameter, n neighbors, for how many neighbors do you want to look at? Let’s just look at one neighbor, the one nearest neighbor, and use that to predict. Go ahead and run this as well. And it looks like, based on the k-neighbors classifier, looking at just one neighbor, we were able to correctly classify 685 data points, incorrectly classified one. Maybe let’s try three neighbors instead, instead of just using one neighbor. Do more of a k-nearest neighbors approach, where I look at the three nearest neighbors and see how that performs. And that one, in this case, seems to have gotten 100% of all of the predictions correctly described as either authentic banknotes or as counterfeit banknotes. And we could run these experiments multiple times, because I’m randomly reorganizing the data every time. We’re technically training these on slightly different data sets. And so you might want to run multiple experiments to really see how well they’re actually going to perform. But in short, they all perform very well. And while some of them perform slightly better than others here, that might not always be the case for every data set. But you can begin to test now by very quickly putting together these machine learning models using Scikit-learn to be able to train on some training set and then test on some testing set as well. And this splitting up into training groups and testing groups and testing happens so often that Scikit-learn has functions built in for trying to do it. I did it all by hand just now. But if we take a look at banknotes one, we take advantage of some other features that exist in Scikit-learn, where we can really simplify a lot of our logic, that there is a function built into Scikit-learn called train test split, which will automatically split data into a training group and a testing group. I just have to say what proportion should be in the testing group, something like 0.5, half the data inside the testing group. Then I can fit the model on the training data, make the predictions on the testing data, and then just count up. And Scikit-learn has some nice methods for just counting up how many times our testing data match the predictions, how many times our testing data didn’t match the predictions. So very quickly, you can write programs with not all that many lines of code. It’s maybe like 40 lines of code to get through all of these predictions. And then as a result, see how well we’re able to do. So these types of libraries can allow us, without really knowing the implementation details of these algorithms, to be able to use the algorithms in a very practical way to be able to solve these types of problems. So that then was supervised learning, this task of given a whole set of data, some input output pairs, we would like to learn some function that maps those inputs to those outputs. But turns out there are other forms of learning as well. And another popular type of machine learning, especially nowadays, is known as reinforcement learning. And the idea of reinforcement learning is rather than just being given a whole data set at the beginning of input output pairs, reinforcement learning is all about learning from experience. In reinforcement learning, our agent, whether it’s like a physical robot that’s trying to make actions in the world or just some virtual agent that is a program running somewhere, our agent is going to be given a set of rewards or punishments in the form of numerical values. But you can think of them as reward or punishment. And based on that, it learns what actions to take in the future, that our agent, our AI, will be put in some sort of environment. It will make some actions. And based on the actions that it makes, it learns something. It either gets a reward when it does something well, it gets a punishment when it does something poorly, and it learns what to do or what not to do in the future based on those individual experiences. And so what this will often look like is it will often start with some agent, some AI, which might, again, be a physical robot, if you’re imagining a physical robot moving around, but it can also just be a program. And our agent is situated in their environment, where the environment is where they’re going to make their actions, and it’s what’s going to give them rewards or punishments for various actions that they’re in. So for example, the environment is going to start off by putting our agent inside of a state. Our agent has some state that, in a game, might be the state of the game that the agent is playing. In a world that the agent is exploring might be some position inside of a grid representing the world that they’re exploring. But the agent is in some sort of state. And in that state, the agent needs to choose to take an action. The agent likely has multiple actions they can choose from, but they pick an action. So they take an action in a particular state. And as a result of that, the agent will generally get two things in response as we model them. The agent gets a new state that they find themselves in. After being in this state, taking one action, they end up in some other state. And they’re also given some sort of numerical reward, positive meaning reward, meaning it was a good thing, negative generally meaning they did something bad, they received some sort of punishment. And that is all the information the agent has. It’s told what state it’s in. It makes some sort of action. And based on that, it ends up in another state. And it ends up getting some particular reward. And it needs to learn, based on that information, what actions to begin to take in the future. And so you could imagine generalizing this to a lot of different situations. This is oftentimes how you train if you’ve ever seen those robots that are now able to walk around the way humans do. It would be quite difficult to program the robot in exactly the right way to get it to walk the way humans do. You could instead train it through reinforcement learning, give it some sort of numerical reward every time it does something good, like take steps forward, and punish it every time it does something bad, like fall over, and then let the AI just learn based on that sequence of rewards, based on trying to take various different actions. You can begin to have the agent learn what to do in the future and what not to do. So in order to begin to formalize this, the first thing we need to do is formalize this notion of what we mean about states and actions and rewards, like what does this world look like? And oftentimes, we’ll formulate this world as what’s known as a Markov decision process, similar in spirit to Markov chains, which you might recall from before. But a Markov decision process is a model that we can use for decision making, for an agent trying to make decisions in its environment. And it’s a model that allows us to represent the various different states that an agent can be in, the various different actions that they can take, and also what the reward is for taking one action as opposed to another action. So what then does it actually look like? Well, if you recall a Markov chain from before, a Markov chain looked a little something like this, where we had a whole bunch of these individual states, and each state immediately transitioned to another state based on some probability distribution. We saw this in the context of the weather before, where if it was sunny, we said with some probability, it’ll be sunny the next day. With some other probability, it’ll be rainy, for example. But we could also imagine generalizing this. It’s not just sun and rain anymore. We just have these states, where one state leads to another state according to some probability distribution. But in this original model, there was no agent that had any control over this process. It was just entirely probability based, where with some probability, we moved to this next state. But maybe it’s going to be some other state with some other probability. What we’ll now have is the ability for the agent in this state to choose from a set of actions, where maybe instead of just one path forward, they have three different choices of actions that each lead up down different paths. And even this is a bit of an oversimplification, because in each of these states, you might imagine more branching points where there are more decisions that can be taken as well. So we’ve extended the Markov chain to say that from a state, you now have available action choices. And each of those actions might be associated with its own probability distribution of going to various different states. Then in addition, we’ll add another extension, where any time you move from a state, taking an action, going into this other state, we can associate a reward with that outcome, saying either r is positive, meaning some positive reward, or r is negative, meaning there was some sort of punishment. And this then is what we’ll consider to be a Markov decision process. That a Markov decision process has some initial set of states, of states in the world that we can be in. We have some set of actions that, given a state, I can say, what are the actions that are available to me in that state, an action that I can choose from? Then we have some transition model. The transition model before just said that, given my current state, what is the probability that I end up in that next state or this other state? The transition model now has effectively two things we’re conditioning on. We’re saying, given that I’m in this state and that I take this action, what’s the probability that I end up in this next state? Now maybe we live in a very deterministic world in this Markov decision process. We’re given a state and given an action. We know for sure what next state we’ll end up in. But maybe there’s some randomness in the world that when you take in a state and you take an action, you might not always end up in the exact same state. There might be some probabilities involved there as well. The Markov decision process can handle both of those possible cases. And then finally, we have a reward function, generally called r, that in this case says, what is the reward for being in this state, taking this action, and then getting to s prime this next state? So I’m in this original state. I take this action. I get to this next state. What is the reward for doing that process? And you can add up these rewards every time you take an action to get the total amount of rewards that an agent might get from interacting in a particular environment modeled using this Markov decision process. So what might this actually look like in practice? Well, let’s just create a little simulated world here where I have this agent that is just trying to navigate its way. This agent is this yellow dot here, like a robot in the world, trying to navigate its way through this grid. And ultimately, it’s trying to find its way to the goal. And if it gets to the green goal, then it’s going to get some sort of reward. But then we might also have some red squares that are places where you get some sort of punishment, some bad place where we don’t want the agent to go. And if it ends up in the red square, then our agent is going to get some sort of punishment as a result of that. But the agent originally doesn’t know all of these details. It doesn’t know that these states are associated with punishments. But maybe it does know that this state is associated with a reward. Maybe it doesn’t. But it just needs to sort of interact with the environment to try and figure out what to do and what not to do. So the first thing the agent might do is, given no additional information, if it doesn’t know what the punishments are, it doesn’t know where the rewards are, it just might try and take an action. And it takes an action and ends up realizing that it got some sort of punishment. And so what does it learn from that experience? Well, it might learn that when you’re in this state in the future, don’t take the action move to the right, that that is a bad action to take. That in the future, if you ever find yourself back in the state, don’t take this action of going to the right when you’re in this particular state, because that leads to punishment. That might be the intuition at least. And so you could try doing other actions. You move up, all right, that didn’t lead to any immediate rewards. Maybe try something else. Then maybe try something else. And all right, now you found that you got another punishment. And so you learn something from that experience. So the next time you do this whole process, you know that if you ever end up in this square, you shouldn’t take the down action, because being in this state and taking that action ultimately leads to some sort of punishment, a negative reward, in other words. And this process repeats. You might imagine just letting our agent explore the world, learning over time what states tend to correspond with poor actions, learning over time what states correspond with poor actions, until eventually, if it tries enough things randomly, it might find that eventually when you get to this state, if you take the up action in this state, it might find that you actually get a reward from that. And what it can learn from that is that if you’re in this state, you should take the up action, because that leads to a reward. And over time, you can also learn that if you’re in this state, you should take the left action, because that leads to this state that also lets you eventually get to the reward. So you begin to learn over time not only which actions are good in particular states, but also which actions are bad, such that once you know some sequence of good actions that leads you to some sort of reward, our agent can just follow those instructions, follow the experience that it has learned. We didn’t tell the agent what the goal was. We didn’t tell the agent where the punishments were. But the agent can begin to learn from this experience and learn to begin to perform these sorts of tasks better in the future. And so let’s now try to formalize this idea, formalize the idea that we would like to be able to learn in this state taking this action, is that a good thing or a bad thing? There are lots of different models for reinforcement learning. We’re just going to look at one of them today. And the one that we’re going to look at is a method known as Q-learning. And what Q-learning is all about is about learning a function, a function Q, that takes inputs S and A, where S is a state and A is an action that you take in that state. And what this Q function is going to do is it is going to estimate the value. How much reward will I get from taking this action in this state? Originally, we don’t know what this Q function should be. But over time, based on experience, based on trying things out and seeing what the result is, I would like to try and learn what Q of SA is for any particular state and any particular action that I might take in that state. So what is the approach? Well, the approach originally is we’ll start with Q SA equal to 0 for all states S and for all actions A. That initially, before I’ve ever started anything, before I’ve had any experiences, I don’t know the value of taking any action in any given state. So I’m going to assume that the value is just 0 all across the board. But then as I interact with the world, as I experience rewards or punishments, or maybe I go to a cell where I don’t get either reward or a punishment, I want to somehow update my estimate of Q SA. I want to continually update my estimate of Q SA based on the experiences and rewards and punishments that I’ve received, such that in the future, my knowledge of what actions are good and what states will be better. So when we take an action and receive some sort of reward, I want to estimate the new value of Q SA. And I estimate that based on a couple of different things. I estimate it based on the reward that I’m getting from taking this action and getting into the next state. But assuming the situation isn’t over, assuming there are still future actions that I might take as well, I also need to take into account the expected future rewards. That if you imagine an agent interacting with the environment, then sometimes you’ll take an action and get a reward, but then you can keep taking more actions and get more rewards, that these both are relevant, both the current reward I’m getting from this current step and also my future reward. And it might be the case that I’ll want to take a step that doesn’t immediately lead to a reward, because later on down the line, I know it will lead to more rewards as well. So there’s a balancing act between current rewards that the agent experiences and future rewards that the agent experiences as well. And then we need to update QSA. So we estimate the value of QSA based on the current reward and the expected future rewards. And then we need to update this Q function to take into account this new estimate. Now, we already, as we go through this process, we’ll already have an estimate for what we think the value is. Now we have a new estimate, and then somehow we need to combine these two estimates together, and we’ll look at more formal ways that we can actually begin to do that. So to actually show you what this formula looks like, here is the approach we’ll take with Q learning. We’re going to, again, start with Q of S and A being equal to 0 for all states. And then every time we take an action A in state S and observer reward R, we’re going to update our value, our estimate, for Q of SA. And the idea is that we’re going to figure out what the new value estimate is minus what our existing value estimate is. And so we have some preconceived notion for what the value is for taking this action in this state. Maybe our expectation is we currently think the value is 10. But then we’re going to estimate what we now think it’s going to be. Maybe the new value estimate is something like 20. So there’s a delta of 10 that our new value estimate is 10 points higher than what our current value estimate happens to be. And so we have a couple of options here. We need to decide how much we want to adjust our current expectation of what the value is of taking this action in this particular state. And what that difference is, how much we add or subtract from our existing notion of how much do we expect the value to be, is dependent on this parameter alpha, also called a learning rate. And alpha represents, in effect, how much we value new information compared to how much we value old information. An alpha value of 1 means we really value new information. But if we have a new estimate, then it doesn’t matter what our old estimate is. We’re only going to consider our new estimate because we always just want to take into consideration our new information. So the way that works is that if you imagine alpha being 1, well, then we’re taking the old value of QSA and then adding 1 times the new value minus the old value. And that just leaves us with the new value. So when alpha is 1, all we take into consideration is what our new estimate happens to be. But over time, as we go through a lot of experiences, we already have some existing information. We might have tried taking this action nine times already. And now we just tried it a 10th time. And we don’t only want to consider this 10th experience. I also want to consider the fact that my prior nine experiences, those were meaningful, too. And that’s data I don’t necessarily want to lose. And so this alpha controls that decision, controls how important is the new information. 0 would mean ignore all the new information. Just keep this Q value the same. 1 means replace the old information entirely with the new information. And somewhere in between, keep some sort of balance between these two values. We can put this equation a little bit more formally as well. The old value estimate is our old estimate for what the value is of taking this action in a particular state. That’s just Q of SNA. So we have it once here, and we’re going to add something to it. We’re going to add alpha times the new value estimate minus the old value estimate. But the old value estimate, we just look up by calling this Q function. And what then is the new value estimate? Based on this experience we have just taken, what is our new estimate for the value of taking this action in this particular state? Well, it’s going to be composed of two parts. It’s going to be composed of what reward did I just get from taking this action in this state. And then it’s going to be, what can I expect my future rewards to be from this point forward? So it’s going to be R, some reward I’m getting right now, plus whatever I estimate I’m going to get in the future. And how do I estimate what I’m going to get in the future? Well, it’s a bit of another call to this Q function. It’s going to be take the maximum across all possible actions I could take next and say, all right, of all of these possible actions I could take, which one is going to have the highest reward? And so this then looks a little bit complicated. This is going to be our notion for how we’re going to perform this kind of update. I have some estimate, some old estimate, for what the value is of taking this action in this state. And I’m going to update it based on new information that I experience some reward. I predict what my future reward is going to be. And using that I update what I estimate the reward will be for taking this action in this particular state. And there are other additions you might make to this algorithm as well. Sometimes it might not be the case that future rewards you want to wait equally to current rewards. Maybe you want an agent that values reward now over reward later. And so sometimes you can even add another term in here, some other parameter, where you discount future rewards and say future rewards are not as valuable as rewards immediately. That getting reward in the current time step is better than waiting a year and getting rewards later. But that’s something up to the programmer to decide what that parameter ought to be. But the big picture idea of this entire formula is to say that every time we experience some new reward, we take that into account. We update our estimate of how good is this action. And then in the future, we can make decisions based on that algorithm. Once we have some good estimate for every state and for every action, what the value is of taking that action, then we can do something like implement a greedy decision making policy. That if I am in a state and I want to know what action should I take in that state, well, then I consider for all of my possible actions, what is the value of QSA? What is my estimated value of taking that action in that state? And I will just pick the action that has the highest value after I evaluate that expression. So I pick the action that has the highest value. And based on that, that tells me what action I should take. At any given state that I’m in, I can just greedily say across all my actions, this action gives me the highest expected value. And so I’ll go ahead and choose that action as the action that I take as well. But there is a downside to this kind of approach. And then downside comes up in a situation like this, where we know that there is some solution that gets me to the reward. And our agent has been able to figure that out. But it might not necessarily be the best way or the fastest way. If the agent is allowed to explore a little bit more, it might find that it can get the reward faster by taking some other route instead, by going through this particular path that is a faster way to get to that ultimate goal. And maybe we would like for the agent to be able to figure that out as well. But if the agent always takes the actions that it knows to be best, well, when it gets to this particular square, it doesn’t know that this is a good action because it’s never really tried it. But it knows that going down eventually leads its way to this reward. So it might learn in the future that it should just always take this route and it’s never going to explore and go along that route instead. So in reinforcement learning, there is this tension between exploration and exploitation. And exploitation generally refers to using knowledge that the AI already has. The AI already knows that this is a move that leads to reward. So we’ll go ahead and use that move. And exploration is all about exploring other actions that we may not have explored as thoroughly before because maybe one of these actions, even if I don’t know anything about it, might lead to better rewards faster or to more rewards in the future. And so an agent that only ever exploits information and never explores might be able to get reward, but it might not maximize its rewards because it doesn’t know what other possibilities are out there, possibilities that we only know about by taking advantage of exploration. And so how can we try and address this? Well, one possible solution is known as the Epsilon greedy algorithm, where we set Epsilon equal to how often we want to just make a random move, where occasionally we will just make a random move in order to say, let’s try to explore and see what happens. And then the logic of the algorithm will be with probability 1 minus Epsilon, choose the estimated best move. In a greedy case, we’d always choose the best move. But in Epsilon greedy, we’re most of the time going to choose the best move or sometimes going to choose the best move. But sometimes with probability Epsilon, we’re going to choose a random move instead. So every time we’re faced with the ability to take an action, sometimes we’re going to choose the best move. Sometimes we’re just going to choose a random move. So this type of algorithm can be quite powerful in a reinforcement learning context by not always just choosing the best possible move right now, but sometimes, especially early on, allowing yourself to make random moves that allow you to explore various different possible states and actions more, and maybe over time, you might decrease your value of Epsilon. More and more often, choosing the best move after you’re more confident that you’ve explored what all of the possibilities actually are. So we can put this into practice. And one very common application of reinforcement learning is in game playing, that if you want to teach an agent how to play a game, you just let the agent play the game a whole bunch. And then the reward signal happens at the end of the game. When the game is over, if our AI won the game, it gets a reward of like 1, for example. And if it lost the game, it gets a reward of negative 1. And from that, it begins to learn what actions are good and what actions are bad. You don’t have to tell the AI what’s good and what’s bad, but the AI figures it out based on that reward. Winning the game is some signal, losing the game is some signal, and based on all of that, it begins to figure out what decisions it should actually make. So one very simple game, which you may have played before, is a game called Nim. And in the game of Nim, you’ve got a whole bunch of objects in a whole bunch of different piles, where here I’ve represented each pile as an individual row. So you’ve got one object in the first pile, three in the second pile, five in the third pile, seven in the fourth pile. And the game of Nim is a two player game where players take turns removing objects from piles. And the rule is that on any given turn, you were allowed to remove as many objects as you want from any one of these piles, any one of these rows. You have to remove at least one object, but you remove as many as you want from exactly one of the piles. And whoever takes the last object loses. So player one might remove four from this pile here. Player two might remove four from this pile here. So now we’ve got four piles left, one, three, one, and three. Player one might remove the entirety of the second pile. Player two, if they’re being strategic, might remove two from the third pile. Now we’ve got three piles left, each with one object left. Player one might remove one from one pile. Player two removes one from the other pile. And now player one is left with choosing this one object from the last pile, at which point player one loses the game. So fairly simple game. Piles of objects, any turn you choose how many objects to remove from a pile, whoever removes the last object loses. And this is the type of game you could encode into an AI fairly easily, because the states are really just four numbers. Every state is just how many objects in each of the four piles. And the actions are things like, how many am I going to remove from each one of these individual piles? And the reward happens at the end, that if you were the player that had to remove the last object, then you get some sort of punishment. But if you were not, and the other player had to remove the last object, well, then you get some sort of reward. So we could actually try and show a demonstration of this, that I’ve implemented an AI to play the game of Nim. All right, so here, what we’re going to do is create an AI as a result of training the AI on some number of games, that the AI is going to play against itself, where the idea is the AI will play games against itself, learn from each of those experiences, and learn what to do in the future. And then I, the human, will play against the AI. So initially, we’ll say train zero times, meaning we’re not going to let the AI play any practice games against itself in order to learn from its experiences. We’re just going to see how well it plays. And it looks like there are four piles. I can choose how many I remove from any one of the piles. So maybe from pile three, I will remove five objects, for example. So now, AI chose to take one item from pile zero. So I’m left with these piles now, for example. And so here, I could choose maybe to say, I would like to remove from pile two, I’ll remove all five of them, for example. And so AI chose to take two away from pile one. Now I’m left with one pile that has one object, one pile that has two objects. So from pile three, I will remove two objects. And now I’ve left the AI with no choice but to take that last one. And so the game is over, and I was able to win. But I did so because the AI was really just playing randomly. It didn’t have any prior experience that it was using in order to make these sorts of judgments. Now let me let the AI train itself on 10,000 games. I’m going to let the AI play 10,000 games of nim against itself. Every time it wins or loses, it’s going to learn from that experience and learn in the future what to do and what not to do. So here then, I’ll go ahead and run this again. And now you see the AI running through a whole bunch of training games, 10,000 training games against itself. And now it’s going to let me make these sorts of decisions. So now I’m going to play against the AI. Maybe I’ll remove one from pile three. And the AI took everything from pile three, so I’m left with three piles. I’ll go ahead and from pile two maybe remove three items. And the AI removes one item from pile zero. I’m left with two piles, each of which has two items in it. I’ll remove one from pile one, I guess. And the AI took two from pile two, leaving me with no choice but to take one away from pile one. So it seems like after playing 10,000 games of nim against itself, the AI has learned something about what states and what actions tend to be good and has begun to learn some sort of pattern for how to predict what actions are going to be good and what actions are going to be bad in any given state. So reinforcement learning can be a very powerful technique for achieving these sorts of game-playing agents, agents that are able to play a game well just by learning from experience, whether that’s playing against other people or by playing against itself and learning from those experiences as well. Now, nim is a bit of an easy game to use reinforcement learning for because there are so few states. There are only states that are as many as how many different objects are in each of these various different piles. You might imagine that it’s going to be harder if you think of a game like chess or games where there are many, many more states and many, many more actions that you can imagine taking, where it’s not going to be as easy to learn for every state and for every action what the value is going to be. So oftentimes in that case, we can’t necessarily learn exactly what the value is for every state and for every action, but we can approximate it. So much as we saw with minimax, so we could use a depth-limiting approach to stop calculating at a certain point in time, we can do a similar type of approximation known as function approximation in a reinforcement learning context where instead of learning a value of q for every state and every action, we just have some function that estimates what the value is for taking this action in this particular state that might be based on various different features of the state that the agent happens to be in, where you might have to choose what those features actually are. But you can begin to learn some patterns that generalize beyond one specific state and one specific action that you can begin to learn if certain features tend to be good things or bad things. Reinforcement learning can allow you, using a very similar mechanism, to generalize beyond one particular state and say, if this other state looks kind of like this state, then maybe the similar types of actions that worked in one state will also work in another state as well. And so this type of approach can be quite helpful as you begin to deal with reinforcement learning that exist in larger and larger state spaces where it’s just not feasible to explore all of the possible states that could actually exist. So there, then, are two of the main categories of reinforcement learning. Supervised learning, where you have labeled input and output pairs, and reinforcement learning, where an agent learns from rewards or punishments that it receives. The third major category of machine learning that we’ll just touch on briefly is known as unsupervised learning. And unsupervised learning happens when we have data without any additional feedback, without labels, that in the supervised learning case, all of our data had labels. We labeled the data point with whether that was a rainy day or not rainy day. And using those labels, we were able to infer what the pattern was. Or we labeled data as a counterfeit banknote or not a counterfeit. And using those labels, we were able to draw inferences and patterns to figure out what does a banknote look like versus not. In unsupervised learning, we don’t have any access to any of those labels. But we still would like to learn some of those patterns. And one of the tasks that you might want to perform in unsupervised learning is something like clustering, where clustering is just the task of, given some set of objects, organize it into distinct clusters, groups of objects that are similar to one another. And there’s lots of applications for clustering. It comes up in genetic research, where you might have a whole bunch of different genes and you want to cluster them into similar genes if you’re trying to analyze them across a population or across species. It comes up in an image if you want to take all the pixels of an image, cluster them into different parts of the image. Comes a lot up in market research if you want to divide your consumers into different groups so you know which groups to target with certain types of product advertisements, for example, and a number of other contexts as well in which clustering can be very applicable. One technique for clustering is an algorithm known as k-means clustering. And what k-means clustering is going to do is it is going to divide all of our data points into k different clusters. And it’s going to do so by repeating this process of assigning points to clusters and then moving around those clusters at centers. We’re going to define a cluster by its center, the middle of the cluster, and then assign points to that cluster based on which center is closest to that point. And I’ll show you an example of that now. Here, for example, I have a whole bunch of unlabeled data, just various data points that are in some sort of graphical space. And I would like to group them into various different clusters. But I don’t know how to do that originally. And let’s say I want to assign like three clusters to this group. And you have to choose how many clusters you want in k-means clustering that you could try multiple and see how well those values perform. But I’ll start just by randomly picking some places to put the centers of those clusters. Maybe I have a blue cluster, a red cluster, and a green cluster. And I’m going to start with the centers of those clusters just being in these three locations here. And what k-means clustering tells us to do is once I have the centers of the clusters, assign every point to a cluster based on which cluster center it is closest to. So we end up with something like this, where all of these points are closer to the blue cluster center than any other cluster center. All of these points here are closer to the green cluster center than any other cluster center. And then these two points plus these points over here, those are all closest to the red cluster center instead. So here then is one possible assignment of all these points to three different clusters. But it’s not great that it seems like in this red cluster, these points are kind of far apart. In this green cluster, these points are kind of far apart. It might not be my ideal choice of how I would cluster these various different data points. But k-means clustering is an iterative process that after I do this, there is a next step, which is that after I’ve assigned all of the points to the cluster center that it is nearest to, we are going to re-center the clusters, meaning take the cluster centers, these diamond shapes here, and move them to the middle, or the average, effectively, of all of the points that are in that cluster. So we’ll take this blue point, this blue center, and go ahead and move it to the middle or to the center of all of the points that were assigned to the blue cluster, moving it slightly to the right in this case. And we’ll do the same thing for red. We’ll move the cluster center to the middle of all of these points, weighted by how many points there are. There are more points over here, so the red center ends up moving a little bit further that way. And likewise, for the green center, there are many more points on this side of the green center. So the green center ends up being pulled a little bit further in this direction. So we re-center all of the clusters, and then we repeat the process. We go ahead and now reassign all of the points to the cluster center that they are now closest to. And now that we’ve moved around the cluster centers, these cluster assignments might change. That this point originally was closer to the red cluster center, but now it’s actually closer to the blue cluster center. Same goes for this point as well. And these three points that were originally closer to the green cluster center are now closer to the red cluster center instead. So we can reassign what colors or which clusters each of these data points belongs to, and then repeat the process again, moving each of these cluster means and the middles of the clusterism to the mean, the average, of all of the other points that happen to be there, and repeat the process again. Go ahead and assign each of the points to the cluster that they are closest to. So once we reach a point where we’ve assigned all the points to clusters to the cluster that they are nearest to, and nothing changed, we’ve reached a sort of equilibrium in this situation, where no points are changing their allegiance. And as a result, we can declare this algorithm is now over. And we now have some assignment of each of these points into three different clusters. And it looks like we did a pretty good job of trying to identify which points are more similar to one another than they are to points in other groups. So we have the green cluster down here, this blue cluster here, and then this red cluster over there as well. And we did so without any access to some labels to tell us what these various different clusters were. We just used an algorithm in an unsupervised sense without any of those labels to figure out which points belonged to which categories. And again, lots of applications for this type of clustering technique. And there are many more algorithms in each of these various different fields within machine learning, supervised and reinforcement and unsupervised. But those are many of the big picture foundational ideas that underlie a lot of these techniques, where these are the problems that we’re trying to solve. And we try and solve those problems using a number of different methods of trying to take data and learn patterns in that data, whether that’s trying to find neighboring data points that are similar or trying to minimize some sort of loss function or any number of other techniques that allow us to begin to try to solve these sorts of problems. That then was a look at some of the principles that are at the foundation of modern machine learning, this ability to take data and learn from that data so that the computer can perform a task even if they haven’t explicitly been given instructions in order to do so. Next time, we’ll continue this conversation about machine learning, looking at other techniques we can use for solving these sorts of problems. We’ll see you then. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Now, last time, we took a look at machine learning, a set of techniques that computers can use in order to take a set of data and learn some patterns inside of that data, learn how to perform a task even if we the programmers didn’t give the computer explicit instructions for how to perform that task. Today, we transition to one of the most popular techniques and tools within machine learning, that of neural networks. And neural networks were inspired as early as the 1940s by researchers who were thinking about how it is that humans learn, studying neuroscience in the human brain and trying to see whether or not we could apply those same ideas to computers as well and model computer learning off of human learning. So how is the brain structured? Well, very simply put, the brain consists of a whole bunch of neurons. And those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network, something like this, there are a couple of key properties that scientists observed. One was that these neurons are connected to each other and receive electrical signals from one another, that one neuron can propagate electrical signals to another neuron. And another point is that neurons process those input signals and then can be activated, that a neuron becomes activated at a certain point and then can propagate further signals onto neurons in the future. And so the question then became, could we take this biological idea of how it is that humans learn with brains and with neurons and apply that to a machine as well, in effect designing an artificial neural network, or an ANN, which will be a mathematical model for learning that is inspired by these biological neural networks? And what artificial neural networks will allow us to do is they will first be able to model some sort of mathematical function. Every time you look at a neural network, which we’ll see more of later today, each one of them is really just some mathematical function that is mapping certain inputs to particular outputs based on the structure of the network, that depending on where we place particular units inside of this neural network, that’s going to determine how it is that the network is going to function. And in particular, artificial neural networks are going to lend themselves to a way that we can learn what the network’s parameters should be. We’ll see more on that in just a moment. But in effect, we want a model such that it is easy for us to be able to write some code that allows for the network to be able to figure out how to model the right mathematical function given a particular set of input data. So in order to create our artificial neural network, instead of using biological neurons, we’re just going to use what we’re going to call units, units inside of a neural network, which we can represent kind of like a node in a graph, which will here be represented just by a blue circle like this. And these artificial units, these artificial neurons, can be connected to one another. So here, for instance, we have two units that are connected by this edge inside of this graph, effectively. And so what we’re going to do now is think of this idea as some sort of mapping from inputs to outputs. So we have one unit that is connected to another unit that we might think of this side of the input and that side of the output. And what we’re trying to do then is to figure out how to solve a problem, how to model some sort of mathematical function. And this might take the form of something we saw last time, which was something like we have certain inputs, like variables x1 and x2. And given those inputs, we want to perform some sort of task, a task like predicting whether or not it’s going to rain. And ideally, we’d like some way, given these inputs, x1 and x2, which stand for some sort of variables to do with the weather, we would like to be able to predict, in this case, a Boolean classification. Is it going to rain, or is it not going to rain? And we did this last time by way of a mathematical function. We defined some function, h, for our hypothesis function, that took as input x1 and x2, the two inputs that we cared about processing, in order to determine whether we thought it was going to rain or whether we thought it was not going to rain. The question then becomes, what does this hypothesis function do in order to make that determination? And we decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this. Weight 0 plus weight 1 times x1 plus weight 2 times x2. So what’s going on here is that x1 and x2, those are input variables, the inputs to this hypothesis function. And each of those input variables is being multiplied by some weight, which is just some number. So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2. And we have this additional weight, weight 0, that doesn’t get multiplied by an input variable at all, that just serves to either move the function up or move the function’s value down. You can think of this as either a weight that’s just multiplied by some dummy value, like the number 1. It’s multiplied by 1, and so it’s not multiplied by anything. Or sometimes, you’ll see in the literature, people call this variable weight 0 a bias, so that you can think of these variables as slightly different. We have weights that are multiplied by the input, and we separately add some bias to the result as well. You’ll hear both of those terminologies used when people talk about neural networks and machine learning. So in effect, what we’ve done here is that in order to define a hypothesis function, we just need to decide and figure out what these weights should be to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, what we need to do is make some sort of classification, like rainy or not rainy. And to do that, we use some sort of function that defines some sort of threshold. And so we saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0, otherwise it’s 0. And you can think of this line down the middle as kind of like a dotted line. Effectively, it stays at 0 all the way up to one point, and then the function steps or jumps up to 1. So it’s 0 before it reaches some threshold, and then it’s 1 after it reaches a particular threshold. And so this was one way we could define what will come to call an activation function, a function that determines when it is that this output becomes active, changes to 1 instead of being a 0. But we also saw that if we didn’t just want a purely binary classification, we didn’t want purely 1 or 0, but we wanted to allow for some in-between real numbered values, we could use a different function. And there are a number of choices, but the one that we looked at was the logistic sigmoid function that has sort of an s-shaped curve, where we could represent this as a probability that may be somewhere in between the probability of rain or something like 0.5. Maybe a little bit later, the probability of rain is 0.8. And so rather than just have a binary classification of 0 or 1, we could allow for numbers that are in between as well. And it turns out there are many other different types of activation functions, where an activation function just takes the output of multiplying the weights together and adding that bias, and then figuring out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU. And the way that works is that it just takes its input and takes the maximum of that input and 0. So if it’s positive, it remains unchanged. But if it’s 0, if it’s negative, it goes ahead and levels out at 0. And there are other activation functions that we could choose as well. But in short, each of these activation functions, you can just think of as a function that gets applied to the result of all of this computation. We take some function g and apply it to the result of all of that calculation. And this then is what we saw last time, the way of defining some hypothesis function that takes in inputs, calculate some linear combination of those inputs, and then passes it through some sort of activation function to get our output. And this actually turns out to be the model for the simplest of neural networks, that we’re going to instead represent this mathematical idea graphically by using a structure like this. Here then is a neural network that has two inputs. We can think of this as x1 and this as x2. And then one output, which you can think of as classifying whether or not we think it’s going to rain or not rain, for example, in this particular instance. And so how exactly does this model work? Well, each of these two inputs represents one of our input variables, x1 and x2. And notice that these inputs are connected to this output via these edges, which are going to be defined by their weights. So these edges each have a weight associated with them, weight 1 and weight 2. And then this output unit, what it’s going to do is it is going to calculate an output based on those inputs and based on those weights. This output unit is going to multiply all the inputs by their weights, add in this bias term, which you can think of as an extra w0 term that gets added into it, and then we pass it through an activation function. So this then is just a graphical way of representing the same idea we saw last time just mathematically. And we’re going to call this a very simple neural network. And we’d like for this neural network to be able to learn how to calculate some function, that we want some function for the neural network to learn. And the neural network is going to learn what should the values of w0, w1, and w2 be? What should the activation function be in order to get the result that we would expect? So we can actually take a look at an example of this. What then is a very simple function that we might calculate? Well, if we recall back from when we were looking at propositional logic, one of the simplest functions we looked at was something like the or function that takes two inputs, x and y, and outputs 1, otherwise known as true, if either one of the inputs or both of them are 1, and outputs of 0 if both of the inputs are 0 or false. So this then is the or function. And this was the truth table for the or function, that as long as either of the inputs are 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0. So the question is, how could we take this and train a neural network to be able to learn this particular function? What would those weights look like? Well, we could do something like this. Here’s our neural network. And I’ll propose that in order to calculate the or function, we’re going to use a value of 1 for each of the weights. And we’ll use a bias of negative 1. And then we’ll just use this step function as our activation function. How then does this work? Well, if I wanted to calculate something like 0 or 0, which we know to be 0 because false or false is false, then what are we going to do? Well, our output unit is going to calculate this input multiplied by the weight, 0 times 1, that’s 0. Same thing here, 0 times 1, that’s 0. And we’ll add to that the bias minus 1. So that’ll give us a result of negative 1. If we plot that on our activation function, negative 1 is here. It’s before the threshold, which means either 0 or 1. It’s only 1 after the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0. And that’s what we would expect it to be, that 0 or 0 should be 0. What if instead we had had 1 or 0, where this is the number 1? Well, in this case, in order to calculate what the output is going to be, we again have to do this weighted sum, 1 times 1, that’s 1. 0 times 1, that’s 0. Sum of that so far is 1. Add negative 1 to that. Well, then the output is 0. And if we plot 0 on the step function, 0 ends up being here. It’s just at the threshold. And so the output here is going to be 1, because the output of 1 or 0, that’s 1. So that’s what we would expect as well. And just for one more example, if I had 1 or 1, what would the result be? Well, 1 times 1 is 1. 1 times 1 is 1. The sum of those is 2. I add the bias term to that. I get the number 1. 1 plotted on this graph is way over there. That’s well beyond the threshold. And so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we’re past the threshold. And this neural network then models the OR function, a very simple function, definitely. But it still is able to model it correctly. If I give it the inputs, it will tell me what x1 or x2 happens to be. And you could imagine trying to do this for other functions as well. A function like the AND function, for instance, that takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x and y is 1. But in all the other cases, the output is 0. How could we model that inside of a neural network as well? Well, it turns out we could do it in the same way, except instead of negative 1 as the bias, we can use negative 2 as the bias instead. What does that end up looking like? Well, if I had 1 and 1, that should be 1, because 1 true and true is equal to true. Well, I take 1 times 1, that’s 1. 1 times 1 is 1. I get a total sum of 2 so far. Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is just past that threshold, and so the output is going to be 1. But if I had any other input, for example, like 1 and 0, well, the weighted sum of these is 1 plus 0 is going to be 1. Minus 2 is going to give us negative 1, and negative 1 is not past that threshold, and so the output is going to be 0. So those then are some very simple functions that we can model using a neural network that has two inputs and one output, where our goal is to be able to figure out what those weights should be in order to determine what the output should be. And you could imagine generalizing this to calculate more complex functions as well, that maybe, given the humidity and the pressure, we want to calculate what’s the probability that it’s going to rain, for example. Or we might want to do a regression-style problem. We’re given some amount of advertising, and given what month it is maybe, we want to predict what our expected sales are going to be for that particular month. So you could imagine these inputs and outputs being different as well. And it turns out that in some problems, we’re not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together, make our networks more complex just by adding more units into this particular neural network. So the network we’ve been looking at has two inputs and one output. But we could just as easily say, let’s go ahead and have three inputs in there, or have even more inputs, where we could arbitrarily decide however many inputs there are to our problem, all going to be calculating some sort of output that we care about figuring out the value of. How then does the math work for figuring out that output? Well, it’s going to work in a very similar way. In the case of two inputs, we had two weights indicated by these edges, and we multiplied the weights by the numbers, adding this bias term. And we’ll do the same thing in the other cases as well. If I have three inputs, you’ll imagine multiplying each of these three inputs by each of these weights. If I had five inputs instead, we’re going to do the same thing. Here I’m saying sum up from 1 to 5, xi multiplied by weight i. So take each of the five input variables, multiply them by their corresponding weight, and then add the bias to that. So this would be a case where there are five inputs into this neural network, for example. But there could be more, arbitrarily many nodes that we want inside of this neural network, where each time we’re just going to sum up all of those input variables multiplied by their weight and then add the bias term at the very end. And so this allows us to be able to represent problems that have even more inputs just by growing the size of our neural network. Now, the next question we might ask is a question about how it is that we train these neural networks. In the case of the or function and the and function, they were simple enough functions that I could just tell you, like here, what the weights should be. And you could probably reason through it yourself what the weights should be in order to calculate the output that you want. But in general, with functions like predicting sales or predicting whether or not it’s going to rain, these are much trickier functions to be able to figure out. We would like the computer to have some mechanism of calculating what it is that the weights should be, how it is to set the weights so that our neural network is able to accurately model the function that we care about trying to estimate. And it turns out that the strategy for doing this, inspired by the domain of calculus, is a technique called gradient descent. And what gradient descent is, it is an algorithm for minimizing loss when you’re training a neural network. And recall that loss refers to how bad our hypothesis function happens to be, that we can define certain loss functions. And we saw some examples of loss functions last time that just give us a number for any particular hypothesis, saying, how poorly does it model the data? How many examples does it get wrong? How are they worse or less bad as compared to other hypothesis functions that we might define? And this loss function is just a mathematical function. And when you have a mathematical function, in calculus what you could do is calculate something known as the gradient, which you can think of as like a slope. It’s the direction the loss function is moving at any particular point. And what it’s going to tell us is, in which direction should we be moving these weights in order to minimize the amount of loss? And so generally speaking, we won’t get into the calculus of it. But the high level idea for gradient descent is going to look something like this. If we want to train a neural network, we’ll go ahead and start just by choosing the weights randomly. Just pick random weights for all of the weights in the neural network. And then we’ll use the input data that we have access to in order to train the network, in order to figure out what the weights should actually be. So we’ll repeat this process again and again. The first step is we’re going to calculate the gradient based on all of the data points. So we’ll look at all the data and figure out what the gradient is at the place where we currently are for the current setting of the weights, which means in which direction should we move the weights in order to minimize the total amount of loss, in order to make our solution better. And once we’ve calculated that gradient, which direction we should move in the loss function, well, then we can just update those weights according to the gradient. Take a small step in the direction of those weights in order to try to make our solution a little bit better. And the size of the step that we take, that’s going to vary. And you can choose that when you’re training a particular neural network. But in short, the idea is going to be take all the data points, figure out based on those data points in what direction the weights should move, and then move the weights one small step in that direction. And if you repeat that process over and over again, adjusting the weights a little bit at a time based on all the data points, eventually you should end up with a pretty good solution to trying to solve this sort of problem. At least that’s what we would hope to happen. Now, if you look at this algorithm, a good question to ask anytime you’re analyzing an algorithm is what is going to be the expensive part of doing the calculation? What’s going to take a lot of work to try to figure out? What is going to be expensive to calculate? And in particular, in the case of gradient descent, the really expensive part is this all data points part right here, having to take all of the data points and using all of those data points figure out what the gradient is at this particular setting of all of the weights. Because odds are in a big machine learning problem where you’re trying to solve a big problem with a lot of data, you have a lot of data points in order to calculate. And figuring out the gradient based on all of those data points is going to be expensive. And you’ll have to do it many times. You’ll likely repeat this process again and again and again, going through all the data points, taking one small step over and over as you try and figure out what the optimal setting of those weights happens to be. It turns out that we would ideally like to be able to train our neural networks faster, to be able to more quickly converge to some sort of solution that is going to be a good solution to the problem. So in that case, there are alternatives to just standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which will randomly just choose one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points. So the idea there is that we have some setting of the weights. We pick a data point. And based on that one data point, we figure out in which direction should we move all of the weights and move the weights in that small direction, then take another data point and do that again and repeat this process again and again, maybe looking at each of the data points multiple times, but each time only using one data point to calculate the gradient, to calculate which direction we should move in. Now, just using one data point instead of all of the data points probably gives us a less accurate estimate of what the gradient actually is. But on the plus side, it’s going to be much faster to be able to calculate, that we can much more quickly calculate what the gradient is based on one data point, instead of calculating based on all of the data points and having to do all of that computational work again and again. So there are trade-offs here between looking at all of the data points and just looking at one data point. And it turns out that a middle ground that is also quite popular is a technique called mini-batch gradient descent, where the idea there is instead of looking at all of the data versus just a single point, we instead divide our data set up into small batches, groups of data points, where you can decide how big a particular batch is. But in short, you’re just going to look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient, but also not requiring all of the computational effort needed to look at every single one of these data points. So gradient descent, then, is this technique that we can use in order to train these neural networks, in order to figure out what the setting of all of these weights should be if we want some way to try and get an accurate notion of how it is that this function should work, some way of modeling how to transform the inputs into particular outputs. Now, so far, the networks that we’ve taken a look at have all been structured similar to this. We have some number of inputs, maybe two or three or five or more. And then we have one output that is just predicting like rain or no rain or just predicting one particular value. But often in machine learning problems, we don’t just care about one output. We might care about an output that has multiple different values associated with it. So in the same way that we could take a neural network and add units to the input layer, we can likewise add inputs or add outputs to the output layer as well. Instead of just one output, you could imagine we have two outputs, or we could have four outputs, for example, where in each case, as we add more inputs or add more outputs, if we want to keep this network fully connected between these two layers, we just need to add more weights, that now each of these input nodes has four weights associated with each of the four outputs. And that’s true for each of these various different input nodes. So as we add nodes, we add more weights in order to make sure that each of the inputs can somehow be connected to each of the outputs so that each output value can be calculated based on what the value of the input happens to be. So what might a case be where we want multiple different output values? Well, you might consider that in the case of weather predicting, for example, we might not just care whether it’s raining or not raining. There might be multiple different categories of weather that we would like to categorize the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, for instance, 1 or 0. But it doesn’t allow us to do much more than that. With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four different categories, something like is it going to be raining or sunny or cloudy or snowy. And I now have four output variables that can be used to represent maybe the probability that it is rainy as opposed to sunny as opposed to cloudy or as opposed to snowy. How then would this neural network work? Well, we have some input variables that represent some data that we have collected about the weather. Each of those inputs gets multiplied by each of these various different weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform. And then what we get is after passing them through some sort of activation function in the outputs, we end up getting some sort of number, where that number, you might imagine, you could interpret as a probability, like a probability that it is one category as opposed to another category. So here we’re saying that based on the inputs, we think there is a 10% chance that it’s raining, a 60% chance that it’s sunny, a 20% chance of cloudy, a 10% chance that it’s snowy. And given that output, if these represent a probability distribution, well, then you could just pick whichever one has the highest value, in this case, sunny, and say that, well, most likely, we think that this categorization of inputs means that the output should be snowy or should be sunny. And that is what we would expect the weather to be in this particular instance. And so this allows us to do these sort of multi-class classifications, where instead of just having a binary classification, 1 or 0, we can have as many different categories as we want. And we can have our neural network output these probabilities over which categories are more likely than other categories. And using that data, we’re able to draw some sort of inference on what it is that we should do. So this was sort of the idea of supervised machine learning. I can give this neural network a whole bunch of data, a whole bunch of input data corresponding to some label, some output data, like we know that it was raining on this day, we know that it was sunny on that day. And using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be in order to create some sort of model that hopefully allows us a way to predict what we think the weather is going to be. But neural networks have a lot of other applications as well. You could imagine applying the same sort of idea to a reinforcement learning sort of example as well, where you remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take, depending on what state they currently happen to be in. So depending on the current state of the world, we wanted the agent to pick from one of the available actions that is available to them. And you might model that by having each of these input variables represent some information about the state, some data about what state our agent is currently in. And then the output, for example, could be each of the various different actions that our agent could take, action 1, 2, 3, and 4. And you might imagine that this network would work in the same way, but based on these particular inputs, we go ahead and calculate values for each of these outputs. And those outputs could model which action is better than other actions. And we could just choose, based on looking at those outputs, which action we should take. And so these neural networks are very broadly applicable, that all they’re really doing is modeling some mathematical function. So anything that we can frame as a mathematical function, something like classifying inputs into various different categories or figuring out based on some input state what action we should take, these are all mathematical functions that we could attempt to model by taking advantage of this neural network structure, and in particular, taking advantage of this technique, gradient descent, that we can use in order to figure out what the weights should be in order to do this sort of calculation. Now, how is it that you would go about training a neural network that has multiple outputs instead of just one? Well, with just a single output, we could see what the output for that value should be, and then you update all of the weights that corresponded to it. And when we have multiple outputs, at least in this particular case, we can really think of this as four separate neural networks, that really we just have one network here that has these three inputs corresponding with these three weights corresponding to this one output value. And the same thing is true for this output value. This output value effectively defines yet another neural network that has these same three inputs, but a different set of weights that correspond to this output. And likewise, this output has its own set of weights as well, and same thing for the fourth output too. And so if you wanted to train a neural network that had four outputs instead of just one, in this case where the inputs are directly connected to the outputs, you could really think of this as just training four independent neural networks. We know what the outputs for each of these four should be based on our input data, and using that data, we can begin to figure out what all of these individual weights should be. And maybe there’s an additional step at the end to make sure that we turn these values into a probability distribution such that we can interpret which one is better than another or more likely than another as a category or something like that. So this then seems like it does a pretty good job of taking inputs and trying to predict what outputs should be. And we’ll see some real examples of this in just a moment as well. But it’s important then to think about what the limitations of this sort of approach is, of just taking some linear combination of inputs and passing it into some sort of activation function. And it turns out that when we do this in the case of binary classification, trying to predict does it belong to one category or another, we can only predict things that are linearly separable. Because we’re taking a linear combination of inputs and using that to define some decision boundary or threshold, then what we get is a situation where if we have this set of data, we can predict a line that separates linearly the red points from the blue points, but a single unit that is making a binary classification, otherwise known as a perceptron, can’t deal with a situation like this, where we’ve seen this type of situation before, where there is no straight line that just goes straight through the data that will divide the red points away from the blue points. It’s a more complex decision boundary. The decision boundary somehow needs to capture the things inside of this circle. And there isn’t really a line that will allow us to deal with that. So this is the limitation of the perceptron, these units that just make these binary decisions based on their inputs, that a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line. And sure, it can give us probabilities based on how close to that decision boundary we are, but it can only really decide based on a linear decision boundary. And so this doesn’t seem like it’s going to generalize well to situations where real world data is involved, because real world data often isn’t linearly separable. It often isn’t the case that we can just draw a line through the data and be able to divide it up into multiple groups. So what then is the solution to this? Well, what was proposed was the idea of a multilayer neural network, that so far all of the neural networks we’ve seen have had a set of inputs and a set of outputs, and the inputs are connected to those outputs. But in a multilayer neural network, this is going to be an artificial neural network that has an input layer still. It has an output layer, but also has one or more hidden layers in between. Other layers of artificial neurons or units that are going to calculate their own values as well. So instead of a neural network that looks like this with three inputs and one output, you might imagine in the middle here injecting a hidden layer, something like this. This is a hidden layer that has four nodes. You could choose how many nodes or units end up going into the hidden layer. You can have multiple hidden layers as well. And so now each of these inputs isn’t directly connected to the output. Each of the inputs is connected to this hidden layer. And then all of the nodes in the hidden layer, those are connected to the one output. And so this is just another step that we can take towards calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs. And once we have values for all of these nodes, as opposed to this just being the output, we do the same thing again. Calculate the output for this node based on multiplying each of the values for these units by their weights as well. So in effect, the way this works is that we start with inputs. They get multiplied by weights in order to calculate values for the hidden nodes. Those get multiplied by weights in order to figure out what the ultimate output is going to be. And the advantage of layering things like this is it gives us an ability to model more complex functions, that instead of just having a single decision boundary, a single line dividing the red points from the blue points, each of these hidden nodes can learn a different decision boundary. And we can combine those decision boundaries to figure out what the ultimate output is going to be. And as we begin to imagine more complex situations, you could imagine each of these nodes learning some useful property or learning some useful feature of all of the inputs and us somehow learning how to combine those features together in order to get the output that we actually want. Now, the natural question when we begin to look at this now is to ask the question of, how do we train a neural network that has hidden layers inside of it? And this turns out to initially be a bit of a tricky question, because the input data that we are given is we are given values for all of the inputs, and we’re given what the value of the output should be, what the category is, for example. But the input data doesn’t tell us what the values for all of these nodes should be. So we don’t know how far off each of these nodes actually is because we’re only given data for the inputs and the outputs. The reason this is called the hidden layer is because the data that is made available to us doesn’t tell us what the values for all of these intermediate nodes should actually be. And so the strategy people came up with was to say that if you know what the error or the losses on the output node, well, then based on what these weights are, if one of these weights is higher than another, you can calculate an estimate for how much the error from this node was due to this part of the hidden node, or this part of the hidden layer, or this part of the hidden layer, based on the values of these weights, in effect saying that based on the error from the output, I can back propagate the error and figure out an estimate for what the error is for each of these nodes in the hidden layer as well. And there’s some more calculus here that we won’t get into the details of, but the idea of this algorithm is known as back propagation. It’s an algorithm for training a neural network with multiple different hidden layers. And the idea for this, the pseudocode for it, will again be if we want to run gradient descent with back propagation. We’ll start with a random choice of weights, as we did before. And now we’ll go ahead and repeat the training process again and again. But what we’re going to do each time is now we’re going to calculate the error for the output layer first. We know the output and what it should be, and we know what we calculated so we can figure out what the error there is. But then we’re going to repeat for every layer, starting with the output layer, moving back into the hidden layer, then the hidden layer before that if there are multiple hidden layers, going back all the way to the very first hidden layer, assuming there are multiple, we’re going to propagate the error back one layer. Whatever the error was from the output, figure out what the error should be a layer before that based on what the values of those weights are. And then we can update those weights. So graphically, the way you might think about this is that we first start with the output. We know what the output should be. We know what output we calculated. And based on that, we can figure out, all right, how do we need to update those weights? Backpropagating the error to these nodes. And using that, we can figure out how we should update these weights. And you might imagine if there are multiple layers, we could repeat this process again and again to begin to figure out how all of these weights should be updated. And this backpropagation algorithm is really the key algorithm that makes neural networks possible. It makes it possible to take these multi-level structures and be able to train those structures depending on what the values of these weights are in order to figure out how it is that we should go about updating those weights in order to create some function that is able to minimize the total amount of loss, to figure out some good setting of the weights that will take the inputs and translate it into the output that we expect. And this works, as we said, not just for a single hidden layer. But you can imagine multiple hidden layers, where each hidden layer we just define however many nodes we want, where each of the nodes in one layer, we can connect to the nodes in the next layer, defining more and more complex networks that are able to model more and more complex types of functions. And so this type of network is what we might call a deep neural network, part of a larger family of deep learning algorithms, if you’ve ever heard that term. And all deep learning is about is it’s using multiple layers to be able to predict and be able to model higher level features inside of the input, to be able to figure out what the output should be. And so a deep neural network is just a neural network that has multiple of these hidden layers, where we start at the input, calculate values for this layer, then this layer, then this layer, and then ultimately get an output. And this allows us to be able to model more and more sophisticated types of functions, that each of these layers can calculate something a little bit different, and we can combine that information to figure out what the output should be. Of course, as with any situation of machine learning, as we begin to make our models more and more complex, to model more and more complex functions, the risk we run is something like overfitting. And we talked about overfitting last time in the context of overfitting based on when we were training our models to be able to learn some sort of decision boundary, where overfitting happens when we fit too closely to the training data. And as a result, we don’t generalize well to other situations as well. And one of the risks we run with a far more complex neural network that has many, many different nodes is that we might overfit based on the input data. We might grow over reliant on certain nodes to calculate things just purely based on the input data that doesn’t allow us to generalize very well to the output. And there are a number of strategies for dealing with overfitting. But one of the most popular in the context of neural networks is a technique known as dropout. And what dropout does is it, when we’re training the neural network, what we’ll do in dropout is temporarily remove units, temporarily remove these artificial neurons from our network chosen at random. And the goal here is to prevent over-reliance on certain units. What generally happens in overfitting is that we begin to over-rely on certain units inside the neural network to be able to tell us how to interpret the input data. What dropout will do is randomly remove some of these units in order to reduce the chance that we over-rely on certain units to make our neural network more robust, to be able to handle the situations even when we just drop out particular neurons entirely. So the way that might work is we have a network like this. And as we’re training it, when we go about trying to update the weights the first time, we’ll just randomly pick some percentage of the nodes to drop out of the network. It’s as if those nodes aren’t there at all. It’s as if the weights associated with those nodes aren’t there at all. And we’ll train it this way. Then the next time we update the weights, we’ll pick a different set and just go ahead and train that way. And then again, randomly choose and train with other nodes that have been dropped out as well. And the goal of that is that after the training process, if you train by dropping out random nodes inside of this neural network, you hopefully end up with a network that’s a little bit more robust, that doesn’t rely too heavily on any one particular node, but more generally learns how to approximate a function in general. So that then is a look at some of these techniques that we can use in order to implement a neural network, to get at the idea of taking this input, passing it through these various different layers in order to produce some sort of output. And what we’d like to do now is take those ideas and put them into code. And to do that, there are a number of different machine learning libraries, neural network libraries that we can use that allow us to get access to someone’s implementation of back propagation and all of these hidden layers. And one of the most popular, developed by Google, is known as TensorFlow, a library that we can use for quickly creating neural networks and modeling them and running them on some sample data to see what the output is going to be. And before we actually start writing code, we’ll go ahead and take a look at TensorFlow’s playground, which will be an opportunity for us just to play around with this idea of neural networks in different layers, just to get a sense for what it is that we can do by taking advantage of neural networks. So let’s go ahead and go into TensorFlow’s playground, which you can go to by visiting that URL from before. And what we’re going to do now is we’re going to try and learn the decision boundary for this particular output. I want to learn to separate the orange points from the blue points. And I’d like to learn some sort of setting of weights inside of a neural network that will be able to separate those from each other. The features we have access to, our input data, are the x value and the y value, so the two values along each of the two axes. And what I’ll do now is I can set particular parameters, like what activation function I would like to use. And I’ll just go ahead and press play and see what happens. And what happens here is that you’ll see that just by using these two input features, the x value and the y value, with no hidden layers, just take the input, x and y values, and figure out what the decision boundary is. Our neural network learns pretty quickly that in order to divide these two points, we should just use this line. This line acts as a decision boundary that separates this group of points from that group of points, and it does it very well. You can see up here what the loss is. The training loss is 0, meaning we were able to perfectly model separating these two points from each other inside of our training data. So this was a fairly simple case of trying to apply a neural network because the data is very clean. It’s very nicely linearly separable. We could just draw a line that separates all of those points from each other. Let’s now consider a more complex case. So I’ll go ahead and pause the simulation, and we’ll go ahead and look at this data set here. This data set is a little bit more complex now. In this data set, we still have blue and orange points that we’d like to separate from each other. But there’s no single line that we can draw that is going to be able to figure out how to separate the blue from the orange, because the blue is located in these two quadrants, and the orange is located here and here. It’s a more complex function to be able to learn. So let’s see what happens. If we just try and predict based on those inputs, the x and y coordinates, what the output should be, I’ll press Play. And what you’ll notice is that we’re not really able to draw much of a conclusion, that we’re not able to very cleanly see how we should divide the orange points from the blue points, and you don’t see a very clean separation there. So it seems like we don’t have enough sophistication inside of our network to be able to model something that is that complex. We need a better model for this neural network. And I’ll do that by adding a hidden layer. So now I have a hidden layer that has two neurons inside of it. So I have two inputs that then go to two neurons inside of a hidden layer that then go to our output. And now I’ll press Play. And what you’ll notice here is that we’re able to do slightly better. We’re able to now say, all right, these points are definitely blue. These points are definitely orange. We’re still struggling a little bit with these points up here, though. And what we can do is we can see for each of these hidden neurons, what is it exactly that these hidden neurons are doing? Each hidden neuron is learning its own decision boundary. And we can see what that boundary is. This first neuron is learning, all right, this line that seems to separate some of the blue points from the rest of the points. This other hidden neuron is learning another line that seems to be separating the orange points in the lower right from the rest of the points. So that’s why we’re able to figure out these two areas in the bottom region. But we’re still not able to perfectly classify all of the points. So let’s go ahead and add another neuron. Now we’ve got three neurons inside of our hidden layer and see what we’re able to learn now. All right, well, now we seem to be doing a better job. By learning three different decision boundaries, which each of the three neurons inside of our hidden layer, we’re able to much better figure out how to separate these blue points from the orange points. And we can see what each of these hidden neurons is learning. Each one is learning a slightly different decision boundary. And then we’re combining those decision boundaries together to figure out what the overall output should be. And then we can try it one more time by adding a fourth neuron there and try learning that. And it seems like now we can do even better at trying to separate the blue points from the orange points. But we were only able to do this by adding a hidden layer, by adding some layer that is learning some other boundaries and combining those boundaries to determine the output. And the strength, the size and thickness of these lines indicate how high these weights are, how important each of these inputs is for making this sort of calculation. And we can do maybe one more simulation. Let’s go ahead and try this on a data set that looks like this. Go ahead and get rid of the hidden layer. Here now we’re trying to separate the blue points from the orange points where all the blue points are located, again, inside of a circle effectively. So we’re not going to be able to learn a line. Notice I press Play. And we’re really not able to draw any sort of classification at all because there is no line that cleanly separates the blue points from the orange points. So let’s try to solve this by introducing a hidden layer. I’ll go ahead and press Play. And all right, with two neurons in a hidden layer, we’re able to do a little better because we effectively learned two different decision boundaries. We learned this line here. And we learned this line on the right-hand side. And right now we’re just saying, all right, well, if it’s in between, we’ll call it blue. And if it’s outside, we’ll call it orange. So not great, but certainly better than before, that we’re learning one decision boundary and another. And based on those, we can figure out what the output should be. But let’s now go ahead and add a third neuron and see what happens now. I go ahead and train it. And now, using three different decision boundaries that are learned by each of these hidden neurons, we’re able to much more accurately model this distinction between blue points and orange points. We’re able to figure out maybe with these three decision boundaries, combining them together, you can imagine figuring out what the output should be and how to make that sort of classification. And so the goal here is just to get a sense for having more neurons in these hidden layers allows us to learn more structure in the data, allows us to figure out what the relevant and important decision boundaries are. And then using this backpropagation algorithm, we’re able to figure out what the values of these weights should be in order to train this network to be able to classify one category of points away from another category of points instead. And this is ultimately what we’re going to be trying to do whenever we’re training a neural network. So let’s go ahead and actually see an example of this. You’ll recall from last time that we had this banknotes file that included information about counterfeit banknotes as opposed to authentic banknotes, where I had four different values for each banknote and then a categorization of whether that banknote is considered to be authentic or a counterfeit note. And what I wanted to do was, based on that input information, figure out some function that could calculate based on the input information what category it belonged to. And what I’ve written here in banknotes.py is a neural network that will learn just that, a network that learns based on all of the input whether or not we should categorize a banknote as authentic or as counterfeit. The first step is the same as what we saw from last time. I’m really just reading the data in and getting it into an appropriate format. And so this is where more of the writing Python code on your own comes in, in terms of manipulating this data, massaging the data into a format that will be understood by a machine learning library like scikit-learn or like TensorFlow. And so here I separate it into a training and a testing set. And now what I’m doing down below is I’m creating a neural network. Here I’m using TF, which stands for TensorFlow. Up above, I said import TensorFlow as TF, TF just an abbreviation that we’ll often use so we don’t need to write out TensorFlow every time we want to use anything inside of the library. I’m using TF.keras. Keras is an API, a set of functions that we can use in order to manipulate neural networks inside of TensorFlow. And it turns out there are other machine learning libraries that also use the Keras API. But here I’m saying, all right, go ahead and give me a model that is a sequential model, a sequential neural network, meaning one layer after another. And now I’m going to add to that model what layers I want inside of my neural network. So here I’m saying model.add. Go ahead and add a dense layer. And when we say a dense layer, we mean a layer that is just each of the nodes inside of the layer is going to be connected to each of the nodes from the previous layer. So we have a densely connected layer. This layer is going to have eight units inside of it. So it’s going to be a hidden layer inside of a neural network with eight different units, eight artificial neurons, each of which might learn something different. And I just sort of chose eight arbitrarily. You could choose a different number of hidden nodes inside of the layer. And as we saw before, depending on the number of units there are inside of your hidden layer, more units means you can learn more complex functions. So maybe you can more accurately model the training data. But it comes at the cost. More units means more weights that you need to figure out how to update. So it might be more expensive to do that calculation. And you also run the risk of overfitting on the data. If you have too many units and you learn to just overfit on the training data, that’s not good either. So there is a balance. And there’s often a testing process where you’ll train on some data and maybe validate how well you’re doing on a separate set of data, often called a validation set, to see, all right, which setting of parameters. How many layers should I have? How many units should be in each layer? Which one of those performs the best on the validation set? So you can do some testing to figure out what these hyper parameters, so called, should be equal to. Next, I specify what the input shape is. Meaning, all right, what does my input look like? My input has four values. And so the input shape is just four, because we have four inputs. And then I specify what the activation function is. And the activation function, again, we can choose. There are a number of different activation functions. Here I’m using relu, which you might recall from earlier. And then I’ll add an output layer. So I have my hidden layer. Now I’m adding one more layer that will just have one unit, because all I want to do is predict something like counterfeit build or authentic build. So I just need a single unit. And the activation function I’m going to use here is that sigmoid activation function, which, again, was that S-shaped curve that just gave us a probability of what is the probability that this is a counterfeit build, as opposed to an authentic build. So that, then, is the structure of my neural network, a sequential neural network that has one hidden layer with eight units inside of it, and then one output layer that just has a single unit inside of it. And I can choose how many units there are. I can choose the activation function. Then I’m going to compile this model. TensorFlow gives you a choice of how you would like to optimize the weights. There are various different algorithms for doing that. What type of loss function you want to use. Again, many different options for doing that. And then how I want to evaluate my model, well, I care about accuracy. I care about how many of my points am I able to classify correctly versus not correctly as counterfeit or not counterfeit. And I would like it to report to me how accurate my model is performing. Then, now that I’ve defined that model, I call model.fit to say go ahead and train the model. Train it on all the training data plus all of the training labels. So labels for each of those pieces of training data. And I’m saying run it for 20 epics, meaning go ahead and go through each of these training points 20 times, effectively. Go through the data 20 times and keep trying to update the weights. If I did it for more, I could train for even longer and maybe get a more accurate result. But then after I fit it on all the data, I’ll go ahead and just test it. I’ll evaluate my model using model.evaluate built into TensorFlow that is just going to tell me how well do I perform on the testing data. So ultimately, this is just going to give me some numbers that tell me how well we did in this particular case. So now what I’m going to do is go into banknotes and go ahead and run banknotes.py. And what’s going to happen now is it’s going to read in all of that training data. It’s going to generate a neural network with all my inputs, my eight hidden units inside my layer, and then an output unit. And now what it’s doing is it’s training. It’s training 20 times. And each time you can see how my accuracy is increasing on my training data. It starts off the very first time not very accurate, though better than random, something like 79% of the time. It’s able to accurately classify one bill from another. But as I keep training, notice this accuracy value improves and improves and improves until after I’ve trained through all the data points 20 times, it looks like my accuracy is above 99% on the training data. And here’s where I tested it on a whole bunch of testing data. And it looks like in this case, I was also like 99.8% accurate. So just using that, I was able to generate a neural network that can detect counterfeit bills from authentic bills based on this input data 99.8% of the time, at least based on this particular testing data. And I might want to test it with more data as well, just to be confident about that. But this is really the value of using a machine learning library like TensorFlow. And there are others available for Python and other languages as well. But all I have to do is define the structure of the network and define the data that I’m going to pass into the network. And then TensorFlow runs the backpropagation algorithm for learning what all of those weights should be, for figuring out how to train this neural network to be able to accurately, as accurately as possible, figure out what the output values should be there as well. And so this then was a look at what it is that neural networks can do just using these sequences of layer after layer after layer. And you can begin to imagine applying these to much more general problems. And one big problem in computing and artificial intelligence more generally is the problem of computer vision. Computer vision is all about computational methods for analyzing and understanding images. You might have pictures that you want the computer to figure out how to deal with, how to process those images and figure out how to produce some sort of useful result out of this. You’ve seen this in the context of social media websites that are able to look at a photo that contains a whole bunch of faces. And it’s able to figure out what’s a picture of whom and label those and tag them with appropriate people. This is becoming increasingly relevant as we begin to discuss self-driving cars, that these cars now have cameras. And we would like for the computer to have some sort of algorithm that looks at the image and figures out what color is the light, what cars are around us and in what direction, for example. And so computer vision is all about taking an image and figuring out what sort of computation, what sort of calculation we can do with that image. It’s also relevant in the context of something like handwriting recognition. This, what you’re looking at, is an example of the MNIST data set. It’s a big data set just of handwritten digits that we could use to ideally try and figure out how to predict, given someone’s handwriting, given a photo of a digit that they have drawn, can you predict whether it’s a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. So this sort of handwriting recognition is yet another task that we might want to use computer vision tasks and tools to be able to apply it towards. This might be a task that we might care about. So how, then, can we use neural networks to be able to solve a problem like this? Well, neural networks rely upon some sort of input where that input is just numerical data. We have a whole bunch of units where each one of them just represents some sort of number. And so in the context of something like handwriting recognition or in the context of just an image, you might imagine that an image is really just a grid of pixels, grid of dots where each dot has some sort of color. And in the context of something like handwriting recognition, you might imagine that if you just fill in each of these dots in a particular way, you can generate a 2 or an 8, for example, based on which dots happen to be shaded in and which dots are not. And we can represent each of these pixel values just using numbers. So for a particular pixel, for example, 0 might represent entirely black. Depending on how you’re representing color, it’s often common to represent color values on a 0 to 255 range so that you can represent a color using 8 bits for a particular value, like how much white is in the image. So 0 might represent all black. 255 might represent entirely white as a pixel. And somewhere in between might represent some shade of gray, for example. But you might imagine not just having a single slider that determines how much white is in the image, but if you had a color image, you might imagine three different numerical values, a red, green, and blue value, where the red value controls how much red is in the image. We have one value for controlling how much green is in the pixel and one value for how much blue is in the pixel as well. And depending on how it is that you set these values of red, green, and blue, you can get a different color. And so any pixel can really be represented, in this case, by three numerical values, a red value, a green value, and a blue value. And if you take a whole bunch of these pixels, assemble them together inside of a grid of pixels, then you really just have a whole bunch of numerical values that you can use in order to perform some sort of prediction task. And so what you might imagine doing is using the same techniques we talked about before, just design a neural network with a lot of inputs, that for each of the pixels, we might have one or three different inputs in the case of a color image, a different input that is just connected to a deep neural network, for example. And this deep neural network might take all of the pixels inside of the image of what digit a person drew. And the output might be like 10 neurons that classify it as a 0, or a 1, or a 2, or a 3, or just tells us in some way what that digit happens to be. Now, there are a couple of drawbacks to this approach. The first drawback to the approach is just the size of this input array, that we have a whole bunch of inputs. If we have a big image that has a lot of different channels, we’re looking at a lot of inputs, and therefore a lot of weights that we have to calculate. And a second problem is the fact that by flattening everything into just this structure of all the pixels, we’ve lost access to a lot of the information about the structure of the image that’s relevant, that really, when a person looks at an image, they’re looking at particular features of the image. They’re looking at curves. They’re looking at shapes. They’re looking at what things can you identify in different regions of the image, and maybe put those things together in order to get a better picture of what the overall image is about. And by just turning it into pixel values for each of the pixels, sure, you might be able to learn that structure, but it might be challenging in order to do so. It might be helpful to take advantage of the fact that you can use properties of the image itself, the fact that it’s structured in a particular way, to be able to improve the way that we learn based on that image too. So in order to figure out how we can train our neural networks to better be able to deal with images, we’ll introduce a couple of ideas, a couple of algorithms that we can apply that allow us to take the image and extract some useful information out of that image. And the first idea we’ll introduce is the notion of image convolution. And what image convolution is all about is it’s about filtering an image, sort of extracting useful or relevant features out of the image. And the way we do that is by applying a particular filter that basically adds the value for every pixel with the values for all of the neighboring pixels to it, according to some sort of kernel matrix, which we’ll see in a moment, is going to allow us to weight these pixels in various different ways. And the goal of image convolution, then, is to extract some sort of interesting or useful features out of an image, to be able to take a pixel and, based on its neighboring pixels, maybe predict some sort of valuable information. Something like taking a pixel and looking at its neighboring pixels, you might be able to predict whether or not there’s some sort of curve inside the image, or whether it’s forming the outline of a particular line or a shape, for example. And that might be useful if you’re trying to use all of these various different features to combine them to say something meaningful about an image as a whole. So how, then, does image convolution work? Well, we start with a kernel matrix. And the kernel matrix looks something like this. And the idea of this is that, given a pixel that will be the middle pixel, we’re going to multiply each of the neighboring pixels by these values in order to get some sort of result by summing up all the numbers together. So if I take this kernel, which you can think of as a filter that I’m going to apply to the image, and let’s say that I take this image. This is a 4 by 4 image. We’ll think of it as just a black and white image, where each one is just a single pixel value. So somewhere between 0 and 255, for example. So we have a whole bunch of individual pixel values like this. And what I’d like to do is apply this kernel, this filter, so to speak, to this image. And the way I’ll do that is, all right, the kernel is 3 by 3. You can imagine a 5 by 5 kernel or a larger kernel, too. And I’ll take it and just first apply it to the first 3 by 3 section of the image. And what I’ll do is I’ll take each of these pixel values, multiply it by its corresponding value in the filter matrix, and add all of the results together. So here, for example, I’ll say 10 times 0, plus 20 times negative 1, plus 30 times 0, so on and so forth, doing all of this calculation. And at the end, if I take all these values, multiply them by their corresponding value in the kernel, add the results together, for this particular set of 9 pixels, I get the value of 10, for example. And then what I’ll do is I’ll slide this 3 by 3 grid, effectively, over. I’ll slide the kernel by 1 to look at the next 3 by 3 section. Here, I’m just sliding it over by 1 pixel. But you might imagine a different stride length, or maybe I jump by multiple pixels at a time if you really wanted to. You have different options here. But here, I’m just sliding over, looking at the next 3 by 3 section. And I’ll do the same math, 20 times 0, plus 30 times negative 1, plus 40 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. And what I end up getting is the number 20. Then you can imagine shifting over to this one, doing the same thing, calculating the number 40, for example, and then doing the same thing here, and calculating a value there as well. And so what we have now is what we’ll call a feature map. We have taken this kernel, applied it to each of these various different regions, and what we get is some representation of a filtered version of that image. And so to give a more concrete example of why it is that this kind of thing could be useful, let’s take this kernel matrix, for example, which is quite a famous one, that has an 8 in the middle, and then all of the neighboring pixels get a negative 1. And let’s imagine we wanted to apply that to a 3 by 3 part of an image that looks like this, where all the values are the same. They’re all 20, for instance. Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20, subtract 20 for each of the eight neighbors, well, the result of that is you just get that expression, which comes out to be 0. You multiplied 20 by 8, but then you subtracted 20 eight times, according to that particular kernel. The result of all that is just 0. So the takeaway here is that when a lot of the pixels are the same value, we end up getting a value close to 0. If, though, we had something like this, 20 is along this first row, then 50 is in the second row, and 50 is in the third row, well, then when you do this, because it’s the same kind of math, 20 times negative 1, 20 times negative 1, so on and so forth, then I get a higher value, a value like 90 in this particular case. And so the more general idea here is that by applying this kernel, negative 1s, 8 in the middle, and then negative 1s, what I get is when this middle value is very different from the neighboring values, like 50 is greater than these 20s, then you’ll end up with a value higher than 0. If this number is higher than its neighbors, you end up getting a bigger output. But if this value is the same as all of its neighbors, then you get a lower output, something like 0. And it turns out that this sort of filter can therefore be used in something like detecting edges in an image. Or I want to detect the boundaries between various different objects inside of an image. I might use a filter like this, which is able to tell whether the value of this pixel is different from the values of the neighboring pixel, if it’s greater than the values of the pixels that happen to surround it. And so we can use this in terms of image filtering. And so I’ll show you an example of that. I have here in filter.py a file that uses Python’s image library, or PIL, to do some image filtering. I go ahead and open an image. And then all I’m going to do is apply a kernel to that image. It’s going to be a 3 by 3 kernel, same kind of kernel we saw before. And here is the kernel. This is just a list representation of the same matrix that I showed you a moment ago. It’s negative 1, negative 1, negative 1. The second row is negative 1, 8, negative 1. And the third row is all negative 1s. And then at the end, I’m going to go ahead and show the filtered image. So if, for example, I go into convolution directory and I open up an image, like bridge.png, this is what an input image might look like, just an image of a bridge over a river. Now I’m going to go ahead and run this filter program on the bridge. And what I get is this image here. Just by taking the original image and applying that filter to each 3 by 3 grid, I’ve extracted all of the boundaries, all of the edges inside the image that separate one part of the image from another. So here I’ve got a representation of boundaries between particular parts of the image. And you might imagine that if a machine learning algorithm is trying to learn what an image is of, a filter like this could be pretty useful. Maybe the machine learning algorithm doesn’t care about all of the details of the image. It just cares about certain useful features. It cares about particular shapes that are able to help it determine that based on the image, this is going to be a bridge, for example. And so this type of idea of image convolution can allow us to apply filters to images that allow us to extract useful results out of those images, taking an image and extracting its edges, for example. And you might imagine many other filters that could be applied to an image that are able to extract particular values as well. And a filter might have separate kernels for the red values, the green values, and the blue values that are all summed together at the end, such that you could have particular filters looking for, is there red in this part of the image? Are there green in other parts of the image? You can begin to assemble these relevant and useful filters that are able to do these calculations as well. So that then was the idea of image convolution, applying some sort of filter to an image to be able to extract some useful features out of that image. But all the while, these images are still pretty big. There’s a lot of pixels involved in the image. And realistically speaking, if you’ve got a really big image, that poses a couple of problems. One, it means a lot of input going into the neural network. But two, it also means that we really have to care about what’s in each particular pixel. Whereas realistically, we often, if you’re looking at an image, you don’t care whether something is in one particular pixel versus the pixel immediately to the right of it. They’re pretty close together. You really just care about whether there’s a particular feature in some region of the image. And maybe you don’t care about exactly which pixel it happens to be in. And so there’s a technique we can use known as pooling. And what pooling is, is it means reducing the size of an input by sampling from regions inside of the input. So we’re going to take a big image and turn it into a smaller image by using pooling. And in particular, one of the most popular types of pooling is called max pooling. And what max pooling does is it pools just by choosing the maximum value in a particular region. So for example, let’s imagine I had this 4 by 4 image. But I wanted to reduce its dimensions. I wanted to make it a smaller image so that I have fewer inputs to work with. Well, what I could do is I could apply a 2 by 2 max pool, where the idea would be that I’m going to first look at this 2 by 2 region and say, what is the maximum value in that region? Well, it’s the number 50. So we’ll go ahead and just use the number 50. And then we’ll look at this 2 by 2 region. What is the maximum value here? It’s 110, so that’s going to be my value. Likewise here, the maximum value looks like 20. Go ahead and put that there. Then for this last region, the maximum value was 40. So we’ll go ahead and use that. And what I have now is a smaller representation of this same original image that I obtained just by picking the maximum value from each of these regions. So again, the advantages here are now I only have to deal with a 2 by 2 input instead of a 4 by 4. And you can imagine shrinking the size of an image even more. But in addition to that, I’m now able to make my analysis independent of whether a particular value was in this pixel or this pixel. I don’t care if the 50 was here or here. As long as it was generally in this region, I’ll still get access to that value. So it makes our algorithms a little bit more robust as well. So that then is pooling, taking the size of the image, reducing it a little bit by just sampling from particular regions inside of the image. And now we can put all of these ideas together, pooling, image convolution, and neural networks all together into another type of neural network called a convolutional neural network, or a CNN, which is a neural network that uses this convolution step usually in the context of analyzing an image, for example. And so the way that a convolutional neural network works is that we start with some sort of input image, some grid of pixels. But rather than immediately put that into the neural network layers that we’ve seen before, we’ll start by applying a convolution step, where the convolution step involves applying some number of different image filters to our original image in order to get what we call a feature map, the result of applying some filter to an image. And we could do this once, but in general, we’ll do this multiple times, getting a whole bunch of different feature maps, each of which might extract some different relevant feature out of the image, some different important characteristic of the image that we might care about using in order to calculate what the result should be. And in the same way that when we train neural networks, we can train neural networks to learn the weights between particular units inside of the neural networks, we can also train neural networks to learn what those filters should be, what the values of the filters should be in order to get the most useful, most relevant information out of the original image just by figuring out what setting of those filter values, the values inside of that kernel, results in minimizing the loss function, minimizing how poorly our hypothesis actually performs in figuring out the classification of a particular image, for example. So we first apply this convolution step, get a whole bunch of these various different feature maps. But these feature maps are quite large. There’s a lot of pixel values that happen to be here. And so a logical next step to take is a pooling step, where we reduce the size of these images by using max pooling, for example, extracting the maximum value from any particular region. There are other pooling methods that exist as well, depending on the situation. You could use something like average pooling, where instead of taking the maximum value from a region, you take the average value from a region, which has its uses as well. But in effect, what pooling will do is it will take these feature maps and reduce their dimensions so that we end up with smaller grids with fewer pixels. And this then is going to be easier for us to deal with. It’s going to mean fewer inputs that we have to worry about. And it’s also going to mean we’re more resilient, more robust against potential movements of particular values, just by one pixel, when ultimately we really don’t care about those one-pixel differences that might arise in the original image. And now, after we’ve done this pooling step, now we have a whole bunch of values that we can then flatten out and just put into a more traditional neural network. So we go ahead and flatten it, and then we end up with a traditional neural network that has one input for each of these values in each of these resulting feature maps after we do the convolution and after we do the pooling step. And so this then is the general structure of a convolutional network. We begin with the image, apply convolution, apply pooling, flatten the results, and then put that into a more traditional neural network that might itself have hidden layers. You can have deep convolutional networks that have hidden layers in between this flattened layer and the eventual output to be able to calculate various different features of those values. But this then can help us to be able to use convolution and pooling to use our knowledge about the structure of an image to be able to get better results, to be able to train our networks faster in order to better capture particular parts of the image. And there’s no reason necessarily why you can only use these steps once. In fact, in practice, you’ll often use convolution and pooling multiple times in multiple different steps. See, what you might imagine doing is starting with an image, first applying convolution to get a whole bunch of maps, then applying pooling, then applying convolution again, because these maps are still pretty big. You can apply convolution to try and extract relevant features out of this result. Then take those results, apply pooling in order to reduce their dimensions, and then take that and feed it into a neural network that maybe has fewer inputs. So here I have two different convolution and pooling steps. I do convolution and pooling once, and then I do convolution and pooling a second time, each time extracting useful features from the layer before it, each time using pooling to reduce the dimensions of what you’re ultimately looking at. And the goal now of this sort of model is that in each of these steps, you can begin to learn different types of features of the original image. That maybe in the first step, you learn very low level features. Just learn and look for features like edges and curves and shapes, because based on pixels and their neighboring values, you can figure out, all right, what are the edges? What are the curves? What are the various different shapes that might be present there? But then once you have a mapping that just represents where the edges and curves and shapes happen to be, you can imagine applying the same sort of process again to begin to look for higher level features, look for objects, maybe look for people’s eyes and facial recognition, for example. Maybe look for more complex shapes like the curves on a particular number if you’re trying to recognize a digit in a handwriting recognition sort of scenario. And then after all of that, now that you have these results that represent these higher level features, you can pass them into a neural network, which is really just a deep neural network that looks like this, where you might imagine making a binary classification or classifying into multiple categories or performing various different tasks on this sort of model. So convolutional neural networks can be quite powerful and quite popular when it comes towards trying to analyze images. We don’t strictly need them. We could have just used a vanilla neural network that just operates with layer after layer, as we’ve seen before. But these convolutional neural networks can be quite helpful, in particular, because of the way they model the way a human might look at an image, that instead of a human looking at every single pixel simultaneously and trying to convolve all of them by multiplying them together, you might imagine that what convolution is really doing is looking at various different regions of the image and extracting relevant information and features out of those parts of the image, the same way that a human might have visual receptors that are looking at particular parts of what they see and using those combining them to figure out what meaning they can draw from all of those various different inputs. And so you might imagine applying this to a situation like handwriting recognition. So we’ll go ahead and see an example of that now, where I’ll go ahead and open up handwriting.py. Again, what we do here is we first import TensorFlow. And then TensorFlow, it turns out, has a few data sets that are built into the library that you can just immediately access. And one of the most famous data sets in machine learning is the MNIST data set, which is just a data set of a whole bunch of samples of people’s handwritten digits. I showed you a slide of that a little while ago. And what we can do is just immediately access that data set which is built into the library so that if I want to do something like train on a whole bunch of handwritten digits, I can just use the data set that is provided to me. Of course, if I had my own data set of handwritten images, I can apply the same idea. I’d first just need to take those images and turn them into an array of pixels, because that’s the way that these are going to be formatted. They’re going to be formatted as, effectively, an array of individual pixels. Now there’s a bit of reshaping I need to do, just turning the data into a format that I can put into my convolutional neural network. So this is doing things like taking all the values and dividing them by 255. If you remember, these color values tend to range from 0 to 255. So I can divide them by 255 just to put them into 0 to 1 range, which might be a little bit easier to train on. And then doing various other modifications to the data just to get it into a nice usable format. But here’s the interesting and important part. Here is where I create the convolutional neural network, the CNN, where here I’m saying, go ahead and use a sequential model. And before I could use model.add to say add a layer, add a layer, add a layer, another way I could define it is just by passing as input to this sequential neural network a list of all of the layers that I want. And so here, the very first layer in my model is a convolution layer, where I’m first going to apply convolution to my image. I’m going to use 13 different filters. So my model is going to learn 32, rather, 32 different filters that I would like to learn on the input image, where each filter is going to be a 3 by 3 kernel. So we saw those 3 by 3 kernels before, where we could multiply each value in a 3 by 3 grid by a value, multiply it, and add all the results together. So here, I’m going to learn 32 different of these 3 by 3 filters. I can, again, specify my activation function. And I specify what my input shape is. My input shape in the banknotes case was just 4. I had 4 inputs. My input shape here is going to be 28, 28, 1, because for each of these handwritten digits, it turns out that the MNIST data set organizes their data. Each image is a 28 by 28 pixel grid. So we’re going to have a 28 by 28 pixel grid. And each one of those images only has one channel value. These handwritten digits are just black and white. So there’s just a single color value representing how much black or how much white. You might imagine that in a color image, if you were doing this sort of thing, you might have three different channels, a red, a green, and a blue channel, for example. But in the case of just handwriting recognition, recognizing a digit, we’re just going to use a single value for, like, shaded in or not shaded in. And it might range, but it’s just a single color value. And that, then, is the very first layer of our neural network, a convolutional layer that will take the input and learn a whole bunch of different filters that we can apply to the input to extract meaningful features. Next step is going to be a max pooling layer, also built right into TensorFlow, where this is going to be a layer that is going to use a pool size of 2 by 2, meaning we’re going to look at 2 by 2 regions inside of the image and just extract the maximum value. Again, we’ve seen why this can be helpful. It’ll help to reduce the size of our input. And once we’ve done that, we’ll go ahead and flatten all of the units just into a single layer that we can then pass into the rest of the neural network. And now, here’s the rest of the neural network. Here, I’m saying, let’s add a hidden layer to my neural network with 128 units, so a whole bunch of hidden units inside of the hidden layer. And just to prevent overfitting, I can add a dropout to that. Say, you know what, when you’re training, randomly dropout half of the nodes from this hidden layer just to make sure we don’t become over-reliant on any particular node, we begin to really generalize and stop ourselves from overfitting. So TensorFlow allows us, just by adding a single line, to add dropout into our model as well, such that when it’s training, it will perform this dropout step in order to help make sure that we don’t overfit on this particular data. And then finally, I add an output layer. The output layer is going to have 10 units, one for each category that I would like to classify digits into, so 0 through 9, 10 different categories. And the activation function I’m going to use here is called the softmax activation function. And in short, what the softmax activation function is going to do is it’s going to take the output and turn it into a probability distribution. So ultimately, it’s going to tell me, what did we estimate the probability is that this is a 2 versus a 3 versus a 4. And so it will turn it into that probability distribution for me. Next up, I’ll go ahead and compile my model and fit it on all of my training data. And then I can evaluate how well the neural network performs. And then I’ve added to my Python program, if I’ve provided a command line argument like the name of a file, I’m going to go ahead and save the model to a file. And so this can be quite useful too. Once you’ve done the training step, which could take some time in terms of taking all the time, going through the data, running back propagation with gradient descent to be able to say, all right, how should we adjust the weight to this particular model? You end up calculating values for these weights, calculating values for these filters. You’d like to remember that information so you can use it later. And so TensorFlow allows us to just save a model to a file, such that later, if we want to use the model we’ve learned, use the weights that we’ve learned to make some sort of new prediction, we can just use the model that already exists. So what we’re doing here is after we’ve done all the calculation, we go ahead and save the model to a file, such that we can use it a little bit later. So for example, if I go into digits, I’m going to run handwriting.py. I won’t save it this time. We’ll just run it and go ahead and see what happens. What will happen is we need to go through the model in order to train on all of these samples of handwritten digits. The MNIST data set gives us thousands and thousands of sample handwritten digits in the same format that we can use in order to train. And so now what you’re seeing is this training process. And unlike the banknotes case, where there was much fewer data points, the data was very, very simple, here this data is more complex and this training process takes time. And so this is another one of those cases where when training neural networks, this is why computational power is so important that oftentimes you see people wanting to use sophisticated GPUs in order to more efficiently be able to do this sort of neural network training. It also speaks to the reason why more data can be helpful. The more sample data points you have, the better you can begin to do this training. So here we’re going through 60,000 different samples of handwritten digits. And I said we’re going to go through them 10 times. We’re going to go through the data set 10 times, training each time, hopefully improving upon our weights with every time we run through this data set. And we can see over here on the right what the accuracy is each time we go ahead and run this model, that the first time it looks like we got an accuracy of about 92% of the digits correct based on this training set. We increased that to 96% or 97%. And every time we run this, we’re going to see hopefully the accuracy improve as we continue to try and use that gradient descent, that process of trying to run the algorithm, to minimize the loss that we get in order to more accurately predict what the output should be. And what this process is doing is it’s learning not only the weights, but it’s learning the features to use, the kernel matrix to use when performing that convolution step. Because this is a convolutional neural network, where I’m first performing those convolutions and then doing the more traditional neural network structure, this is going to learn all of those individual steps as well. And so here we see the TensorFlow provides me with some very nice output, telling me about how many seconds are left with each of these training runs that allows me to see just how well we’re doing. So we’ll go ahead and see how this network performs. It looks like we’ve gone through the data set seven times. We’re going through it an eighth time now. And at this point, the accuracy is pretty high. We saw we went from 92% up to 97%. Now it looks like 98%. And at this point, it seems like things are starting to level out. It’s probably a limit to how accurate we can ultimately be without running the risk of overfitting. Of course, with enough nodes, you would just memorize the input and overfit upon them. But we’d like to avoid doing that. And Dropout will help us with this. But now we see we’re almost done finishing our training step. We’re at 55,000. All right, we finished training. And now it’s going to go ahead and test for us on 10,000 samples. And it looks like on the testing set, we were at 98.8% accurate. So we ended up doing pretty well, it seems, on this testing set to see how accurately can we predict these handwritten digits. And so what we could do then is actually test it out. I’ve written a program called Recognition.py using PyGame. If you pass it a model that’s been trained, and I pre-trained an example model using this input data, what we can do is see whether or not we’ve been able to train this convolutional neural network to be able to predict handwriting, for example. So I can try, just like drawing a handwritten digit. I’ll go ahead and draw the number 2, for example. So there’s my number 2. Again, this is messy. If you tried to imagine, how would you write a program with just ifs and thens to be able to do this sort of calculation, it would be tricky to do so. But here I’ll press Classify, and all right, it seems I was able to correctly classify that what I drew was the number 2. I’ll go ahead and reset it, try it again. We’ll draw an 8, for example. So here is an 8. Press Classify. And all right, it predicts that the digit that I drew was an 8. And the key here is this really begins to show the power of what the neural network is doing, somehow looking at various different features of these different pixels, figuring out what the relevant features are, and figuring out how to combine them to get a classification. And this would be a difficult task to provide explicit instructions to the computer on how to do, to use a whole bunch of ifs ands to process all these pixel values to figure out what the handwritten digit is. Everyone’s going to draw their 8s a little bit differently. If I drew the 8 again, it would look a little bit different. And yet, ideally, we want to train a network to be robust enough so that it begins to learn these patterns on its own. All I said was, here is the structure of the network, and here is the data on which to train the network. And the network learning algorithm just tries to figure out what is the optimal set of weights, what is the optimal set of filters to use them in order to be able to accurately classify a digit into one category or another. Just going to show the power of these sorts of convolutional neural networks. And so that then was a look at how we can use convolutional neural networks to begin to solve problems with regards to computer vision, the ability to take an image and begin to analyze it. So this is the type of analysis you might imagine that’s happening in self-driving cars that are able to figure out what filters to apply to an image to understand what it is that the computer is looking at, or the same type of idea that might be applied to facial recognition and social media to be able to determine how to recognize faces in an image as well. You can imagine a neural network that instead of classifying into one of 10 different digits could instead classify like, is this person A or is this person B, trying to tell those people apart just based on convolution. And so now what we’ll take a look at is yet another type of neural network that can be quite popular for certain types of tasks. But to do so, we’ll try to generalize and think about our neural network a little bit more abstractly. That here we have a sample deep neural network where we have this input layer, a whole bunch of different hidden layers that are performing certain types of calculations, and then an output layer here that just generates some sort of output that we care about calculating. But we could imagine representing this a little more simply like this. Here is just a more abstract representation of our neural network. We have some input that might be like a vector of a whole bunch of different values as our input. That gets passed into a network that performs some sort of calculation or computation, and that network produces some sort of output. That output might be a single value. It might be a whole bunch of different values. But this is the general structure of the neural network that we’ve seen. There is some sort of input that gets fed into the network. And using that input, the network calculates what the output should be. And this sort of model for a neural network is what we might call a feed-forward neural network. Feed-forward neural networks have connections only in one direction. They move from one layer to the next layer to the layer after that, such that the inputs pass through various different hidden layers and then ultimately produce some sort of output. So feed-forward neural networks were very helpful for solving these types of classification problems that we saw before. We have a whole bunch of input. We want to learn what setting of weights will allow us to calculate the output effectively. But there are some limitations on feed-forward neural networks that we’ll see in a moment. In particular, the input needs to be of a fixed shape, like a fixed number of neurons are in the input layer. And there’s a fixed shape for the output, like a fixed number of neurons in the output layer. And that has some limitations of its own. And a possible solution to this, and we’ll see examples of the types of problems we can solve for this in just a second, is instead of just a feed-forward neural network, where there are only connections in one direction from left to right effectively across the network, we could also imagine a recurrent neural network, where a recurrent neural network generates output that gets fed back into itself as input for future runs of that network. So whereas in a traditional neural network, we have inputs that get fed into the network, that get fed into the output. And the only thing that determines the output is based on the original input and based on the calculation we do inside of the network itself. This goes in contrast with a recurrent neural network, where in a recurrent neural network, you can imagine output from the network feeding back to itself into the network again as input for the next time you do the calculations inside of the network. What this allows is it allows the network to maintain some sort of state, to store some sort of information that can be used on future runs of the network. Previously, the network just defined some weights, and we passed inputs through the network, and it generated outputs. But the network wasn’t saving any information based on those inputs to be able to remember for future iterations or for future runs. What a recurrent neural network will let us do is let the network store information that gets passed back in as input to the network again the next time we try and perform some sort of action. And this is particularly helpful when dealing with sequences of data. So we’ll see a real world example of this right now, actually. Microsoft has developed an AI known as the caption bot. And what the caption bot does is it says, I can understand the content of any photograph, and I’ll try to describe it as well as any human. I’ll analyze your photo, but I won’t store it or share it. And so what Microsoft’s caption bot seems to be claiming to do is it can take an image and figure out what’s in the image and just give us a caption to describe it. So let’s try it out. Here, for example, is an image of Harvard Square. It’s some people walking in front of one of the buildings at Harvard Square. I’ll go ahead and take the URL for that image, and I’ll paste it into caption bot and just press Go. So caption bot is analyzing the image, and then it says, I think it’s a group of people walking in front of a building, which seems amazing. The AI is able to look at this image and figure out what’s in the image. And the important thing to recognize here is that this is no longer just a classification task. We saw being able to classify images with a convolutional neural network where the job was take the image and then figure out, is it a 0 or a 1 or a 2, or is it this person’s face or that person’s face? What seems to be happening here is the input is an image, and we know how to get networks to take input of images, but the output is text. It’s a sentence. It’s a phrase, like a group of people walking in front of a building. And this would seem to pose a challenge for our more traditional feed-forward neural networks, for the reason being that in traditional neural networks, we just have a fixed-size input and a fixed-size output. There are a certain number of neurons in the input to our neural network and a certain number of outputs for our neural network, and then some calculation that goes on in between. But the size of the inputs and the number of values in the input and the number of values in the output, those are always going to be fixed based on the structure of the neural network. And that makes it difficult to imagine how a neural network could take an image like this and say it’s a group of people walking in front of the building because the output is text, like it’s a sequence of words. Now, it might be possible for a neural network to output one word, one word you could represent as a vector of values, and you can imagine ways of doing that. Next time, we’ll talk a little bit more about AI as it relates to language and language processing. But a sequence of words is much more challenging because depending on the image, you might imagine the output is a different number of words. We could have sequences of different lengths, and somehow we still want to be able to generate the appropriate output. And so the strategy here is to use a recurrent neural network, a neural network that can feed its own output back into itself as input for the next time. And this allows us to do what we call a one-to-many relationship for inputs to outputs, that in vanilla, more traditional neural networks, these are what we might consider to be one-to-one neural networks. You pass in one set of values as input. You get one vector of values as the output. But in this case, we want to pass in one value as input, the image, and we want to get a sequence, many values as output, where each value is like one of these words that gets produced by this particular algorithm. And so the way we might do this is we might imagine starting by providing input, the image, into our neural network. And the neural network is going to generate output, but the output is not going to be the whole sequence of words, because we can’t represent the whole sequence of words using just a fixed set of neurons. Instead, the output is just going to be the first word. We’re going to train the network to output what the first word of the caption should be. And you could imagine that Microsoft has trained this by running a whole bunch of training samples through the AI, giving it a whole bunch of pictures and what the appropriate caption was, and having the AI begin to learn from that. But now, because the network generates output that can be fed back into itself, you could imagine the output of the network being fed back into the same network. This here looks like a separate network, but it’s really the same network that’s just getting different input, that this network’s output gets fed back into itself, but it’s going to generate another output. And that other output is going to be the second word in the caption. And this recurrent neural network then, this network is going to generate other output that can be fed back into itself to generate yet another word, fed back into itself to generate another word. And so recurrent neural networks allow us to represent this one-to-many structure. You provide one image as input, and the neural network can pass data into the next run of the network, and then again and again, such that you could run the network multiple times, each time generating a different output still based on that original input. And this is where recurrent neural networks become particularly useful when dealing with sequences of inputs or outputs. And my output is a sequence of words, and since I can’t very easily represent outputting an entire sequence of words, I’ll instead output that sequence one word at a time by allowing my network to pass information about what still needs to be said about the photo into the next stage of running the network. So you could run the network multiple times, the same network with the same weights, just getting different input each time. First, getting input from the image, and then getting input from the network itself as additional information about what additionally needs to be given in a particular caption, for example. So this then is a one-to-many relationship inside of a recurrent neural network, but it turns out there are other models that we can use, other ways we can try and use recurrent neural networks to be able to represent data that might be stored in other forms as well. We saw how we could use neural networks in order to analyze images in the context of convolutional neural networks that take an image, figure out various different properties of the image, and are able to draw some sort of conclusion based on that. But you might imagine that something like YouTube, they need to be able to do a lot of learning based on video. They need to look through videos to detect if they’re like copyright violations, or they need to be able to look through videos to maybe identify what particular items are inside of the video, for example. And video, you might imagine, is much more difficult to put in as input to a neural network, because whereas an image, you could just treat each pixel as a different value, videos are sequences. They’re sequences of images, and each sequence might be of different length. And so it might be challenging to represent that entire video as a single vector of values that you could pass in to a neural network. And so here, too, recurrent neural networks can be a valuable solution for trying to solve this type of problem. Then instead of just passing in a single input into our neural network, we could pass in the input one frame at a time, you might imagine. First, taking the first frame of the video, passing it into the network, and then maybe not having the network output anything at all yet. Let it take in another input, and this time, pass it into the network. But the network gets information from the last time we provided an input into the network. Then we pass in a third input, and then a fourth input, where each time, what the network gets is it gets the most recent input, like each frame of the video. But it also gets information the network processed from all of the previous iterations. So on frame number four, you end up getting the input for frame number four plus information the network has calculated from the first three frames. And using all of that data combined, this recurrent neural network can begin to learn how to extract patterns from a sequence of data as well. And so you might imagine, if you want to classify a video into a number of different genres, like an educational video, or a music video, or different types of videos, that’s a classification task, where you want to take as input each of the frames of the video, and you want to output something like what it is, what category that it happens to belong to. And you can imagine doing this sort of thing, this sort of many-to-one learning, any time your input is a sequence. And so input is a sequence in the context of video. It could be in the context of, like, if someone has typed a message and you want to be able to categorize that message, like if you’re trying to take a movie review and trying to classify it as, is it a positive review or a negative review? That input is a sequence of words, and the output is a classification, positive or negative. There, too, a recurrent neural network might be helpful for analyzing sequences of words. And they’re quite popular when it comes to dealing with language. Could even be used for spoken language as well, that spoken language is an audio waveform that can be segmented into distinct chunks. And each of those could be passed in as an input into a recurrent neural network to be able to classify someone’s voice, for instance. If you want to do voice recognition to say, is this one person or is this another, here are also cases where you might want this many-to-one architecture for a recurrent neural network. And then as one final problem, just to take a look at in terms of what we can do with these sorts of networks, imagine what Google Translate is doing. So what Google Translate is doing is it’s taking some text written in one language and converting it into text written in some other language, for example, where now this input is a sequence of data. It’s a sequence of words. And the output is a sequence of words as well. It’s also a sequence. So here we want effectively a many-to-many relationship. Our input is a sequence and our output is a sequence as well. And it’s not quite going to work to just say, take each word in the input and translate it into a word in the output. Because ultimately, different languages put their words in different orders. And maybe one language uses two words for something, whereas another language only uses one. So we really want some way to take this information, this input, encode it somehow, and use that encoding to generate what the output ultimately should be. And this has been one of the big advancements in automated translation technology, is the ability to use the neural networks to do this instead of older, more traditional methods. And this has improved accuracy dramatically. And the way you might imagine doing this is, again, using a recurrent neural network with multiple inputs and multiple outputs. We start by passing in all the input. Input goes into the network. Another input, like another word, goes into the network. And we do this multiple times, like once for each word in the input that I’m trying to translate. And only after all of that is done does the network now start to generate output, like the first word of the translated sentence, and the next word of the translated sentence, so on and so forth, where each time the network passes information to itself by allowing for this model of giving some sort of state from one run in the network to the next run, assembling information about all the inputs, and then passing in information about which part of the output in order to generate next. And there are a number of different types of these sorts of recurrent neural networks. One of the most popular is known as the long short-term memory neural network, otherwise known as LSTM. But in general, these types of networks can be very, very powerful whenever we’re dealing with sequences, whether those are sequences of images or especially sequences of words when it comes towards dealing with natural language. And so that then were just some of the different types of neural networks that can be used to do all sorts of different computations. And these are incredibly versatile tools that can be applied to a number of different domains. We only looked at a couple of the most popular types of neural networks from more traditional feed-forward neural networks, convolutional neural networks, and recurrent neural networks. But there are other types as well. There are adversarial networks where networks compete with each other to try and be able to generate new types of data, as well as other networks that can solve other tasks based on what they happen to be structured and adapted for. And these are very powerful tools in machine learning from being able to very easily learn based on some set of input data and to be able to, therefore, figure out how to calculate some function from inputs to outputs, whether it’s input to some sort of classification like analyzing an image and getting a digit or machine translation where the input is in one language and the output is in another. These tools have a lot of applications for machine learning more generally. Next time, we’ll look at machine learning and AI in particular in the context of natural language. We talked a little bit about this today, but looking at how it is that our AI can begin to understand natural language and can begin to be able to analyze and do useful tasks with regards to human language, which turns out to be a challenging and interesting task. So we’ll see you next time. And welcome back, everybody, to our final class in an introduction to artificial intelligence with Python. Now, so far in this class, we’ve been taking problems that we want to solve intelligently and framing them in ways that computers are going to be able to make sense of. We’ve been taking problems and framing them as search problems or constraint satisfaction problems or optimization problems, for example. In essence, we have been trying to communicate about problems in ways that our computer is going to be able to understand. Today, the goal is going to be to get computers to understand the way you and I communicate naturally via our own natural languages, languages like English. But natural language contains a lot of nuance and complexity that’s going to make it challenging for computers to be able to understand. So we’ll need to explore some new tools and some new techniques to allow computers to make sense of natural language. So what is it exactly that we’re trying to get computers to do? Well, they all fall under this general heading of natural language processing, getting computers to work with natural language. And these tasks include tasks like automatic summarization. Given a long text, can we train the computer to be able to come up with a shorter representation of it? Information extraction, getting the computer to pull out relevant facts or details out of some text. Machine translation, like Google Translate, translating some text from one language into another language. Question answering, if you’ve ever asked a question to your phone or had a conversation with an AI chatbot where you provide some text to the computer, the computer is able to understand that text and then generate some text in response. Text classification, where we provide some text to the computer and the computer assigns it a label, positive or negative, inbox or spam, for example. And there are several other kinds of tasks that all fall under this heading of natural language processing. But before we take a look at how the computer might try to solve these kinds of tasks, it might be useful for us to think about language in general. What are the kinds of challenges that we might need to deal with as we start to think about language and getting a computer to be able to understand it? So one part of language that we’ll need to consider is the syntax of language. Syntax is all about the structure of language. Language is composed of individual words. And those words are composed together in some kind of structured whole. And if our computer is going to be able to understand language, it’s going to need to understand something about that structure. So let’s take a couple of examples. Here, for instance, is a sentence. Just before 9 o’clock, Sherlock Holmes stepped briskly into the room. That sentence is made up of words. And those words together form a structured whole. This is syntactically valid as a sentence. But we could take some of those same words, rearrange them, and come up with a sentence that is not syntactically valid. Here, for example, just before Sherlock Holmes 9 o’clock stepped briskly the room is still composed of valid words. But they’re not in any kind of logical whole. This is not a syntactically well-formed sentence. Another interesting challenge is that some sentences will have multiple possible valid structures. Here’s a sentence, for example. I saw the man on the mountain with a telescope. And here, this is a valid sentence. But it actually has two different possible structures that lend themselves to two different interpretations and two different meanings. Maybe I, the one doing the seeing, am the one with the telescope. Or maybe the man on the mountain is the one with the telescope. And so natural language is ambiguous. Sometimes the same sentence can be interpreted in multiple ways. And that’s something that we’ll need to think about as well. And this lends itself to another problem within language that we’ll need to think about, which is semantics. While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means. And so semantics is going to concern itself with the meaning of words and the meaning of sentences. So if we go back to that same sentence as before, just before 9 o’clock, Sherlock Holmes stepped briskly into the room, I could come up with another sentence, say the sentence, a few minutes before 9, Sherlock Holmes walked quickly into the room. And those are two different sentences with some of the words the same and some of the words different. But the two sentences have essentially the same meaning. And so ideally, whatever model we build, we’ll be able to understand that these two sentences, while different, mean something very similar. Some syntactically well-formed sentences don’t mean anything at all. A famous example from linguist Noam Chomsky is the sentence, colorless green ideas sleep furiously. This is a syntactically, structurally well-formed sentence. We’ve got adjectives modifying a noun, ideas. We’ve got a verb and an adverb in the correct positions. But when taken as a whole, the sentence doesn’t really mean anything. And so if our computers are going to be able to work with natural language and perform tasks in natural language processing, these are some concerns we’ll need to think about. We’ll need to be thinking about syntax. And we’ll need to be thinking about semantics. So how could we go about trying to teach a computer how to understand the structure of natural language? Well, one approach we might take is by starting by thinking about the rules of natural language. Our natural languages have rules. In English, for example, nouns tend to come before verbs. Nouns can be modified by adjectives, for example. And so if only we could formalize those rules, then we could give those rules to a computer, and the computer would be able to make sense of them and understand them. And so let’s try to do exactly that. We’re going to try to define a formal grammar. Where a formal grammar is some system of rules for generating sentences in a language. This is going to be a rule-based approach to natural language processing. We’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language. And there are a number of different types of formal grammars. Each one of them has slightly different use cases. But today, we’re going to focus specifically on one kind of grammar known as a context-free grammar. So how does the context-free grammar work? Well, here is a sentence that we might want a computer to generate. She saw the city. And we’re going to call each of these words a terminal symbol. A terminal symbol, because once our computer has generated the word, there’s nothing else for it to generate. Once it’s generated the sentence, the computer is done. We’re going to associate each of these terminal symbols with a non-terminal symbol that generates it. So here we’ve got n, which stands for noun, like she or city. We’ve got v as a non-terminal symbol, which stands for a verb. And then we have d, which stands for determiner. A determiner is a word like the or a or an in English, for example. So each of these non-terminal symbols can generate the terminal symbols that we ultimately care about generating. But how do we know, or how does the computer know which non-terminal symbols are associated with which terminal symbols? Well, to do that, we need some kind of rule. Here are some what we call rewriting rules that have a non-terminal symbol on the left-hand side of an arrow. And on the right side is what that non-terminal symbol can be replaced with. So here we’re saying the non-terminal symbol n, again, which stands for noun, could be replaced by any of these options separated by vertical bars. n could be replaced by she or city or car or hairy. d for determiner could be replaced by the a or an and so forth. Each of these non-terminal symbols could be replaced by any of these words. We can also have non-terminal symbols that are replaced by other non-terminal symbols. Here is an interesting rule, np arrow n bar dn. So what does that mean? Well, np stands for a noun phrase. Sometimes when we have a noun phrase in a sentence, it’s not just a single word, it could be multiple words. And so here we’re saying a noun phrase could be just a noun, or it could be a determiner followed by a noun. So we might have a noun phrase that’s just a noun, like she, that’s a noun phrase. Or we could have a noun phrase that’s multiple words, something like the city also acts as a noun phrase. But in this case, it’s composed of two words, a determiner, the, and a noun city. We could do the same for verb phrases. A verb phrase, or VP, might be just a verb, or it might be a verb followed by a noun phrase. So we could have a verb phrase that’s just a single word, like the word walked, or we could have a verb phrase that is an entire phrase, something like saw the city, as an entire verb phrase. A sentence, meanwhile, we might then define as a noun phrase followed by a verb phrase. And so this would allow us to generate a sentence like she saw the city, an entire sentence made up of a noun phrase, which is just the word she, and then a verb phrase, which is saw the city, saw which is a verb, and then the city, which itself is also a noun phrase. And so if we could give these rules to a computer explaining to it what non-terminal symbols could be replaced by what other symbols, then a computer could take a sentence and begin to understand the structure of that sentence. And so let’s take a look at an example of how we might do that. And to do that, we’re going to use a Python library called NLTK, or the Natural Language Toolkit, which we’ll see a couple of times today. It contains a lot of helpful features and functions that we can use for trying to deal with and process natural language. So here we’ll take a look at how we can use NLTK in order to parse a context-free grammar. So let’s go ahead and open up cfg0.py, cfg standing for context-free grammar. And what you’ll see in this file is that I first import NLTK, the Natural Language Toolkit. And the first thing I do is define a context-free grammar, saying that a sentence is a noun phrase followed by a verb phrase. I’m defining what a noun phrase is, defining what a verb phrase is, and then giving some examples of what I can do with these non-terminal symbols, D for determiner, N for noun, and V for verb. We’re going to use NLTK to parse that grammar. Then we’ll ask the user for some input in the form of a sentence and split it into words. And then we’ll use this context-free grammar parser to try to parse that sentence and print out the resulting syntax tree. So let’s take a look at an example. We’ll go ahead and go into my cfg directory, and we’ll run cfg0.py. And here I’m asked to type in a sentence. Let’s say I type in she walked. And when I do that, I see that she walked is a valid sentence, where she is a noun phrase, and walked is the corresponding verb phrase. I could try to do this with a more complex sentence too. I could do something like she saw the city. And here we see that she is the noun phrase, and then saw the city is the entire verb phrase that makes up this sentence. So that was a very simple grammar. Let’s take a look at a slightly more complex grammar. Here is cfg1.py, where a sentence is still a noun phrase followed by a verb phrase, but I’ve added some other possible non-terminal symbols too. I have AP for adjective phrase and PP for prepositional phrase. And we specified that we could have an adjective phrase before a noun phrase or a prepositional phrase after a noun, for example. So lots of additional ways that we might try to structure a sentence and interpret and parse one of those resulting sentences. So let’s see that one in action. We’ll go ahead and run cfg1.py with this new grammar. And we’ll try a sentence like she saw the wide street. Here, Python’s NLTK is able to parse that sentence and identify that she saw the wide street has this particular structure, a sentence with a noun phrase and a verb phrase, where that verb phrase has a noun phrase that within it contains an adjective. And so it’s able to get some sense for what the structure of this language actually is. Let’s try another example. Let’s say she saw the dog with the binoculars. And we’ll try that sentence. And here, we get one possible syntax tree, she saw the dog with the binoculars. But notice that this sentence is actually a little bit ambiguous in our own natural language. Who has the binoculars? Is it she who has the binoculars or the dog who has the binoculars? And NLTK is able to identify both possible structures for the sentence. In this case, the dog with the binoculars is an entire noun phrase. It’s all underneath this NP here. So it’s the dog that has the binoculars. But we also got an alternative parse tree, where the dog is just the noun phrase. And with the binoculars is a prepositional phrase modifying saw. So she saw the dog and she used the binoculars in order to see the dog as well. So this allows us to get a sense for the structure of natural language. But it relies on us writing all of these rules. And it would take a lot of effort to write all of the rules for any possible sentence that someone might write or say in the English language. Language is complicated. And as a result, there are going to be some very complex rules. So what else might we try? We might try to take a statistical lens towards approaching this problem of natural language processing. If we were able to give the computer a lot of existing data of sentences written in the English language, what could we try to learn from that data? Well, it might be difficult to try and interpret long pieces of text all at once. So instead, what we might want to do is break up that longer text into smaller pieces of information instead. In particular, we might try to create n-grams out of a longer sequence of text. An n-gram is just some contiguous sequence of n items from a sample of text. It might be n characters in a row or n words in a row, for example. So let’s take a passage from Sherlock Holmes. And let’s look for all of the trigrams. A trigram is an n-gram where n is equal to 3. So in this case, we’re looking for sequences of three words in a row. So the trigrams here would be phrases like how often have. That’s three words in a row. Often have I is another trigram. Have I said, I said to, said to you, to you that. These are all trigrams, sequences of three words that appear in sequence. And if we could give the computer a large corpus of text and have it pull out all of the trigrams in this case, it could get a sense for what sequences of three words tend to appear next to each other in our own natural language and, as a result, get some sense for what the structure of the language actually is. So let’s take a look at an example of that. How can we use NLTK to try to get access to information about n-grams? So here, we’re going to open up ngrams.py. And this is a Python program that’s going to load a corpus of data, just some text files, into our computer’s memory. And then we’re going to use NLTK’s ngrams function, which is going to go through the corpus of text, pulling out all of the ngrams for a particular value of n. And then, by using Python’s counter class, we’re going to figure out what are the most common ngrams inside of this entire corpus of text. And we’re going to need a data set in order to do this. And I’ve prepared a data set of some of the stories of Sherlock Holmes. So it’s just a bunch of text files. A lot of words for it to analyze. And as a result, we’ll get a sense for what sequences of two words or three words that tend to be most common in natural language. So let’s give this a try. We’ll go into my ngrams directory. And we’ll run ngrams.py. We’ll try an n value of 2. So we’re looking for sequences of two words in a row. And we’ll use our corpus of stories from Sherlock Holmes. And when we run this program, we get a list of the most common ngrams where n is equal to 2, otherwise known as a bigram. So the most common one is of the. That’s a sequence of two words that appears quite frequently in natural language. Then in the. And it was. These are all common sequences of two words that appear in a row. Let’s instead now try running ngrams with n equal to 3. Let’s get all of the trigrams and see what we get. And now we see the most common trigrams are it was a. One of the. I think that. These are all sequences of three words that appear quite frequently. And we were able to do this essentially via a process known as tokenization. Tokenization is the process of splitting a sequence of characters into pieces. In this case, we’re splitting a long sequence of text into individual words and then looking at sequences of those words to get a sense for the structure of natural language. So once we’ve done this, once we’ve done the tokenization, once we’ve built up our corpus of ngrams, what can we do with that information? So the one thing that we might try is we could build a Markov chain, which you might recall from when we talked about probability. Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it. And as a result, if we know all of the common ngrams in the English language, what words tend to be associated with what other words in sequence, we can use that to predict what word might come next in a sequence of words. And so we could build a Markov chain for language in order to try to generate natural language that follows the same statistical patterns as some input data. So let’s take a look at that and build a Markov chain for natural language. And as input, I’m going to use the works of William Shakespeare. So here I have a file Shakespeare.txt, which is just a bunch of the works of William Shakespeare. It’s a long text file, so plenty of data to analyze. And here in generator.py, I’m using a third party Python library in order to do this analysis. We’re going to read in the sample of text, and then we’re going to train a Markov model based on that text. And then we’re going to have the Markov chain generate some sentences. We’re going to generate a sentence that doesn’t appear in the original text, but that follows the same statistical patterns that’s generating it based on the ngrams trying to predict what word is likely to come next that we would expect based on those statistical patterns. So we’ll go ahead and go into our Markov directory, run this generator with the works of William Shakespeare’s input. And what we’re going to get are five new sentences, where these sentences are not necessarily sentences from the original input text itself, but just that follow the same statistical patterns. It’s predicting what word is likely to come next based on the input data that we’ve seen and the types of words that tend to appear in sequence there too. And so we’re able to generate these sentences. Of course, so far, there’s no guarantee that any of the sentences that are generated actually mean anything or make any sense. They just happen to follow the statistical patterns that our computer is already aware of. So we’ll return to this issue of how to generate text in perhaps a more accurate or more meaningful way a little bit later. So let’s now turn our attention to a slightly different problem, and that’s the problem of text classification. Text classification is the problem where we have some text and we want to put that text into some kind of category. We want to apply some sort of label to that text. And this kind of problem shows up in a wide variety of places. A commonplace might be your email inbox, for example. You get an email and you want your computer to be able to identify whether the email belongs in your inbox or whether it should be filtered out into spam. So we need to classify the text. Is it a good email or is it spam? Another common use case is sentiment analysis. We might want to know whether the sentiment of some text is positive or negative. And so how might we do that? This comes up in situations like product reviews, where we might have a bunch of reviews for a product on some website. My grandson loved it so much fun. Product broke after a few days. One of the best games I’ve played in a long time and kind of cheap and flimsy, not worth it. Here’s some example sentences that you might see on a product review website. And you and I could pretty easily look at this list of product reviews and decide which ones are positive and which ones are negative. We might say the first one and the third one, those seem like positive sentiment messages. But the second one and the fourth one seem like negative sentiment messages. But how did we know that? And how could we train a computer to be able to figure that out as well? Well, you might have clued your eye in on particular key words, where those particular words tend to mean something positive or negative. So you might have identified words like loved and fun and best tend to be associated with positive messages. And words like broke and cheap and flimsy tend to be associated with negative messages. So if only we could train a computer to be able to learn what words tend to be associated with positive versus negative messages, then maybe we could train a computer to do this kind of sentiment analysis as well. So we’re going to try to do just that. We’re going to use a model known as the bag of words model, which is a model that represents text as just an unordered collection of words. For the purpose of this model, we’re not going to worry about the sequence and the ordering of the words, which word came first, second, or third. We’re just going to treat the text as a collection of words in no particular order. And we’re losing information there, right? The order of words is important. And we’ll come back to that a little bit later. But for now, to simplify our model, it’ll help us tremendously just to think about text as some unordered collection of words. And in particular, we’re going to use the bag of words model to build something known as a naive Bayes classifier. So what is a naive Bayes classifier? Well, it’s a tool that’s going to allow us to classify text based on Bayes rule, again, which you might remember from when we talked about probability. Bayes rule says that the probability of B given A is equal to the probability of A given B multiplied by the probability of B divided by the probability of A. So how are we going to use this rule to be able to analyze text? Well, what are we interested in? We’re interested in the probability that a message has a positive sentiment and the probability that a message has a negative sentiment, which I’m here for simplicity going to represent just with these emoji, happy face and frown face, as positive and negative sentiment. And so if I had a review, something like my grandson loved it, then what I’m interested in is not just the probability that a message has positive sentiment, but the conditional probability that a message has positive sentiment given that this is the message my grandson loved it. But how do I go about calculating this value, the probability that the message is positive given that the review is this sequence of words? Well, here’s where the bag of words model comes in. Rather than treat this review as a string of a sequence of words in order, we’re just going to treat it as an unordered collection of words. We’re going to try to calculate the probability that the review is positive given that all of these words, my grandson loved it, are in the review in no particular order, just this unordered collection of words. And this is a conditional probability, which we can then apply Bayes rule to try to make sense of. And so according to Bayes rule, this conditional probability is equal to what? It’s equal to the probability that all of these four words are in the review given that the review is positive multiplied by the probability that the review is positive divided by the probability that all of these words happen to be in the review. So this is the value now that we’re going to try to calculate. Now, one thing you might notice is that the denominator here, the probability that all of these words appear in the review, doesn’t actually depend on whether or not we’re looking at the positive sentiment or negative sentiment case. So we can actually get rid of this denominator. We don’t need to calculate it. We can just say that this probability is proportional to the numerator. And then at the end, we’re going to need to normalize the probability distribution to make sure that all of the values sum up to the value 1. So now, how do we calculate this value? Well, this is the probability of all of these words given positive times probability of positive. And that, by the definition of joint probability, is just one big joint probability, the probability that all of these things are the case, that it’s a positive review, and that all four of these words are in the review. But still, it’s not entirely obvious how we calculate that value. And here is where we need to make one more assumption. And this is where the naive part of naive Bayes comes in. We’re going to make the assumption that all of the words are independent of each other. And by that, I mean that if the word grandson is in the review, that doesn’t change the probability that the word loved is in the review or that the word it is in the review, for example. And in practice, this assumption might not be true. It’s almost certainly the case that the probability of words do depend on each other. But it’s going to simplify our analysis and still give us reasonably good results just to assume that the words are independent of each other and they only depend on whether it’s positive or negative. You might, for example, expect the word loved to appear more often in a positive review than in a negative review. So what does that mean? Well, if we make this assumption, then we can say that this value, the probability we’re interested in, is not directly proportional to, but it’s naively proportional to this value. The probability that the review is positive times the probability that my is in the review, given that it’s positive, times the probability that grandson is in the review, given that it’s positive, and so on for the other two words that happen to be in this review. And now this value, which looks a little more complex, is actually a value that we can calculate pretty easily. So how are we going to estimate the probability that the review is positive? Well, if we have some training data, some example data of example reviews where each one has already been labeled as positive or negative, then we can estimate the probability that a review is positive just by counting the number of positive samples and dividing by the total number of samples that we have in our training data. And for the conditional probabilities, the probability of loved, given that it’s positive, well, that’s going to be the number of positive samples with loved in it divided by the total number of positive samples. So let’s take a look at an actual example to see how we could try to calculate these values. Here I’ve put together some sample data. The way to interpret the sample data is that based on the training data, 49% of the reviews are positive, 51% are negative. And then over here in this table, we have some conditional probabilities. And then we have if the review is positive, then there is a 30% chance that my appears in it. And if the review is negative, there is a 20% chance that my appears in it. And based on our training data among the positive reviews, 1% of them contain the word grandson. And among the negative reviews, 2% contain the word grandson. So using this data, let’s try to calculate this value, the value we’re interested in. And to do that, we’ll need to multiply all of these values together. The probability of positive, and then all of these positive conditional probabilities. And when we do that, we get some value. And then we can do the same thing for the negative case. We’re going to do the same thing, take the probability that it’s negative, multiply it by all of these conditional probabilities, and we’re going to get some other value. And now these values don’t sum to one. They’re not a probability distribution yet. But I can normalize them and get some values. And that tells me that we’re going to predict that my grandson loved it. We think there’s a 68% chance, probability 0.68, that that is a positive sentiment review, and 0.32 probability that it’s a negative review. So what problems might we run into here? What could potentially go wrong when doing this kind of analysis in order to analyze whether text has a positive or negative sentiment? Well, a couple of problems might arise. One problem might be, what if the word grandson never appears for any of the positive reviews? If that were the case, then when we try to calculate the value, the probability that we think the review is positive, we’re going to multiply all these values together, and we’re just going to get 0 for the positive case, because we’re all going to ultimately multiply by that 0 value. And so we’re going to say that we think there is no chance that the review is positive because it contains the word grandson. And in our training data, we’ve never seen the word grandson appear in a positive sentiment message before. And that’s probably not the right analysis, because in cases of rare words, it might be the case that in nowhere in our training data did we ever see the word grandson appear in a message that has positive sentiment. So what can we do to solve this problem? Well, one thing we’ll often do is some kind of additive smoothing, where we add some value alpha to each value in our distribution just to smooth out the data a little bit. And a common form of this is Laplace smoothing, where we add 1 to each value in our distribution. In essence, we pretend we’ve seen each value one more time than we actually have. So if we’ve never seen the word grandson for a positive review, we pretend we’ve seen it once. If we’ve seen it once, we pretend we’ve seen it twice, just to avoid the possibility that we might multiply by 0 and as a result, get some results we don’t want in our analysis. So let’s see what this looks like in practice. Let’s try to do some naive Bayes classification in order to classify text as either positive or negative. We’ll take a look at sentiment.py. And what this is going to do is load some sample data into memory, some examples of positive reviews and negative reviews. And then we’re going to train a naive Bayes classifier on all of this training data, training data that includes all of the words we see in positive reviews and all of the words we see in negative reviews. And then we’re going to try to classify some input. And so we’re going to do this based on a corpus of data. I have some example positive reviews. Here are some positive reviews. It was great, so much fun, for example. And then some negative reviews, not worth it, kind of cheap. These are some examples of negative reviews. So now let’s try to run this classifier and see how it would classify particular text as either positive or negative. We’ll go ahead and run our sentiment analysis on this corpus. And we need to provide it with a review. So I’ll say something like, I enjoyed it. And we see that the classifier says there is about a 0.92 probability that we think that this particular review is positive. Let’s try something negative. We’ll try kind of overpriced. And we see that there is a 0.96 probability now that we think that this particular review is negative. And so our naive Bayes classifier has learned what kinds of words tend to appear in positive reviews and what kinds of words tend to appear in negative reviews. And as a result of that, we’ve been able to design a classifier that can predict whether a particular review is positive or negative. And so this definitely is a useful tool that we can use to try and make some predictions. But we had to make some assumptions in order to get there. So what if we want to now try to build some more sophisticated models, use some tools from machine learning to try and take better advantage of language data to be able to draw more accurate conclusions and solve new kinds of tasks and new kinds of problems? Well, we’ve seen a couple of times now that when we want to take some data and take some input, put it in a way that the computer is going to be able to make sense of, it can be helpful to take that data and turn it into numbers, ultimately. And so what we might want to try to do is come up with some word representation, some way to take a word and translate its meaning into numbers. Because, for example, if we wanted to use a neural network to be able to process language, give our language to a neural network and have it make some predictions or perform some analysis there, a neural network takes its input and produces its output a vector of values, a vector of numbers. And so what we might want to do is take our data and somehow take words and convert them into some kind of numeric representation. So how might we do that? How might we take words and turn them into numbers? Let’s take a look at an example. Here’s a sentence, he wrote a book. And let’s say I wanted to take each of those words and turn it into a vector of values. Here’s one way I might do that. We’ll say he is going to be a vector that has a 1 in the first position and the rest of the values are 0. Wrote will have a 1 in the second position and the rest of the values are 0. A has a 1 in the third position with the rest of the value 0. And book has a 1 in the fourth position with the rest of the value 0. So each of these words now has a distinct vector representation. And this is what we often call a one-hot representation, a representation of the meaning of a word as a vector with a single 1 and all of the rest of the values are 0. And so when doing this, we now have a numeric representation for every word and we could pass in those vector representations into a neural network or other models that require some kind of numeric data as input. But this one-hot representation actually has a couple of problems and it’s not ideal for a few reasons. One reason is, here we’re just looking at four words. But if you imagine a vocabulary of thousands of words or more, these vectors are going to get quite long in order to have a distinct vector for every possible word in a vocabulary. And as a result of that, these longer vectors are going to be more difficult to deal with, more difficult to train, and so forth. And so that might be a problem. Another problem is a little bit more subtle. If we want to represent a word as a vector, and in particular the meaning of a word as a vector, then ideally it should be the case that words that have similar meanings should also have similar vector representations, so that they’re close to each other together inside a vector space. But that’s not really going to be the case with these one-hot representations, because if we take some similar words, say the word wrote and the word authored, which means similar things, they have entirely different vector representations. Likewise, book and novel, those two words mean somewhat similar things, but they have entirely different vector representations because they each have a one in some different position. And so that’s not ideal either. So what we might be interested in instead is some kind of distributed representation. A distributed representation is the representation of the meaning of a word distributed across multiple values, instead of just being one-hot with a one in one position. Here is what a distributed representation of words might be. Each word is associated with some vector of values, with the meaning distributed across multiple values, ideally in such a way that similar words have a similar vector representation. But how are we going to come up with those values? Where do those values come from? How can we define the meaning of a word in this distributed sequence of numbers? Well, to do that, we’re going to draw inspiration from a quote from British linguist J.R. Firth, who said, you shall know a word by the company it keeps. In other words, we’re going to define the meaning of a word based on the words that appear around it, the context words around it. Take, for example, this context, for blank he ate. You might wonder, what words could reasonably fill in that blank? Well, it might be words like breakfast or lunch or dinner. All of those could reasonably fill in that blank. And so what we’re going to say is because the words breakfast and lunch and dinner appear in a similar context, that they must have a similar meaning. And that’s something our computer could understand and try to learn. A computer could look at a big corpus of text, look at what words tend to appear in similar context to each other, and use that to identify which words have a similar meaning and should therefore appear close to each other inside a vector space. And so one common model for doing this is known as the word to vec model. It’s a model for generating word vectors, a vector representation for every word by looking at data and looking at the context in which a word appears. The idea is going to be this. If you start out with all of the words just in some random position in space and train it on some training data, what the word to vec model will do is start to learn what words appear in similar contexts. And it will move these vectors around in such a way that hopefully words with similar meanings, breakfast, lunch, and dinner, book, memoir, novel, will hopefully appear to be near to each other as vectors as well. So let’s now take a look at what word to vec might look like in practice when implemented in code. What I have here inside of words.txt is a pre-trained model where each of these words has some vector representation trained by word to vec. Each of these words has some sequence of values representing its meaning, hopefully in such a way that similar words are represented by similar vectors. I also have this file vectors.py, which is going to open up the words and form them into a dictionary. And we also define some useful functions like distance to get the distance between two word vectors and closest words to find which words are nearby in terms of having close vectors to each other. And so let’s give this a try. We’ll go ahead and open a Python interpreter. And I’m going to import these vectors. And we might say, all right, what is the vector representation of the word book? And we get this big long vector that represents the word book as a sequence of values. And this sequence of values by itself is not all that meaningful. But it is meaningful in the context of comparing it to other vectors for other words. So we could use this distance function, which is going to get us the distance between two word vectors. And we might say, what is the distance between the vector representation for the word book and the vector representation for the word novel? And we see that it’s 0.34. You can kind of interpret 0 as being really close together and 1 being very far apart. And so now, what is the distance between book and, let’s say, breakfast? Well, book and breakfast are more different from each other than book and novel are. So I would hopefully expect the distance to be larger. And in fact, it is 0.64 approximately. These two words are further away from each other. And what about now the distance between, let’s say, lunch and breakfast? Well, that’s about 0.2. Those are even closer together. They have a meaning that is closer to each other. Another interesting thing we might do is calculate the closest words. We might say, what are the closest words, according to Word2Vec, to the word book? And let’s say, let’s get the 10 closest words. What are the 10 closest vectors to the vector representation for the word book? And when we perform that analysis, we get this list of words. The closest one is book itself, but we also have books plural, and then essay, memoir, essays, novella, anthology, and so on. All of these words mean something similar to the word book, according to Word2Vec, at least, because they have a similar vector representation. So it seems like we’ve done a pretty good job of trying to capture this kind of vector representation of word meaning. One other interesting side effect of Word2Vec is that it’s also able to capture something about the relationships between words as well. Let’s take a look at an example. Here, for instance, are two words, man and king. And these are each represented by Word2Vec as vectors. So what might happen if I subtracted one from the other, calculated the value king minus man? Well, that will be the vector that will take us from man to king, somehow represent this relationship between the vector representation of the word man and the vector representation of the word king. And that’s what this value, king minus man, represents. So what would happen if I took the vector representation of the word woman and added that same value, king minus man, to it? What would we get as the closest word to that, for example? Well, we could try it. Let’s go ahead and go back to our Python interpreter and give this a try. I could say, what is the closest word to the vector representation of the word king minus the representation of the word man plus the representation of the word woman? And we see that the closest word is the word queen. We’ve somehow been able to capture the relationship between king and man. And then when we apply it to the word woman, we get, as the result, the word queen. So Word2Vec has been able to capture not just the words and how they’re similar to each other, but also something about the relationships between words and how those words are connected to each other. So now that we have this vector representation of words, what can we now do with it? Now we can represent words as numbers. And so we might try to pass those words as input to, say, a neural network. Neural networks we’ve seen are very powerful tools for identifying patterns and making predictions. Recall that a neural network you can think of as all of these units. But really what the neural network is doing is taking some input, passing it into the network, and then producing some output. And by providing the neural network with training data, we’re able to update the weights inside of the network so that the neural network can do a more accurate job of translating those inputs into those outputs. And now that we can represent words as numbers that could be the input or output, you could imagine passing a word in as input to a neural network and getting a word as output. And so when might that be useful? One common use for neural networks is in machine translation, when we want to translate text from one language into another, say translate English into French by passing English into the neural network and getting some French output. You might imagine, for instance, that we could take the English word for lamp, pass it into the neural network, get the French word for lamp as output. But in practice, when we’re translating text from one language to another, we’re usually not just interested in translating a single word from one language to another, but a sequence, say a sentence or a paragraph of words. Here, for example, is another paragraph, again taken from Sherlock Holmes, written in English. And what I might want to do is take that entire sentence, pass it into the neural network, and get as output a French translation of the same sentence. But recall that a neural network’s input and output needs to be of some fixed size. And a sentence is not a fixed size. It’s variable. You might have shorter sentences, and you might have longer sentences. So somehow, we need to solve the problem of translating a sequence into another sequence by means of a neural network. And that’s going to be true not only for machine translation, but also for other problems, problems like question answering. If I want to pass as input a question, something like what is the capital of Massachusetts, feed that as input into the neural network, I would hope that what I would get as output is a sentence like the capital is Boston, again, translating some sequence into some other sequence. And if you’ve ever had a conversation with an AI chatbot, or have ever asked your phone a question, it needs to do something like this. It needs to understand the sequence of words that you, the human, provided as input. And then the computer needs to generate some sequence of words as output. So how can we do this? Well, one tool that we can use is the recurrent neural network, which we took a look at last time, which is a way for us to provide a sequence of values to a neural network by running the neural network multiple times. And each time we run the neural network, what we’re going to do is we’re going to keep track of some hidden state. And that hidden state is going to be passed from one run of the neural network to the next run of the neural network, keeping track of all of the relevant information. And so let’s take a look at how we can apply that to something like this. And in particular, we’re going to look at an architecture known as an encoder-decoder architecture, where we’re going to encode this question into some kind of hidden state, and then use a decoder to decode that hidden state into the output that we’re interested in. So what’s that going to look like? We’ll start with the first word, the word what. That goes into our neural network, and it’s going to produce some hidden state. This is some information about the word what that our neural network is going to need to keep track of. Then when the second word comes along, we’re going to feed it into that same encoder neural network, but it’s going to get as input that hidden state as well. So we pass in the second word. We also get the information about the hidden state, and that’s going to continue for the other words in the input. This is going to produce a new hidden state. And so then when we get to the third word, the, that goes into the encoder. It also gets access to the hidden state, and then it produces a new hidden state that gets passed into the next run when we use the word capital. And the same thing is going to repeat for the other words that appear in the input. So of Massachusetts, that produces one final piece of hidden state. Now somehow, we need to signal the fact that we’re done. There’s nothing left in the input. And we typically do this by passing some kind of special token, say an end token, into the neural network. And now the decoding process is going to start. We’re going to generate the word the. But in addition to generating the word the, this decoder network is also going to generate some kind of hidden state. And so what happens the next time? Well, to generate the next word, it might be helpful to know what the first word was. So we might pass the first word the back into the decoder network. It’s going to get as input this hidden state, and it’s going to generate the next word capital. And that’s also going to generate some hidden state. And we’ll repeat that, passing capital into the network to generate the third word is, and then one more time in order to get the fourth word Boston. And at that point, we’re done. But how do we know we’re done? Usually, we’ll do this one more time, pass Boston into the decoder network, and get an output some end token to indicate that that is the end of our input. And so this then is how we could use a recurrent neural network to take some input, encode it into some hidden state, and then use that hidden state to decode it into the output we’re interested in. To visualize it in a slightly different way, we have some input sequence. This is just some sequence of words. That input sequence goes into the encoder, which in this case is a recurrent neural network generating these hidden states along the way until we generate some final hidden state, at which point we start the decoding process. Again, using a recurrent neural network, that’s going to generate the output sequence as well. So we’ve got the encoder, which is encoding the information about the input sequence into this hidden state, and then the decoder, which takes that hidden state and uses it in order to generate the output sequence. But there are some problems. And for many years, this was the state of the art. The recurrent neural network and variance on this approach were some of the best ways we knew in order to perform tasks in natural language processing. But there are some problems that we might want to try to deal with and that have been dealt with over the years to try and improve upon this kind of model. And one problem you might notice happens in this encoder stage. We’ve taken this input sequence, the sequence of words, and encoded it all into this final piece of hidden state. And that final piece of hidden state needs to contain all of the information from the input sequence that we need in order to generate the output sequence. And while that’s possible, it becomes increasingly difficult as the sequence gets larger and larger. For larger and larger input sequences, it’s going to become more and more difficult to store all of the information we need about the input inside this single hidden state piece of context. That’s a lot of information to pack into just a single value. It might be useful for us, when generating output, to not just refer to this one value, but to all of the previous hidden values that have been generated by the encoder. And so that might be useful, but how could we do that? We’ve got a lot of different values. We need to combine them somehow. So you could imagine adding them together, taking the average of them, for example. But doing that would assume that all of these pieces of hidden state are equally important. But that’s not necessarily true either. Some of these pieces of hidden state are going to be more important than others, depending on what word they most closely correspond to. This piece of hidden state very closely corresponds to the first word of the input sequence. This one very closely corresponds to the second word of the input sequence, for example. And some of those are going to be more important than others. To make matters more complicated, depending on which word of the output sequence we’re generating, different input words might be more or less important. And so what we really want is some way to decide for ourselves which of the input values are worth paying attention to, at what point in time. And this is the key idea behind a mechanism known as attention. Attention is all about letting us decide which values are important to pay attention to, when generating, in this case, the next word in our sequence. So let’s take a look at an example of that. Here’s a sentence. What is the capital of Massachusetts? Same sentence as before. And let’s imagine that we were trying to answer that question by generating tokens of output. So what would the output look like? Well, it’s going to look like something like the capital is. And let’s say we’re now trying to generate this last word here. What is that last word? How is the computer going to figure it out? Well, what it’s going to need to do is decide which values it’s going to pay attention to. And so the attention mechanism will allow us to calculate some attention scores for each word, some value corresponding to each word, determining how relevant is it for us to pay attention to that word right now? And in this case, when generating the fourth word of the output sequence, the most important words to pay attention to might be capital and Massachusetts, for example. That those words are going to be particularly relevant. And there are a number of different mechanisms that have been used in order to calculate these attention scores. It could be something as simple as a dot product to see how similar two vectors are, or we could train an entire neural network to calculate these attention scores. But the key idea is that during the training process for our neural network, we’re going to learn how to calculate these attention scores. Our model is going to learn what is important to pay attention to in order to decide what the next word should be. So the result of all of this, calculating these attention scores, is that we can calculate some value, some value for each input word, determining how important is it for us to pay attention to that particular value. And recall that each of these input words is also associated with one of these hidden state context vectors, capturing information about the sentence up to that point, but primarily focused on that word in particular. And so what we can now do is if we have all of these vectors and we have values representing how important is it for us to pay attention to those particular vectors, is we can take a weighted average. We can take all of these vectors, multiply them by their attention scores, and add them up to get some new vector value, which is going to represent the context from the input, but specifically paying attention to the words that we think are most important. And once we’ve done that, that context vector can be fed into our decoder in order to say that the word should be, in this case, Boston. So attention is this very powerful tool that allows any word when we’re trying to decode it to decide which words from the input should we pay attention to in order to determine what’s important for generating the next word of the output. And one of the first places this was really used was in the field of machine translation. Here’s an example of a diagram from the paper that introduced this idea, which was focused on trying to translate English sentences into French sentences. So we have an input English sentence up along the top, and then along the left side, the output French equivalent of that same sentence. And what you see in all of these squares are the attention scores visualized, where a lighter square indicates a higher attention score. And what you’ll notice is that there’s a strong correspondence between the French word and the equivalent English word, that the French word for agreement is really paying attention to the English word for agreement in order to decide what French word should be generated at that point in time. And sometimes you might pay attention to multiple words if you look at the French word for economic. That’s primarily paying attention to the English word for economic, but also paying attention to the English word for European in this case too. And so attention scores are very easy to visualize to get a sense for what is our machine learning model really paying attention to, what information is it using in order to determine what’s important and what’s not in order to determine what the ultimate output token should be. And so when we combine the attention mechanism with a recurrent neural network, we can get very powerful and useful results where we’re able to generate an output sequence by paying attention to the input sequence too. But there are other problems with this approach of using a recurrent neural network as well. In particular, notice that every run of the neural network depends on the output of the previous step. And that was important for getting a sense for the sequence of words and the ordering of those particular words. But we can’t run this unit of the neural network until after we’ve calculated the hidden state from the run before it from the previous input token. And what that means is that it’s very difficult to parallelize this process. That as the input sequence get longer and longer, we might want to use parallelism to try and speed up this process of training the neural network and making sense of all of this language data. But it’s difficult to do that. And it’s slow to do that with a recurrent neural network because all of it needs to be performed in sequence. And that’s become an increasing challenge as we’ve started to get larger and larger language models. The more language data that we have available to us to use to train our machine learning models, the more accurate it can be, the better representation of language it can have, the better understanding it can have, and the better results that we can see. And so we’ve seen this growth of large language models that are using larger and larger data sets. But as a result, they take longer and longer to train. And so this problem that recurrent neural networks are not easy to parallelize has become an increasing problem. And as a result of that, that was one of the main motivations for a different architecture, for thinking about how to deal with natural language. And that’s known as the transformer architecture. And this has been a significant milestone in the world of natural language processing for really increasing how well we can perform these kinds of natural language processing tasks, as well as how quickly we can train a machine learning model to be able to produce effective results. There are a number of different types of transformers in terms of how they work. But what we’re going to take a look at here is the basic architecture for how one might work with a transformer to get a sense for what’s involved and what we’re doing. So let’s start with the model we were looking at before, specifically at this encoder part of our encoder-decoder architecture, where we used a recurrent neural network to take this input sequence and capture all of this information about the hidden state and the information we need to know about that input sequence. Right now, it all needs to happen in this linear progression. But what the transformer is going to allow us to do is process each of the words independently in a way that’s easy to parallelize, rather than have each word wait for some other word. Each word is going to go through this same neural network and produce some kind of encoded representation of that particular input word. And all of this is going to happen in parallel. Now, it’s happening for all of the words at once, but we’re really just going to focus on what’s happening for one word to make it clear. But know that whatever you’re seeing happen for this one word is going to happen for all of the other input words, too. So what’s going on here? Well, we start with some input word. That input word goes into the neural network. And the output is hopefully some encoded representation of the input word, the information we need to know about the input word that’s going to be relevant to us as we’re generating the output. And because we’re doing this each word independently, it’s easy to parallelize. We don’t have to wait for the previous word before we run this word through the neural network. But what did we lose in this process by trying to parallelize this whole thing? Well, we’ve lost all notion of word ordering. The order of words is important. The sentence, Sherlock Holmes gave the book to Watson, has a different meaning than Watson gave the book to Sherlock Holmes. And so we want to keep track of that information about word position. In the recurrent neural network, that happened for us automatically because we could run each word one at a time through the neural network, get the hidden state, pass it on to the next run of the neural network. But that’s not the case here with the transformer, where each word is being processed independent of all of the other ones. So what are we going to do to try to solve that problem? One thing we can do is add some kind of positional encoding to the input word. The positional encoding is some vector that represents the position of the word in the sentence. This is the first word, the second word, the third word, and so forth. We’re going to add that to the input word. And the result of that is going to be a vector that captures multiple pieces of information. It captures the input word itself as well as where in the sentence it appears. The result of that is we can pass the output of that addition, the addition of the input word and the positional encoding into the neural network. That way, the neural network knows the word and where it appears in the sentence and can use both of those pieces of information to determine how best to represent the meaning of that word in the encoded representation at the end of it. In addition to what we have here, in addition to the positional encoding and this feed forward neural network, we’re also going to add one additional component, which is going to be a self-attention step. This is going to be attention where we’re paying attention to the other input words. Because the meaning or interpretation of an input word might vary depending on the other words in the input as well. And so we’re going to allow each word in the input to decide what other words in the input it should pay attention to in order to decide on its encoded representation. And that’s going to allow us to get a better encoded representation for each word because words are defined by their context, by the words around them and how they’re used in that particular context. This kind of self-attention is so valuable, in fact, that oftentimes the transformer will use multiple different self-attention layers at the same time to allow for this model to be able to pay attention to multiple facets of the input at the same time. And we call this multi-headed attention, where each attention head can pay attention to something different. And as a result, this network can learn to pay attention to many different parts of the input for this input word all at the same time. And in the spirit of deep learning, these two steps, this multi-headed self-attention layer and this neural network layer, that itself can be repeated multiple times, too, in order to get a deeper representation, in order to learn deeper patterns within the input text and ultimately get a better representation of language in order to get useful encoded representations of all of the input words. And so this is the process that a transformer might use in order to take an input word and get it its encoded representation. And the key idea is to really rely on this attention step in order to get information that’s useful in order to determine how to encode that word. And that process is going to repeat for all of the input words that are in the input sequence. We’re going to take all of the input words, encode them with some kind of positional encoding, feed those into these self-attention and feed-forward neural networks in order to ultimately get these encoded representations of the words. That’s the result of the encoder. We get all of these encoded representations that will be useful to us when it comes time then to try to decode all of this information into the output sequence we’re interested in. And again, this might take place in the context of machine translation, where the output is going to be the same sentence in a different language, or it might be an answer to a question in the case of an AI chatbot, for example. And so now let’s take a look at how that decoder is going to work. Ultimately, it’s going to have a very similar structure. Any time we’re trying to generate the next output word, we need to know what the previous output word is, as well as its positional encoding. Where in the output sequence are we? And we’re going to have these same steps, self-attention, because we might want an output word to be able to pay attention to other words in that same output, as well as a neural network. And that might itself repeat multiple times. But in this decoder, we’re going to add one additional step. We’re going to add an additional attention step, where instead of self-attention, where the output word is going to pay attention to other output words, in this step, we’re going to allow the output word to pay attention to the encoded representations. So recall that the encoder is taking all of the input words and transforming them into these encoded representations of all of the input words. But it’s going to be important for us to be able to decide which of those encoded representations we want to pay attention to when generating any particular token in the output sequence. And that’s what this additional attention step is going to allow us to do. It’s saying that every time we’re generating a word of the output, we can pay attention to the other words in the output, because we might want to know, what are the words we’ve generated previously? And we want to pay attention to some of them to decide what word is going to be next in the sequence. But we also care about paying attention to the input words, too. And we want the ability to decide which of these encoded representations of the input words are going to be relevant in order for us to generate the next step. And so these two pieces combine together. We have this encoder that takes all of the input words and produces this encoded representation. And we have this decoder that is able to take the previous output word, pay attention to that encoded input, and then generate the next output word. And this is one of the possible architectures we could use for a transformer, with the key idea being these attention steps that allow words to pay attention to each other. During the training process here, we can now much more easily parallelize this, because we don’t have to wait for all of the words to happen in sequence. And we can learn how we should perform these attention steps. The model is able to learn what is important to pay attention to, what things do I need to pay attention to, in order to be more accurate at predicting what the output word is. And this has proved to be a tremendously effective model for conversational AI agents, for building machine translation systems. And there have been many variants proposed on this model, too. Some transformers only use an encoder. Some only use a decoder. Some use some other combination of these different particular features. But the key ideas ultimately remain the same, this real focus on trying to pay attention to what is most important. And the world of natural language processing is fast growing and fast evolving. Year after year, we keep coming up with new models that allow us to do an even better job of performing these natural language related tasks, all on the surface of solving the tricky problem, which is our own natural language. We’ve seen how the syntax and semantics of our language is ambiguous, and it introduces all of these new challenges that we need to think about, if we’re going to be able to design AI agents that are able to work with language effectively. So as we think about where we’ve been in this class, all of the different types of artificial intelligence we’ve considered, we’ve looked at artificial intelligence in a wide variety of different forms now. We started by taking a look at search problems, where we looked at how AI can search for solutions, play games, and find the optimal decision to make. We talked about knowledge, how AI can represent information that it knows and use that information to generate new knowledge as well. Then we looked at what AI can do when it’s less certain, when it doesn’t know things for sure, and we have to represent things in terms of probability. We then took a look at optimization problems. We saw how a lot of problems in AI can be boiled down to trying to maximize or minimize some function. And we looked at strategies that AI can use in order to do that kind of maximizing and minimizing. We then looked at the world of machine learning, learning from data in order to figure out some patterns and identify how to perform a task by looking at the training data that we have available to it. And one of the most powerful tools there was the neural network, the sequence of units whose weights can be trained in order to allow us to really effectively go from input to output and predict how to get there by learning these underlying patterns. And then today, we took a look at language itself, trying to understand how can we train the computer to be able to understand our natural language, to be able to understand syntax and semantics, make sense of and generate natural language, which introduces a number of interesting problems too. And we’ve really just scratched the surface of artificial intelligence. There is so much interesting research and interesting new techniques and algorithms and ideas being introduced to try to solve these types of problems. So I hope you enjoyed this exploration into the world of artificial intelligence. A huge thanks to all of the course’s teaching staff and production team for making the class possible. This was an introduction to artificial intelligence with Python.

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 23, 2025
Full Stack AI Coffee Shop App with Runpod Deployment
This extensive tutorial details the creation of a complete coffee shop customer service chatbot. It begins with the core concepts of building such a bot, including prompt engineering, Retrieval Augmented Generation (RAG), and agent-based systems, before demonstrating how to implement them. The tutorial explores advanced techniques such as creating a Market Basket Analysis recommendation engine and deploying large language models (LLMs) without local GPUs using Runpod. It covers constructing a React Native mobile application, complete with user interface design based on Figma, Firebase integration, and the incorporation of the developed chatbot functionality for taking and processing customer orders.

Prompt Engineering & Recommendation Systems Study Guide

Quiz:
1. What is prompt engineering? Prompt engineering is the art of crafting effective text prompts to elicit desired responses from large language models (LLMs). It involves designing input text that guides the model to generate specific, accurate, and relevant outputs.
2. Explain the importance of structured output in prompt engineering. Structured output ensures that the LLM’s response adheres to a defined format (e.g., JSON), facilitating easy parsing and integration with other systems or databases. It enhances the usability of the generated content by making it predictable and machine-readable.
3. Describe the “give the model time to think” approach (Chain of Thought) and its benefits. The “give the model time to think” approach, particularly Chain of Thought (CoT), encourages the LLM to reason through a problem step-by-step before providing a final answer. This method significantly improves accuracy by guiding the model through a logical thought process, leading to more reliable results.
4. What is a vector embedding, and how is it used in retrieval-augmented generation (RAG) systems? A vector embedding is a numerical representation of text that captures its semantic meaning. In RAG systems, embeddings are used to compare the user’s query with a knowledge base, retrieving the most relevant information to augment the prompt and improve the quality of the LLM’s response.
5. Explain the concept of confidence in Market Basket Analysis. In Market Basket Analysis, confidence measures the likelihood that a customer who purchased item A (antecedent) will also purchase item B (consequent). It helps determine the probability of a customer buying additional items based on what’s already in their cart.
6. What is the significance of the lift metric in Market Basket Analysis? Lift indicates how much more likely two items are to be bought together than if they were bought randomly and independently. A lift value greater than one suggests a positive association, meaning the items are often purchased together.
7. Briefly describe the Apriori algorithm. The Apriori algorithm is a Market Basket Analysis technique that identifies frequent itemsets in a transaction database. It operates using a bottom-up approach. It starts with one item. Then, it builds to Latte and croissant and then builds the items again.
8. In the context of chatbots and prompt engineering, what is a “guard agent” and what role does it play? A guard agent is a component designed to filter user inputs to ensure they adhere to specific guidelines or policies. It analyzes user prompts and determines whether they are appropriate and safe to process, preventing harmful or irrelevant queries from reaching the core chatbot logic.
9. What is the purpose of a classification agent in a chatbot architecture? A classification agent categorizes user inputs to determine the appropriate agent or module to handle the request. This ensures that each query is routed to the most relevant component, such as a details agent for specific information or an order-taking agent for purchase requests.
10. What is an agent protocol and what is the advantage of using one in chatbot development? An agent protocol defines a standard interface for different agents within a chatbot system, ensuring they can interact seamlessly. Using a protocol allows for flexibility and scalability, making it easier to add, remove, or modify agents without disrupting the overall architecture.
Essay Questions:
1. Discuss the three prompt engineering techniques mentioned in the source material.
2. Explain the concept of Market Basket Analysis.
3. Explain in detail how the “give the model time to think” works, including the Chain of Thought mentioned.
4. Describe the role of the recommendation agent in detail.
5. Explain how using a Docker system enables the portability of code.
Glossary of Key Terms:
- API (Application Programming Interface): A set of rules and specifications that software programs can follow to communicate with each other.
- JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate.
- LLM (Large Language Model): A type of artificial intelligence model that uses deep learning techniques to generate human-like text.
- Prompt Engineering: The process of crafting effective input prompts to elicit desired responses from large language models.
- RAG (Retrieval-Augmented Generation): An AI framework that combines a pre-trained language model with an information retrieval system to improve the accuracy and relevance of generated text.
- Vector Embedding: A numerical representation of text or data that captures its semantic meaning in a high-dimensional space.
- Market Basket Analysis: A data mining technique used to identify associations between items in a transaction database, often used for recommendation systems.
- Confidence (Market Basket Analysis): A metric indicating the likelihood that a customer who purchased item A will also purchase item B.
- Lift (Market Basket Analysis): A metric measuring how much more likely two items are to be bought together than if they were bought independently.
- Apriori Algorithm: A Market Basket Analysis algorithm that identifies frequent itemsets in a transaction database.
- Guard Agent: A component in a chatbot designed to filter user inputs and ensure they adhere to specific guidelines or policies.
- Classification Agent: A module in a chatbot that categorizes user inputs to determine the appropriate agent or module to handle the request.
- Agent Protocol: A standardized interface for different agents within a chatbot system, ensuring seamless interaction and scalability.
- Docker: A platform for developing, shipping, and running applications in containers, which are lightweight, portable, and self-sufficient.
- Expo: A framework for building cross-platform mobile apps with React Native.
- React Native: A JavaScript framework for building native mobile apps.
- CSS (Cascading Style Sheets): A style sheet language used for describing the presentation of a document written in a markup language like HTML.
- Tailwind CSS: A utility-first CSS framework for rapidly styling web applications.
- Firebase: A platform from Google for building and deploying mobile apps.
AI Coffee Shop App: A Development Tutorial

Okay, here’s a detailed briefing document summarizing the main themes and important ideas from the provided text. The text appears to be a transcript of someone walking through a practical coding tutorial, focused on building an AI-powered coffee shop application using Python and React Native. The tutorial covers prompt engineering, recommendation engines, vector databases, conversational agents, and front-end development.

I. Prompt Engineering Techniques:
- Structured Output: The tutorial emphasizes guiding LLMs to provide structured outputs, particularly in JSON format, to facilitate easy parsing and integration with other systems.
- “you can write your output and you can start describing the output should be in a structured Json format… you are not allowed to write anything other than the Json object…the country here is Germany and the capital is Berlin and you can see that it’s a list and inside of it it’s a dictionary and it’s a structured format so if I can extract it and put it in a database quite easily.”
- Structured Input: Structuring input by using titles, backticks, and clear delimiters helps the LLM process information more accurately, especially when dealing with multiple inputs.
- “I try to put some titles and then I can also put some back ticks uh to specify the input itself so to specify that this is the input uh that I’m going to use…now you have the input structured and uh the llm is not like is less likely now to forget about a country.”
- Chain of Thought (Giving Model Time to Think): Encouraging the LLM to reason through the problem step-by-step, rather than directly providing an answer, significantly improves accuracy.
- “the idea behind it is that models are are generating uh like the answers word by word so uh right now if they have to generate this answer output uh it might get the uh wrong result… if you told it to think and if you told it to get the outputs like calculate the output step by step it will have the enough words behind it so that it can be directed into the right direction.”
II. Recommendation Engine:
- Market Basket Analysis: The tutorial implements a recommendation engine using Market Basket Analysis techniques, focusing on metrics like confidence, lift, and support.
- “confidence is the measure of the likelihood of a customer who bought a certain item… then he will also buy another item or a set of items that is called the consequent…we can just sort with confidence in a descending fashion and the thing that has the highest confidence we should recommend to users that is it.”
- A priori Algorithm: The A priori algorithm is used to discover frequent item sets and association rules from transaction data.
- “One of the Market Basket analysis algorithms is called the aiori algorithm and it builds this and builds all those numbers together from the bottom up…it starts off with one item then makes this latte like latte and quas then builds the items again so it builds it from the bottom up approach.”
- Confidence, Lift, and Support:Confidence: Measures the likelihood of a customer buying item B if they bought item A.
- Lift: Indicates how much more likely two items are to be bought together than independently. Lift > 1 means a positive association.
- Support: Indicates the frequency of an item or itemset in the dataset. Items with low support are excluded.
III. Vector Database & Embeddings:
- Embeddings: The tutorial demonstrates using embeddings to represent text data numerically, allowing for semantic similarity searches using cosine similarity.
- “llms have the ability to embed those uh text into numbers…if you subtracted car from motorcycle then you will get one and if you subtracted the car from banana… then you will get 44…we are going to get the closest one so this closer number the smaller the number that it is uh the more similar those two concepts are.”
- Pinecone: Pinecone is used as a vector database for efficient storage and retrieval of embeddings.
- “Vector databases are good because when you search an item you can search by the closest thing… the database itself the vector database itself does this for me.”
- Retrieval Augmented Generation (RAG): The tutorial implements a basic RAG system by retrieving relevant data from the vector database (product descriptions) and injecting it into the user’s prompt.
IV. Conversational Agent Architecture:
- Modular Agent Design: The conversational agent is built with a modular architecture consisting of multiple specialized agents (Guard Agent, Classification Agent, Details Agent, Recommendation Agent, Order Taking Agent).
- Agent Protocol: A standardized protocol enables seamless integration and communication between different agents.
- “we don’t care about the uh the agent… whatever chosen diction like whatever chosen agent from the classification I can just add the thread there then get response I don’t care about anything I don’t care about uh uh which agent is which I just care about that it has the get response.”
- Guard Agent: Filters inappropriate or out-of-scope user input.
- Classification Agent: Determines which agent is most suitable to handle the user’s request.
- Details Agent: Retrieves detailed information about products (e.g., price) from the vector database.
- Recommendation Agent: Provides product recommendations using both A priori analysis and popularity-based methods.
- Order Taking Agent: Guides the user through the order process, handles order details, and integrates with the recommendation agent.
- State Management: The Order Taking Agent manages the conversation state (e.g., current order, step number) to provide context-aware responses.
V. Front-End Development (React Native):
- Expo: Expo is used to simplify React Native development.
- Native Wind: Native Wind simplifies CSS styling within React Native.
- Firebase: Firebase is utilized for real-time database functionality (product data).
- Redux Toolkit: Redux Toolkit is used for state management (cart items).
- Navigation: Expo Router is used for navigation between different screens.
- Context API: React Context API is used to share state and functions across components (cart context).
- Asynchronous Data Fetching: useEffect and async/await are used to fetch data from Firebase.
VI. Deployment (Run pod & Docker):
- Docker: Docker is used to containerize the application for deployment, ensuring consistency and reproducibility across different environments.
- Run pod: Run pod is used to deploy the Docker container.
- API Endpoint: The application is deployed with an API endpoint to receive user requests and return responses.
VII. Key Ideas and Facts:
- Version Control: The tutorial stresses the importance of specifying exact versions of Python packages to ensure consistent results.
- “those are the exact versions that I’m using right now but it might work with other versions as well but I feel like I uh if you like include the versions then you’re going to have the same results that I do.”
- Modular Code: The tutorial emphasizes breaking down the application into smaller, manageable modules (agents, components, functions).
- Importance of Clear Communication: The tutorial highlights the need for clear and concise prompts and instructions to guide LLMs effectively.
- Iterative Development: The tutorial demonstrates an iterative development process, where the application is built and tested incrementally.
In essence, the provided text showcases a comprehensive guide to building a modern, AI-driven application, blending backend logic with frontend design and emphasizing best practices in both development and deployment.

Prompt Engineering, RAG, and Recommendation Engines with Python

1. What are the key Python libraries used for prompt engineering and interacting with language models, and how are they installed?

The key Python libraries mentioned are:
- Pandas: For working with structured data like CSV files.
- python-dotenv: For easily reading environment variables from a .env file.
- OpenAI: For interacting with OpenAI language models.
- mlx-ten: For machine learning tasks, specifically used here for market basket analysis.
- Pinecone: For interacting with the Pinecone vector database.
These libraries can be installed using pip:

pip install pandas python-dotenv openai mlx-ten==0.2.3.0 pinecone==5.3.1

Alternatively, you can create a requirements.txt file listing these libraries and their versions and then install them using:

pip install -r requirements.txt

2. What are the three prompt engineering techniques discussed, and how do they improve language model performance?

The three prompt engineering techniques discussed are:
- Structured Output: Instructing the model to output data in a structured format, such as JSON. This makes the model’s output easier to parse and use in downstream applications or databases.
- Structured Input: Organizing the user’s input into distinct sections using titles, backticks, or triple quotes. This helps the model to better understand the different parts of the input, such as instructions, variables, and requests.
- Chain of Thought: Giving the model time to “think” by prompting it to reason through the problem step-by-step before providing the final answer. This can significantly improve accuracy, especially for complex reasoning tasks.
3. What is Retrieval Augmented Generation (RAG), and how does it work?

Retrieval Augmented Generation (RAG) is a technique for improving the relevance of language model responses by incorporating external knowledge. It involves the following steps:
1. Embedding: Converting text data (e.g., documents, product descriptions) into numerical vector representations called embeddings.
2. Vector Database: Storing these embeddings in a vector database like Pinecone.
3. Retrieval: When a user makes a query, the query is also converted into an embedding. The vector database is then searched to find the embeddings that are most similar to the query embedding.
4. Augmentation: The text data associated with the most similar embeddings is retrieved and added to the user’s prompt.
5. Generation: The language model then uses the augmented prompt to generate a response.
This allows the model to provide more relevant and informative answers by drawing upon external knowledge.

4. What is Market Basket Analysis, and how can it be used to create a recommendation engine?

Market Basket Analysis is a technique for identifying relationships between items that are frequently purchased together. In the context of a recommendation engine, it can be used to suggest items to customers based on what they have already placed in their cart or purchased in the past. Key concepts in Market Basket Analysis include:
- Antecedent: An item already present in the customer’s cart.
- Consequent: An item that is recommended based on the presence of the antecedent.
- Confidence: The probability that a customer who buys the antecedent will also buy the consequent.
- Lift: A measure of how much more likely two items are to be bought together than to be bought randomly and independently. A lift greater than 1 indicates a positive association.
- Support: The frequency with which an item or item set appears in the dataset.
Algorithms like Apriori can be used to identify frequent itemsets and generate association rules based on these metrics.

5. How does the Apriori algorithm work in the context of a recommendation engine?

The Apriori algorithm is used to discover frequent itemsets in a transaction database. It starts by identifying individual items that meet a minimum support threshold. Then, it iteratively combines these items to form larger itemsets, pruning any itemsets that do not meet the support threshold. This process continues until no more frequent itemsets can be found. The algorithm then uses these frequent itemsets to generate association rules, which can be used to make recommendations.

6. What are the different types of agents described, and what are their roles in the coffee shop application?

The different types of agents described are:
- Guard Agent: Responsible for filtering out inappropriate or irrelevant user inputs, ensuring that the conversation stays within the intended scope of the application (e.g., preventing the user from asking math questions).
- Classification Agent: Responsible for determining which agent should handle a given user input based on the content of the message (e.g., routing a question about prices to the Details Agent).
- Details Agent: Responsible for providing detailed information about menu items, such as prices or descriptions. It often utilizes a vector database to retrieve relevant information.
- Order Taking Agent: Responsible for taking customer orders, handling the conversation flow, and confirming the order details.
- Recommendations Agent: Responsible for suggesting additional items to customers based on their current order or past purchase history. It can use techniques like market basket analysis or popular recommendations.
7. What is the Agent Protocol used in this application, and why is it important?

The Agent Protocol defines a standard interface for all agents in the application. This interface typically includes a get_response function that takes a user’s message as input and returns a response. By adhering to this protocol, the application can easily add or remove agents without modifying the core orchestration logic. This promotes modularity, maintainability, and extensibility. It also allows a single point to call any agent response.

8. What are the main steps for deploying this application on RunPod, and what is the purpose of a Dockerfile in this process?

The main steps for deploying the application on RunPod are:
1. Create a RunPod Account and Obtain API Key: This allows you to authenticate your requests to the RunPod API.
2. Prepare the Application Code: Ensure that all the necessary files (Python scripts, models, data) are organized and accessible.
3. Create a Dockerfile: A Dockerfile is a text file that contains instructions for building a Docker image. It specifies the base image, dependencies, and commands needed to run the application.
4. Build the Docker Image: Use the docker build command to create a Docker image from the Dockerfile.
5. Push the Docker Image to a Registry (e.g., Docker Hub): This allows RunPod to access the image.
6. Create a RunPod Endpoint: Use the RunPod API to create a new endpoint, specifying the Docker image, resources (CPU, GPU, memory), and other configuration options.
7. Test the Endpoint: Send requests to the endpoint to ensure that the application is running correctly.
The Dockerfile is crucial because it provides a consistent and reproducible way to package the application and its dependencies. This ensures that the application will run correctly on RunPod, regardless of the underlying infrastructure.

Customer Service Chatbot for Coffee Shops

A customer service chatbot can handle various tasks to improve customer experience and drive sales. Here’s how it works:
- Order Management The chatbot can take orders and provide detailed information about menu items.
- Information Retrieval The chatbot answers questions about the coffee shop, such as its location, working hours, and menu items. It can also provide details about the ingredients of a specific item.
- Recommendation Engine The chatbot can suggest complementary products to users, improving the overall customer experience and driving sales. This is achieved through a Market Basket analysis recommendation engine, which identifies items often bought together and suggests them to customers.
- Irrelevant Conversation Blocking The chatbot is designed to filter out irrelevant conversations. A guard agent detects content not related to the coffee shop and prevents the chatbot from engaging in those topics.
- Personalized Recommendations The chatbot can provide tailored suggestions in a conversational manner, enhancing customer experience and potentially increasing sales.
- Modular Design The chatbot uses an agent-based system composed of distinct components or agents. Each agent handles a specific function, such as taking orders, providing information, filtering out irrelevant conversations, or recommending items. This modular approach allows for easy updates and improvements without affecting the entire system.
- Integration with Recommendation Engine The agent-based system can integrate with external systems like a recommendation engine, allowing agents to incorporate outputs from these resources into the conversation.
- Full-Stack Development The chatbot application includes a React Native application that connects to a Firebase database and Runpod endpoints. This setup allows for dynamic display of items, filtering by category, and real-time interaction with the chatbot.
- Availability The chatbot is available 24/7.
- Upselling The chatbot will try to upsell users based on current orders.
A chatbot can be trained with a dataset of coffee shop transactions to identify which items are popular with specific orders. This enables the chatbot to make informed recommendations and provide a seamless user experience. The use of open-source LLMs like Llama allows for full control over the chatbot, including retraining and customization for specific purposes.

Prompt Engineering Techniques for Enhanced Language Model Output

Prompt engineering techniques can enhance the output of language models, making them more accurate and structured. Here are some techniques that can be used to improve a chatbot’s responses:
- Structured Output You can format the chatbot’s output into a structured format, such as a JSON object, so that other systems can understand it and extract data from it.
- Input Structuring Structuring the input helps to separate it into different sections, such as titles and backticks, making the instructions clearer for the language model.
- Giving the Model Time to Think (Chain of Thought) This involves prompting the model to think step by step to increase the accuracy of its answers. The “Chain of Thought” technique can significantly increase accuracy by directing the language model to reason through the problem, calculating the output step by step. This method guides the language model toward the correct direction, enhancing the accuracy and structure of the output.
- Retrieval Augmented Generation (RAG) RAG helps the model to output information that is not already in its memory. This involves injecting relevant information into the prompt so that the user can get the output from it and respond accordingly. This is particularly useful when the chatbot needs to provide information about a specific coffee shop’s menu or details that it was not initially trained on. Injecting data in the prompt allows the chatbot to retrieve and use information it doesn’t have in its memory. The process involves using embeddings to identify the most relevant data to inject into the prompt. Embeddings are the process of changing text into an array of numbers to measure the similarity between two texts. By converting text into embeddings, mathematical operations can be performed to determine the similarity between different pieces of text.
- System Prompts System prompts define how the chatbot should behave, defining the overall behavior.
- Double Checking JSON Using an agent to double-check the JSON output can guarantee that the format is correct and make the code more robust. This involves having a specialized agent whose sole task is to validate and correct the JSON format, ensuring that the output is parsable and error-free.
Recommendation Engine Training: Market Basket Analysis

A recommendation engine can be trained to provide suggestions to customers, improving their overall experience and potentially increasing sales. One type of recommendation engine is the Market Basket analysis recommendation engine.

Key aspects of recommendation engine training:
- Market Basket Analysis This statistical model identifies which items are most popular with specific orders.
- Association Rule Association refers to how likely two items are to be bought together.
- Support This refers to the popularity of a single item.
- Confidence This indicates the likelihood of buying item Y if item X is purchased.
- Lift Lift measures how much more likely two items are to be bought together compared to buying them individually; a lift of 1 indicates no association, while a lift less than 1 suggests a negative association.
- Apriori Algorithm One Market Basket analysis algorithm builds association rules and calculates support, confidence, and lift from the bottom up, starting with single items and then combining them.
- Popularity Recommendation Engine This involves recommending the most popular items to customers who have not provided any specific order information. It can also recommend the most popular items per category.
To train a recommendation engine, a dataset with coffee shop transactions can be used. This dataset includes transaction numbers, items sold, customer information, and quantities.

Agent-Based Chatbots: Architecture, Design, and Functionality

An agent-based chatbot is built using distinct components called agents, each designed to handle a specific function. This approach makes the chatbot more efficient, accurate, and easier to update. Agent-based systems are used in production environments across various industries.

Key aspects of agent-based chatbots:
- Modular Design Each agent is designed to handle a specific function, such as taking orders, providing information, filtering out irrelevant conversations, or recommending items. This modular approach allows for easy updates and improvements without affecting the entire system.
- Specialized Tasks Assigning specialized tasks to agents is key to producing higher accuracy results.
- Guard Agent A guard agent flags content that is not relevant to the coffee shop. If a user is asking irrelevant questions, the guard agent should respond with a default response, such as offering help with an order.
- Input Classifier An input classifier agent classifies user requests into different categories, such as order, recommendations, or details.
- Details Agent This agent answers questions about the coffee shop, menu items, or other details, using a vector database for information retrieval.
- Order Agent This agent outputs the order in a structured format, which can then be easily integrated into an app.
- Recommendation Agent The recommendation agent connects to a trained recommendation engine to provide relevant suggestions.
- Memory Agents have memory so that it can remember what steps it went through and what the next steps are.
- Orchestration Agent controller orchestrates the communication between agents. It first goes to the guard agent, then to the classification agent, and then chooses an agent based on the classification agent.
React Native App with Firebase and Chatbot: Development Guide

A React Native application can be created to complete the customer service chatbot. It can connect seamlessly to both a Firebase database and Runpod endpoints.

Key features and steps in developing the React Native application:
- Home Screen The home screen can retrieve and display items dynamically from a Firebase database. Users can also filter items by category for easier navigation.
- Item Page An item page allows users to view more information about each product, pulling data directly from the database.
- Card Screen A card screen displays all selected items along with the total price.
- Chatbot Screen A dedicated chatbot screen enables users to interact with the chatbot directly within the application. The chatbot connects to Runpod endpoints.
- Navigation The application uses tabs for easy navigation between the home, orders, and chatbot screens.
- Styling Native Wind can be used to simplify CSS styling.
- Data Fetching Firebase can be used to fetch the product data.
- State Management React’s useState hook is used for managing local component state, such as loading states.
- Context API The Context API can be used to manage global states.
Steps to create the React Native application:
1. Install Node.js Node.js is required to run JavaScript code.
2. Install Expo Go Expo Go allows running the application on a smartphone. It is available on both Google Play and the Apple App Store.
3. Create a New Application with Expo Expo is a library that helps write React Native code with helper packages.
4. Install Dependencies Install necessary packages, such as those for routing.
5. Start the Application Run the application using npx Expo start. If running in WSL, use the –tunnel flag.
6. Install Native Wind Install Native Wind to simplify CSS styling.
7. Configure Tailwind CSS Configure Tailwind CSS by initializing it with npx tailwind init and updating the tailwind.config.js file.
8. Install Firebase Firebase is used to fetch data. Install the necessary Firebase packages using npm.
9. Expo Vector Icons Install Expo Vector icons using npm. These can then be used for things such as the tab icons.
The React Native application can be further enhanced with features such as:
- Cart Context To implement cart functionality, a cart context can be created to store and manage cart items.
- Toast Notifications The react-native-root-toast library can be used to display toast notifications when items are added to the cart.
- Details Page When clicking on an item, the application can direct users to a details page with additional information.
- Message List The application should display messages in a list. The messages can be rendered in a scroll view.
By following these steps, a full-stack React Native application can be created, enabling users to interact with the chatbot, view product details, manage their cart, and place orders.

Build and Deploy an AI Chatbot Using LLMs, Python, RunPod, Hugging Face, and React Native

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 21, 2025
Hyperrealistic Robots: A Technological and Artistic Revolution
The text explores the rapidly advancing field of hyperrealistic robotics, showcasing numerous examples of lifelike robots and dolls from various companies worldwide. These creations utilize advanced AI, sophisticated materials like silicone, and intricate designs to achieve an uncanny resemblance to humans. The robots’ capabilities range from basic interaction to complex emotional responses and even self-repairing skin in some cases. The sources also discuss the ethical considerations and potential applications of this technology across diverse sectors, from healthcare and education to entertainment and companionship. Finally, the text examines the creation of hyperrealistic masks, highlighting their artistic applications and potential for misuse.

Robotics and Hyperrealism: A Study Guide

Quiz

Instructions: Answer the following questions in 2-3 sentences each.
1. What materials are commonly used to create the skin of hyperrealistic robots, and why are these materials chosen?
2. How does AI enhance the functionality of hyperrealistic robots? Give two specific examples of AI capabilities.
3. Besides personal companionship, what are two other potential applications of hyperrealistic robots mentioned in the text?
4. What is modular design, and why is it important for the advancement of robotics?
5. What are some of the ethical and social questions raised by the development of hyperrealistic robots?
6. Describe the process used to create silicone skin for robots.
7. How do advanced sensors contribute to the realistic behavior of hyperrealistic robots?
8. Besides facial expressions, what are two other ways that hyperrealistic robots are made to move realistically?
9. What is the role of 3D printing in the creation of hyperrealistic robots?
10. What does the text say about how hyperrealistic robots may be used in the future?
Quiz Answer Key
1. Silicone and thermoplastic elastomer are frequently used because they can mimic the softness, elasticity, and texture of human skin. These materials also allow for a realistic appearance and comfortable touch.
2. AI enables robots to recognize emotions and respond to voice commands. They can also adapt their communication style to suit the preferences of individual users, creating a more personalized experience.
3. Hyperrealistic robots have potential applications in education and training, creating realistic interaction scenarios. They can also be used in advertising, entertainment, and as guides in public venues.
4. Modular design refers to creating robots with components that can be easily swapped or upgraded. This allows for easy customization, repairs, and upgrades to keep the technology current.
5. The text raises questions about the ethics of creating robots indistinguishable from humans, how to protect personal data if a robot copies a specific person, and how to avoid overdependence on them.
6. A metal or plastic frame is created and molds are made that correspond to the robot’s body parts. Silicone is poured into the molds, allowed to harden, and then detailed.
7. Advanced sensors allow robots to detect facial expressions, interpret emotions, and respond to voice commands. They also help robots adapt to their environment, enhancing natural interactions with humans.
8. Hyperrealistic robots use complex servo motors, hydraulic systems and motion capture technologies to simulate realistic body movements and gestures.
9. 3D printing allows for the creation of detailed and precise parts for robots, such as skin molds and internal structures. This helps achieve a high degree of realism in their appearance.
10. Hyperrealistic robots may be used in healthcare to help with elderly care and assist those with special needs. They may also be utilized as companions to reduce social isolation or in entertainment as a type of digital performer.
Essay Questions

Instructions: Answer the following questions in an essay format.
1. Analyze the impact of advanced materials like silicone and thermoplastic elastomer on the development of hyperrealistic robots, considering both their benefits and limitations.
2. Discuss the convergence of robotics and artificial intelligence in the creation of hyperrealistic robots. How does AI contribute to a robot’s ability to interact with humans convincingly?
3. Evaluate the potential social and ethical ramifications of increasingly realistic humanoid robots, focusing on issues such as user dependency, privacy, and the definition of human interaction.
4. Compare and contrast the development approaches of different companies and countries involved in creating hyperrealistic robots, noting any specific technological or design trends.
5. Considering the wide range of potential applications of hyperrealistic robots, from personal companionship to professional settings, argue the most beneficial and problematic areas of their implementation in the future.
Glossary of Key Terms
- Artificial Intelligence (AI): The ability of a computer or a robot controlled by a computer to do tasks that are usually done by humans because they require human intelligence and discernment.
- Hyperrealistic: Extremely lifelike; designed to closely resemble a real human being in appearance and behavior.
- Humanoid Robot: A robot with its body shape built to resemble the human body.
- Modular Design: The process of creating a robot with components that can be easily swapped out, upgraded, or customized.
- Servo Motors: Precision motors used to control movement and position in robots, particularly for facial expressions and gestures.
- Silicone: A synthetic polymer used to create the skin of robots due to its flexibility, softness, and ability to mimic human skin.
- Thermoplastic Elastomer: A type of plastic that combines the properties of rubber and plastic, often used in the creation of lifelike robot skin.
- Motion Capture Technology: The process of recording the movement of real-world objects and people, often using sensors. This technology is often used to program realistic motion into robots.
- 3D Printing: A method of creating three-dimensional objects from a digital design using a printer that deposits materials in layers.
- Facial Recognition System: Technology that identifies and verifies a person from a digital image or a video frame.
Hyperrealistic Robots and Masks: A Technological and Ethical Analysis

Okay, here is a detailed briefing document summarizing the key themes, ideas, and facts from the provided text about hyperrealistic robots and masks:

Briefing Document: Hyperrealistic Robotics and Masks

Executive Summary: This document analyzes a collection of text excerpts detailing the development of hyperrealistic robots and masks. The sources highlight advancements in material science, artificial intelligence, and manufacturing techniques used to create incredibly lifelike human replicas. These technologies are being explored for various applications, ranging from companionship and entertainment to medical training and historical preservation. The document also addresses the ethical and social questions raised by such realistic creations.

Main Themes:
- Advancements in Realism: The most prominent theme is the pursuit of extreme realism in both robotic and mask design. This is achieved through:
- Material Innovation: Emphasis on materials like silicone and thermoplastic elastomer for skin-like textures.
- “they are craft Ed from thermoplastic elastomer a substance that mimics the softness and elasticity of human skin”
- “crafted from premium silicone the robot offers a natural skin texture and impressive durability”
- “silicone has high elasticity which allows it to imitate the softness and elasticity of human skin”
- Detailed Craftsmanship: Focus on intricate facial features, including implanted eyebrows and eyelashes, realistic skin texture, and hand-applied makeup.
- “The robot’s intricate details include finely rendered facial and body features along with implanted eyebrows and eyelashes”
- “her face is meticulously crafted with skin texture and fine details that create an astonishingly lifelike appearance”
- Advanced AI and Motion: Integration of AI for natural language processing, emotion recognition, and adaptable interactions, combined with flexible joints and servo motors for smooth, lifelike movements and expressions.
- “Advanced artificial intelligence allows her to engage in conversations answer questions and adapt to the preferences of her user”
- “Servo Motors integrated into her design enable smooth facial expressions and natural gestures”
- “powered by Advanced AI algorithms and high Precision sensors enabling her to recognize emotions respond to voice commands and adapt her interaction based on user preferences”
- Applications and Purposes: The texts showcase diverse applications for hyperrealistic robots and masks:
- Companionship: Robots designed to provide emotional support and personalized interactions, often targeted towards single men or those experiencing social isolation.
- “these robots are designed to serve as genuine companion partners for single men”
- “the company envisions a future where robots like Arya help reduce social isolation and provide support in various scenarios including education and personal assistance”
- Entertainment and Performance: Robots and masks used for performances, advertising, and creating special effects in film and theater.
- “Her build replicates human gestures and faal Expressions making her ideal for tasks in advertising entertainment”
- “these masks have a wide range of applications in the film industry”
- Medical Training: Robots like Makoto are being developed for medical simulations, providing a safe environment for students to practice procedures with lifelike reactions.
- “This robot named Makoto is designed to provide a more lifelike training experience for medical students… programmed to exhibit pain and has a gag reflex”
- Education and Research: Robots are used in educational settings to teach programming and inspire young people in STEM fields, and in research to study human-robot interaction.
- “humanoid robots also play a role in supplementary education for instance they help instructors in robotics clubs by teach teaching programming fundamentals”
- “this makes her the preferred choice for developers and Pioneers seeking to innovate in areas like artificial intelligence and human machine in interaction”
- Historical and Cultural Preservation: Robots used to represent historical figures for educational and cultural experiences.
- “these robots share profound insights with guests embodying personas of legendary figures like artist Pablo Picasso physicist Albert Einstein philosopher Confucius”
- Customer Service: Robots are also being deployed as receptionists and information providers.
- “junko chahira a lifelike humanoid robot has recently started serving as a receptionist and information provider developed by Toshiba”
- Customization and Upgradability: A significant trend is the modular design of robots, allowing for easy customization of appearance, voice, and even functionality, along with the ability to upgrade components over time. * “With highly flexible joints these robots move naturally replicating human gestures with astonishing accuracy and fluidity one of the standout aspects of this collection is the extensive customization it offers” * “her modular design allows for easy customization and upgrades ensuring she stays up toate”
- Ethical and Social Concerns: The creation of highly realistic robots and masks raises several ethical and social questions:
- Dependence on Robots: The potential for over-reliance on robots and the need to preserve the importance of human interaction.
- “how can we avoid dependence on robots and preserve the importance of human interaction”
- Emotional Impact: The effect of these robots on human perception of connection and communication and how these “relationships” may affect human psychology.
- “The emotional impact of encountering Android allu is profound many feel awe and curiosity challenged by how closely she resembles a living being her creators aim not only to push technological boundaries but to explore how robots like her could transform our understanding of connection and communication”
- Ethical Implications of Hyperrealism: Questions about the ethics of creating robots that are virtually indistinguishable from humans, particularly regarding data privacy and the potential for misuse (e.g., in criminal activities using lifelike masks).
- “how ethical is it to create robots that may be indistinguishable from humans how can we protect personal data if the robot is meant to copy a specific person”
- “however their realism raises certain concerns they can be used for criminal purposes such as disguising one’s identity during crimes or deceiving surveillance systems”
- Cost and Accessibility: The price of these hyperrealistic robots and masks varies significantly from a few thousand dollars to over $100,000, reflecting the complexity and advanced technology involved. While some models are intended to be more affordable, the most advanced are still very expensive.
- “The cost of the robot depends on the selected features with an average price of approximately $2,800”
- “valued at $133,000 acha symbolizes the convergence of advanced Robotics and the peak of human-like interaction”
- “valued at a $133,000 acha symbolizes the convergence of advanced Robotics”
- “introducing mess a revolutionary robot doll priced at $1500”
Key Ideas and Facts:
- Leading Countries: Japan, South Korea, the United States, and China are key leaders in the development of humanoid robots.
- “There are several countries and companies on the global stage that are leaders in the development of realistic humanoid robots historically Japan has been one of the leading countries in robotics”
- Living Skin: Researchers are exploring the use of cultivated cells to create “living skin” for robots, which could potentially self-repair.
- “this breakthrough is enabled by living skin engineered from cultivated cells according to experts this skin can even regenerate itself after damage much like natural skin”
- Silicone as a Key Material: Silicone is widely used for robot skin due to its elasticity, durability, and ability to mimic human texture.
- “silicone is the preferred material for creating robot skin for several reasons”
- 3D Printing in Mask Creation: 3D printing is used to create intricate mask designs.
- “The Masks printed on a 3D printer can only be distinguished from real faces by their immobile lips and fixed gaze”
Robot Models and Companies Highlighted:
- Arya (Realbotics): An American-made humanoid robot designed for realistic interaction and companionship.
- Acha (Engineered Arts): A highly sophisticated and expensive UK-made robot known for its ultra-realistic facial movements.
- Mia (Shenzen Fan Real Art Development): A Chinese-made hyperrealistic female robot known for its silicone skin and lifelike movements.
- Camila, Eva (For UD Doll): Groundbreaking robots with advanced AI and meticulous design.
- Elisa (Matt McMullen): A stunning hyperrealistic robot doll noted for blending aesthetic beauty with advanced technology
- Android All You (Japan): Known for its lifelike appearance, silicone skin, and emotional expressiveness.
- Junko Chahira (Toshiba): A humanoid robot designed for use as a receptionist and information provider.
- Hatsuki (Hatsu M): A Japanese robot that blends robotics with anime aesthetics
- Mangi (Double MX): A Chinese hyperrealistic female robot known for it’s customizable design and high quality silicone skin.
- Merang (RZR doll): A hyperrealistic robot featuring advanced AI, a medical grade silicone skin, and a steel skeleton with fluid movement.
- Makoto (T University Hospital, Japan): A medical simulation robot used for training.
- Henry: A human sized AI robot capable of poetry recitals, singing and jokes.
- Allen: A human sized robot designed to be an AI experimentation platform.
- Cleo (Engineered Arts): Part of the Mesmer series of robots, particularly notable for neck movement and sensory capacity.
Mask Creators Highlighted:
- Landon Meyer (Hyperflesh): American Artist known for eerily realistic latex masks.
- Metamorphose Masks: A British company specializing in ultra-realistic silicone masks for various applications.
- Shui Okawara (Kenot): A Japanese shop that produces hyperrealistic 3D masks based on photographs.
- Ruben Orosco Loza: Mexican artist known for astonishingly lifelike sculptures.
Conclusion: The sources indicate that the development of hyperrealistic robots and masks is rapidly advancing, pushing the boundaries of technology and human interaction. While these advancements offer exciting possibilities across various sectors, they also raise important ethical and social questions that need to be considered as this technology becomes more prevalent.

Hyperrealistic Robots and Masks: A Comprehensive FAQ

FAQ on Hyperrealistic Robots and Masks
1. What materials are primarily used to create the realistic skin on humanoid robots and masks, and why are they preferred?
2. Silicone is the primary material used, often in multiple layers with varying densities to mimic the epidermis, dermis, and subcutaneous tissue. It’s favored for its high elasticity, which allows it to replicate the softness and flexibility of human skin. Silicone is also resistant to temperature changes, moisture, and UV radiation, making it durable and suitable for both indoor and outdoor use. Additionally, its inertness makes it safe for contact with human skin. Sometimes latex is used as well for masks.
3. How are the lifelike facial features and expressions achieved in these robots and masks?
4. Lifelike facial features are achieved through detailed sculpting, often using 3D scanning technology to capture precise skin details, pores, and wrinkles. These details are then transferred to a mold where silicone is cast. Facial expressions are created using small, precise servo motors and hydraulic systems which move the underlying structures to mimic muscle movements. Advanced AI algorithms, sensors, and motion capture technology can be used to control these movements, enabling the robots to display a range of emotions and react to human cues. For masks, meticulously applied paint, hair, and other details are added.
5. How does Artificial Intelligence (AI) contribute to the realism and functionality of these robots?
6. AI is crucial for the robots’ ability to understand and respond to human interactions. AI enables speech recognition, allowing the robots to understand voice commands and engage in conversations. AI also powers the robots’ ability to recognize emotions, adapt their communication style to the user’s preferences, learn from interactions, and customize their responses. Machine learning and neural networks are used to train robots to reproduce human expressions, maintain dialogue, and adjust behavior to different situations. This gives the robots the ability to provide personalized companionship, assistance, and educational experiences.
7. What are some of the ethical and social concerns surrounding the development and use of hyperrealistic humanoid robots and masks?
8. Several ethical and social concerns are raised by these technologies. One concern is the potential for creating robots that are indistinguishable from humans, raising questions about their ethical treatment and rights. Another concern is the risk of personal data being compromised if the robot is designed to mimic a specific individual. There are also concerns about dependence on robots and the possible erosion of human-to-human interaction. Additionally, the realism of masks could be exploited for criminal activities, potentially creating new challenges for security and identification.
9. Beyond personal companionship, what other applications are being explored for hyperrealistic humanoid robots?
10. Hyperrealistic humanoid robots are being explored for a wide array of applications. They are being used in medical settings for training, where realistic simulations allow students to practice procedures on a machine that mimics a human body. In the education sector, they can assist instructors, inspire young minds in STEM fields, and create immersive learning environments. Robots are also being deployed in service sectors as concierges, guides, and sales assistants in hotels, airports, and shopping malls. They are considered for advertising, entertainment and for film and theatre productions for various tasks including dangerous ones. These robots are even being studied for their potential to provide support and companionship for the elderly or people with special needs.
11. How are these robots designed for long-term use and potential upgrades?
12. Robots are often designed with a modular structure that allows for easy customization and upgrades. This design feature ensures that the robots can stay at the forefront of technological advancements. Modular components can be replaced or upgraded without the need to replace the entire robot. High-quality materials are selected for durability and resilience.
13. How do the costs of these hyperrealistic robots and masks vary?
14. The costs of these robots and masks vary widely, depending on their complexity, materials, and capabilities. Some more basic robot dolls can be priced around $1,500 to $3,000, while more advanced models with sophisticated AI and lifelike movements can cost tens of thousands, with a few reaching $133,000. High-quality silicone masks are also expensive, priced at hundreds of dollars for less detailed ones and thousands for customized ones. This high cost reflects the significant level of innovation and intricate craftsmanship involved.
15. What are some of the key technological innovations in the production of this generation of robots?
16. Key technological innovations include the use of advanced silicone and other polymer materials to create realistic skin, the integration of sophisticated servo motors and hydraulic systems for fluid human-like movements and expressions, and the development of complex AI algorithms for speech recognition, emotional understanding, and adaptive learning. Also the use of 3D scanning for the creation of skin molds to make hyperrealistic skin. Modular design and AI are also central to these advances.
Lifelike Robots: Design, Applications, and Implications

Lifelike robots are being developed with the goal of creating human-like appearances and interactions. These robots are designed with advanced technology and materials to mimic human features, movements, and emotional responses.

Key Features of Lifelike Robots:
- Appearance: Many of these robots are designed with a focus on replicating the look and feel of human skin using materials such as silicone and thermoplastic elastomer. These materials provide a soft, elastic texture, and can be further enhanced with details like skin texture, pores, wrinkles, and even implanted eyebrows and eyelashes.
- Facial Expressions: Robots use servo motors and hydraulic systems to achieve realistic facial expressions and movements. These mechanisms allow the robots to convey emotions such as smiling, surprise, and curiosity. Some robots are designed to mimic the facial expressions of specific individuals by analyzing video footage of their expressions.
- Movements: Advanced articulated skeletons and flexible joints allow for fluid, lifelike movements and gestures. Motion capture technologies are also used to teach robots realistic gestures and poses.
- Artificial Intelligence: AI is a crucial component, enabling robots to engage in conversations, respond to voice commands, recognize emotions, and adapt to user preferences. Machine learning and neural networks help robots to learn from interactions and tailor their responses.
- Customization: Many robots offer customizable features, including skin tones, eye colors, hairstyles, clothing, and accessories, allowing users to personalize their interactions. Some robots have modular designs that allow for easy upgrades and replacements of individual components.
Examples of Lifelike Robots:
- Arya: A humanoid robot developed by Realbotics, known for its realistic appearance and advanced AI.
- Mia: A hyperrealistic female robot created by Shenzen Fan Real Art Development, featuring a natural skin texture and fluid movements.
- Ada: A hyperrealistic robot doll by Gynoid Dolls, designed with advanced AI and sophisticated sensors.
- Camila: A robot doll by For UD doll with a focus on hyperrealistic design and functional AI.
- Eva: A lifelike robot doll created by Hu and Gynoid Dolls, designed for both companionship and professional purposes.
- Elisa: A hyperrealistic robot doll developed by Matt McMullen, equipped with AI algorithms and sensors for expressive movements.
- Acha: A humanoid robot from Engineered Arts, recognized for ultra-realistic facial movements.
- Xiao B: A hyperrealistic female robot created by Fud Doall, featuring a medical-grade silicone outer layer.
- Android U: A humanoid robot from Japan, known for her lifelike appearance, expressive gestures, and ability to sing.
- Android An: A robot designed for social settings, with realistic facial features and movements.
- Junko Chihira: A humanoid robot from Toshiba, serving as a receptionist and information provider.
- Hatsuki: A Japanese robot that combines robotics with anime aesthetics.
- Mangi: A hyperrealistic female robot by Double MX, with a customizable design.
- Mesmer: A line of robots from Engineered Arts known for realistic interaction.
- Merang: A hyperrealistic female robot by RZR Doll, designed with medical-grade silicone and advanced AI.
- Makoto: A medical simulation robot developed by Japanese researchers, designed to provide lifelike training for medical students.
- Allen: A robot developed by engineer Will Huff, equipped with AI and various capabilities for interacting with its environment.
- Henry: A robot with lifelike features and conversational skills, designed for companionship and social settings.
- Cleo: A humanoid robot by Engineered Arts, known for its authentic movements and facial gestures.
Materials and Manufacturing:
- Silicone: A popular material used for creating robot skin due to its elasticity, durability, and ability to mimic the softness of human skin. Silicone can be molded, painted, and textured to match human skin tones and features.
- 3D printing: Used to create detailed facial features, molds for silicone skin, and components of the robot’s structure.
- Servo motors: Used to create movement for facial expression and gestures.
- Sensors: Used to detect emotions and other environmental inputs.
Applications of Lifelike Robots:
- Companionship: Many robots are designed to serve as companions for single individuals, the elderly, or people with special needs.
- Entertainment: Robots are used in film, theater, and exhibitions, as well as for creating lifelike dolls and mannequins.
- Education and Training: Robots are used to create realistic learning and training environments, including medical simulations and robotics education.
- Customer Service: Robots can act as concierges, guides, and sales assistants in public places.
- Research: Robots are utilized as platforms for pushing the limits of AI and robotics technology, and to study human-robot interaction.
Ethical and Social Implications:
- The development of lifelike robots raises ethical questions about creating robots that may be indistinguishable from humans.
- Concerns exist about the potential for misuse, such as using robots for criminal purposes or deceiving surveillance systems.
- There are also concerns about the impact on human interaction, dependence on robots, and the protection of personal data when robots are designed to copy specific individuals.
The development of lifelike robots is an ongoing process, with new advancements in materials, AI, and robotics constantly pushing the boundaries of what is possible. As the technology continues to evolve, these robots are expected to play an increasing role in various aspects of society.

Advanced AI in Lifelike Robots

Advanced AI is a key component in the development of lifelike robots, enabling them to interact with humans in more natural and meaningful ways. This technology allows robots to go beyond simple programmed responses and adapt to various situations, learn from interactions, and even recognize and respond to human emotions. Here’s a breakdown of how advanced AI is utilized in lifelike robots:
- Conversational Abilities: Advanced AI allows robots to engage in conversations, understand voice commands, and respond with appropriate and context-aware replies. This includes the ability to use natural language processing, which enables them to understand and generate human-like speech. Some robots, like Arya, can participate in smooth and natural conversations and engage seamlessly with people. Some robots can also offer information in multiple languages.
- Emotional Recognition and Response: AI algorithms and sophisticated sensors enable robots to detect human emotions and adapt their interactions based on user preferences. This can include recognizing facial expressions, tone of voice, and other cues to gauge a person’s emotional state and respond in a way that feels empathetic and appropriate. Some robots are even designed to mimic the facial expressions of specific individuals by analyzing video footage of their expressions.
- Personalized Interactions: AI enables robots to learn from their interactions with users, tailoring their responses and behaviors to create a more personalized experience. This can include remembering user preferences, adapting their communication style, and offering customized content or assistance. For example, Arya can learn from interactions, tailoring her responses to create a personalized experience.
- Learning and Adaptation: Machine learning and neural networks are used to train robots to recognize and reproduce human expressions, respond to voice commands, and maintain dialogues. These systems allow robots to adapt to different situations and adjust their behavior to the context. This ability to learn and adapt is crucial for creating robots that can seamlessly integrate into human environments and provide meaningful interactions.
- Integration with Smart Devices: Some robots are designed to be compatible with smart devices like Google Home, Amazon Alexa, and Apple Siri, making them useful companions in modern smart homes. This allows for seamless control and integration with existing technology.
- Facial Movement: Advanced AI algorithms and high-precision sensors power expressive facial movements in robots, enabling them to convey emotions naturally and respond to user interactions. Some robots use servo motors to achieve a wide range of facial expressions, enhancing the realism of their interactions.
- Examples of Robots with Advanced AI:
- Arya: Known for her sophisticated AI systems that enable smooth, natural conversations and seamless engagement with people.
- Ada: Features state-of-the-art AI and sophisticated sensors that allow it to detect emotions, respond to voice commands, and adapt interactions based on user preferences.
- Camila: Equipped with advanced AI and sensitive sensors that allow it to recognize emotions, respond to voice commands, and adapt its interactions.
- Elisa: Uses cutting-edge AI algorithms and sensors to recognize emotions, interpret voice commands, and adapt her interactions to suit user preferences.
- Mesmer robots: Have a sophisticated control system that allows them to react to external stimuli and learn from interactions, making them suitable for environments requiring human-robot interaction.
- Merang: Equipped with advanced AI, allowing her to recognize facial expressions, interpret emotions, and adapt her responses for meaningful interactions.
- Android U: Has an AI that learns from interactions and adapts seamlessly to different environments and needs.
In summary, advanced AI is essential for creating lifelike robots that can interact with humans in a natural, intuitive, and meaningful way. It provides the robots with the ability to understand and respond to human communication, recognize and react to emotions, and adapt their behaviors over time. As AI technology continues to advance, robots are expected to become even more sophisticated and capable of playing an increasingly important role in society.

Humanoid Robots: Design, Applications, and Implications

Humanoid robots are designed to resemble the human body in form and function, and they represent a significant area of development in robotics. These robots often incorporate advanced technologies to mimic human appearance, movement, and interaction, with the aim of creating robots that can seamlessly integrate into human environments.

Key aspects of humanoid robots:
- Appearance: Humanoid robots often feature a lifelike appearance, with a focus on replicating the look and feel of human skin, often using materials such as silicone and thermoplastic elastomer. These materials are chosen for their ability to mimic the softness and elasticity of human skin. Details like skin texture, pores, wrinkles, and even implanted eyebrows and eyelashes are added to enhance realism.
- Facial Expressions: Many humanoid robots use servo motors, micro motors, and hydraulic systems to create realistic facial expressions and movements. Some robots are designed to mimic the facial expressions of specific individuals by analyzing video footage of their expressions. These mechanisms enable robots to convey emotions like smiling, surprise, and curiosity.
- Movements: Humanoid robots feature articulated skeletons and flexible joints that enable fluid, lifelike movements and gestures. Motion capture technologies are also used to teach robots realistic gestures and poses.
- Artificial Intelligence (AI): AI is a critical component of humanoid robots, enabling them to engage in conversations, respond to voice commands, recognize emotions, and adapt to user preferences. Machine learning and neural networks are used to train robots to understand and reproduce human expressions, respond to voice commands, and maintain a dialogue. This allows robots to adapt to different situations and adjust their behavior accordingly.
- Customization: Many humanoid robots offer customizable features, including skin tones, eye colors, hairstyles, clothing, and accessories. Some robots have modular designs that allow for easy upgrades and replacements of individual components.
Examples of Humanoid Robots:
- Arya: Developed by Realbotics, Arya is known for her realistic appearance and advanced AI, which allows her to engage in smooth, natural conversations.
- Mia: Created by Shenzhen Fan Real Art Development, Mia is a hyperrealistic female robot designed with a natural skin texture and fluid movements.
- Ada: A hyperrealistic robot doll by Gynoid Dolls, designed with advanced AI and sophisticated sensors.
- Camila: A robot doll from For UD doll, notable for its hyperrealistic design and functional AI.
- Eva: A lifelike robot doll by Hu and Gynoid Dolls, created for companionship and professional use.
- Elisa: A hyperrealistic robot doll by Matt McMullen that features AI algorithms and sensors for expressive movements.
- Acha: From Engineered Arts, Acha is recognized for her ultra-realistic facial movements.
- Xiao B: A hyperrealistic female robot created by Fud Doall with a medical-grade silicone outer layer.
- Android U: A humanoid robot from Japan, known for her lifelike appearance, expressive gestures, and ability to sing.
- Android An: A robot designed for social settings, with realistic facial features and movements.
- Junko Chihira: A humanoid robot from Toshiba that serves as a receptionist and information provider.
- Hatsuki: A Japanese robot that combines robotics with anime aesthetics.
- Mangi: A hyperrealistic female robot by Double MX, featuring a customizable design.
- Mesmer: A line of robots from Engineered Arts, known for realistic interaction.
- Merang: A hyperrealistic female robot from RZR Doll, designed with medical-grade silicone and advanced AI.
- Makoto: A medical simulation robot designed to provide lifelike training for medical students.
- Allen: A robot equipped with AI, designed by Will Huff, and capable of interacting with its environment.
- Henry: A robot designed for companionship and social settings, with lifelike features and conversational skills.
- Cleo: A humanoid robot by Engineered Arts, known for its authentic movements and facial gestures.
- Yuki Kashiwagi: A humanoid robot modeled as a replica of a Japanese singer.
Materials and Manufacturing:
- Silicone: A commonly used material for creating robot skin due to its elasticity, durability, and ability to mimic the softness of human skin. Silicone can be molded, painted, and textured to match human skin tones and features.
- 3D Printing: Used for creating detailed facial features, molds for silicone skin, and components of the robot’s structure.
- Servo Motors: Used to create movement for facial expressions and gestures.
- Sensors: Used to detect emotions and other environmental inputs.
Applications of Humanoid Robots:
- Companionship: Humanoid robots are designed to serve as companions for single individuals, the elderly, or people with special needs.
- Entertainment: Robots are used in film, theater, and exhibitions, as well as for creating lifelike dolls and mannequins.
- Education and Training: Robots are used to create realistic learning and training environments, including medical simulations and robotics education.
- Customer Service: Humanoid robots can act as concierges, guides, and sales assistants in public places.
- Research: Robots serve as platforms for advancing AI and robotics technology and for studying human-robot interaction.
Ethical and Social Implications:
- The development of humanoid robots raises ethical questions regarding the creation of robots that may be indistinguishable from humans.
- Concerns exist about the potential misuse of these robots, such as for criminal purposes or deceiving surveillance systems.
- There are also concerns about the impact on human interaction, dependence on robots, and the protection of personal data when robots are designed to copy specific individuals.
The development of humanoid robots is a rapidly advancing field, with ongoing innovations in materials, AI, and robotics. As the technology continues to evolve, humanoid robots are expected to play an increasingly significant role in various aspects of society.

Realistic Robot Skin: Materials, Techniques, and Challenges

Realistic skin is a crucial aspect of creating lifelike humanoid robots, and much effort is put into making it look and feel as human as possible. Here’s a breakdown of the key elements and technologies used to achieve realistic skin in robots:

Materials Used for Realistic Skin
- Silicone is the most commonly used material for creating robot skin due to its high elasticity, durability, and ability to mimic the softness of human skin. It can stretch and shrink without breaking, making it ideal for covering moving parts of robots. Silicone is also resistant to temperature changes, moisture, and ultraviolet radiation, enhancing the durability of the robot. Additionally, it is considered safe for contact with human skin, as it is inert and does not cause allergic reactions.
- Thermoplastic Elastomer (TPE) is another material used to create skin for robots. TPE is known for its ability to mimic the softness and elasticity of human skin.
- Latex is another material used in the creation of artificial skin.
- Medical-grade silicone is used for its realistic feel, durability and hypoallergenic properties.
Key Features of Realistic Robot Skin
- Texture and Softness: The goal is to replicate the texture and softness of human skin as closely as possible. This is achieved by using materials like silicone and TPE that have the necessary elasticity and flexibility.
- Color and Appearance: Special pigments and dyes are mixed with silicone to match human skin tones. Additional layers of paint can be added to simulate capillaries, freckles, and other details typical of human skin.
- Fine Details: Techniques like 3D scanning are used to create molds that capture the fine details of human skin, including pores, wrinkles, and other small features. These details are then incorporated into the robot’s skin during the manufacturing process.
- Embedded Sensors: Advanced versions of silicone skin may have embedded sensors that respond to pressure, temperature, and touch. This allows the robot to not only look realistic but also to sense and respond to touch, making interactions more natural.
- Hair and Eyelashes: Synthetic materials are used to create realistic-looking hair and eyelashes which are then installed in the robot. These are processed and colored to look as real as possible.
Manufacturing Process
- Creating a Base Frame: A metal or plastic frame is created that serves as the basis for the robot. This frame is equipped with motors and mechanisms for movement.
- Creating Molds: Molds are created that correspond to the robot’s body parts, such as the face, arms, and legs.
- Applying Silicone: Silicone is poured into these molds and hardened, after which the molds are removed, leaving behind a flexible silicone shell.
- Adding Details: Small details, such as veins, wrinkles, and pores, are added manually or with special tools and 3D printing technologies.
- Painting and Treating: Once the silicone skin is installed, it can be painted or treated to make it look more realistic.
- Adding Hair and Eyelashes: Hair and eyelashes are installed to enhance the realistic appearance.
Advanced Techniques
- 3D Scanning and Printing: 3D scanning technology is used to capture detailed images of human skin, which are then used to create molds for silicone skin. 3D printing is also used to create parts of the robot’s structure and other components.
- Multi-Layered Silicone: To simulate the layers of human skin, several layers of silicone with different densities are used. The top layer is softer, resembling the epidermis, and the lower layers are denser, simulating the dermis and subcutaneous tissue.
- Living Skin: Researchers have developed a type of living skin for robots engineered from cultivated cells, which can self-regenerate after damage. This technology is still in development and has challenges related to hydration and nutrient supply.
Challenges and Limitations:
- Mimicking Facial Movements: Artificial skin doesn’t always move like real skin during expressions such as smiling. Researchers are working to improve this by creating more flexible and responsive materials, and by developing attachment methods that allow for more natural facial expressions.
- Durability: Some synthetic skin materials can loosen or sag from the framework of the robot, and they can also degrade or sustain damage, requiring repairs.
- Integration of Sensors: Integrating sensors within the artificial skin to simulate the human sense of touch is another area of ongoing research.
Examples of Robots with Realistic Skin:
- Arya‘s construction uses cutting-edge materials and techniques, granting her a remarkably human-like appearance with mechanisms that simulate realistic facial expressions and gestures.
- Mia is crafted from premium silicone offering a natural skin texture.
- Xiao B’s outer layer is crafted from medical-grade silicone that replicates the look and feel of human skin.
- Merang is crafted from medical-grade silicone, and her skin feels incredibly realistic, mimicking human texture and softness.
- Android U features silicone skin that mimics human texture.
- Acha is designed with a special silicone material that mimics the look and flexibility of human skin.
- Mesmer robots utilize skin crafted from a special silicone material that mimics the look and flexibility of human skin.
- Elisa is crafted with soft lifelike skin that mimics human texture with unparalleled accuracy.
In summary, creating realistic skin for robots involves a combination of advanced materials, manufacturing techniques, and artistic detailing. The use of silicone, 3D printing, and other technologies, along with attention to detail, has allowed for the creation of robots with remarkably lifelike skin. While challenges remain in achieving perfect realism, especially in areas like facial movement and durability, ongoing research and development are continuously pushing the boundaries of what is possible.

Hyperrealistic Robot Dolls: Technology, Ethics, and Design

Robot dolls are a rapidly evolving area of robotics, combining advanced technology with intricate design to create human-like companions and tools. These dolls often feature realistic appearances, advanced AI, and customizable options.

Key Features of Robot Dolls
- Lifelike Appearance: Many robot dolls are designed with a focus on hyperrealism, utilizing materials like silicone and thermoplastic elastomer to mimic the look and feel of human skin. They often include detailed facial features, such as implanted eyebrows and eyelashes.
- Artificial Intelligence: Robot dolls are often equipped with AI that allows them to interact with users. They can interpret voice commands, recognize emotions, and adapt their communication style to suit individual needs. Some can learn from interactions and tailor their responses to create personalized experiences.
- Customization: Many robot dolls offer extensive customization options, including hairstyles, eye colors, clothing, and accessories. Some have modular designs that allow for easy upgrades and replacements of individual components.
- Movement: These dolls often have highly flexible joints that enable them to move naturally, replicating human gestures with fluidity. Servo motors and internal metal frameworks help them to assume various positions and perform subtle movements like nodding or tilting their heads.
- Durability: Robot dolls are typically built from top-tier materials to ensure they maintain their realistic appearance over time. High-quality materials and modular designs allow for easy replacement or upgrades, ensuring a long lifespan.
Examples of Robot Dolls
- Ultra-Realistic Robotic Dolls (Japanese): These dolls are designed to serve as companion partners for single men and are crafted from thermoplastic elastomer to mimic human skin.
- Arya: Developed by Realbotics, Arya is a humanoid robot designed for realistic and engaging companion experiences. She uses high-quality materials like silicone and thermoplastic elastomer, and her AI allows her to engage in conversations and adapt to user preferences.
- Mia: From Shenzhen Fan Real Art Development, Mia is a hyperrealistic female robot made from premium silicone and featuring a metallic articulated skeleton for fluid movements. She can also be equipped with a voice module.
- For UD Doll Collection: This collection features remarkably lifelike appearances, state-of-the-art AI, and highly flexible joints. The dolls offer extensive customization and are built from top-tier materials for durability.
- Ada: Created by Gynoid Dolls, Ada is a hyperrealistic robot doll with advanced AI, sophisticated sensors, and flexible joints that allow smooth and realistic movements.
- Camila: Another creation by For UD Doll, Camila combines hyperrealistic design with advanced AI and sensors. Her modular design allows for customization and upgrades.
- Eva: From Hu and Gynoid Dolls, Eva features detailed facial features and soft, lifelike skin. Her modular structure offers customization options.
- Elisa: Developed by Matt McMullen, Elisa is a hyperrealistic robot doll with detailed facial features, soft skin, and advanced AI, designed for both personal and professional use.
- Xiao B: Created by Fud Doll, Xiao B is a hyperrealistic female robot made with medical-grade silicone and featuring a fully articulated skeleton.
- Mess: From Hannad Doll, Mess combines realism with basic AI and sensors, responding to voice commands and adapting to user preferences.
Purposes of Robot Dolls
- Companionship: Some robot dolls are designed to serve as companions, especially for individuals seeking a partner. They are intended to help reduce social isolation and provide support.
- Professional Demonstrations: Robot dolls are used for demonstrations, acting as engaging and interactive tools.
- Personal Collections: Many are designed for collectors and enthusiasts, featuring a high level of craftsmanship and detail.
- Entertainment and Advertising: Some robot dolls are designed for public exhibitions, advertising, or entertainment.
Materials and Technology
- Silicone: This material is often used to create realistic skin, providing a soft and flexible texture.
- Thermoplastic Elastomer: This material is also used for creating skin, known for its ability to mimic the softness and elasticity of human skin.
- AI and Sensors: Advanced AI and sensors are used to enable the dolls to understand voice commands, recognize emotions, and adapt to user preferences.
- Servo Motors: These are used to facilitate smooth facial expressions and natural gestures.
- Modular Designs: This allows for easy upgrades and customization, ensuring the dolls stay current with technological advancements.
Ethical and Social Implications:
- The creation of highly realistic robot dolls raises ethical questions about the nature of human-robot relationships, the potential for dependence on robots, and the preservation of human interaction.
Robot dolls represent a significant advancement in robotics, pushing the boundaries of technology and human interaction. They blend advanced materials and AI with detailed designs to create lifelike and interactive companions and tools.

ALL Japan’s Female Robots That Look Shockingly Realistic

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 20, 2025
Machine Learning: Linear Regression, Q Learning, and CNNs
These sources cover various aspects of machine learning and AI, ranging from fundamental concepts to practical implementations. They discuss different machine learning techniques like supervised, unsupervised, reinforcement learning, clustering (specifically K-means), linear and logistic regression, and anomaly detection. The sources also explore specific algorithms and models, including linear regression, support vector machines, artificial neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs) with LSTM, ridge regression, and lasso regression. Furthermore, they offer code examples and case studies using Python libraries such as scikit-learn, TensorFlow, and Keras, focusing on applications like image classification, stock price prediction, and face mask detection. The sources additionally discuss the evaluation and ranking of large language models (LLMs) using benchmarks and leaderboards, with an emphasis on Hugging Face, and introduces Meta’s Llama 3.2 for private local use.

Machine Learning and Neural Networks Study Guide

Quiz:
1. What is the difference between classification and regression in data science? Classification predicts a category (yes/no, true/false), while regression predicts a numerical quantity based on input features. Classification seeks to predict a discrete value and regression seeks to predict a continuous value.
2. Explain the concept of anomaly detection and provide an example. Anomaly detection identifies unusual patterns or data points that deviate significantly from the norm. Detecting fraudulent transactions or unusual stock market activity are good examples.
3. What is clustering, and how is it used in data science? Clustering is an unsupervised learning technique that groups data points with similar characteristics together. This is valuable for market segmentation or discovering hidden structures in data.
4. In linear regression, what do ‘m’ and ‘C’ represent in the equation y = mx + C? ‘m’ represents the slope of the regression line, indicating the rate of change in y for each unit change in x. ‘C’ represents the y-intercept, the point where the line crosses the y-axis.
5. What is a hyperplane, and how is it used in support vector machines (SVMs)? A hyperplane is a decision boundary that separates data points into different classes in an SVM. In higher dimensions, it is a generalization of a line or plane.
6. Describe the role of kernel in SVM. The kernel trick maps data into a higher-dimensional space where it is easier to separate, even if the data is not linearly separable in its original space. A linear kernel indicates the data is linearly separable.
7. Why is it necessary to format and pre-process data before using it in a machine-learning model? Pre-processing ensures data is in a suitable format for the model, handles missing values, and scales features to prevent bias. This increases the model’s performance and accuracy.
8. Explain the concept of temporal difference in Q-learning. Temporal difference learning is a method of learning by estimating the value function (Q-value) based on the difference between the current estimate and the new estimate of the Q-value, leveraging immediate rewards and the agent’s experience. The current reward which is observed from the environment in response to the current action.
9. In K-means clustering, what does the ‘K’ represent, and why is it important to choose an appropriate value for ‘K’? ‘K’ represents the number of clusters to form in the data. Choosing the right value is crucial because it directly affects how the data is grouped and can significantly impact the interpretability and usefulness of the clusters.
10. Explain the elbow method in the context of K-means clustering. The elbow method is a heuristic used to determine the optimal number of clusters (‘K’) by plotting the within-cluster sum of squares (WCSS) against different values of K. The “elbow” point on the graph, where the rate of decrease in WCSS slows down, suggests a good balance between cluster compactness and the number of clusters.
Answer Key:
1. Classification predicts a category (yes/no, true/false), while regression predicts a numerical quantity based on input features. Classification seeks to predict a discrete value and regression seeks to predict a continuous value.
2. Anomaly detection identifies unusual patterns or data points that deviate significantly from the norm. Detecting fraudulent transactions or unusual stock market activity are good examples.
3. Clustering is an unsupervised learning technique that groups data points with similar characteristics together. This is valuable for market segmentation or discovering hidden structures in data.
4. ‘m’ represents the slope of the regression line, indicating the rate of change in y for each unit change in x. ‘C’ represents the y-intercept, the point where the line crosses the y-axis.
5. A hyperplane is a decision boundary that separates data points into different classes in an SVM. In higher dimensions, it is a generalization of a line or plane.
6. The kernel trick maps data into a higher-dimensional space where it is easier to separate, even if the data is not linearly separable in its original space. A linear kernel indicates the data is linearly separable.
7. Pre-processing ensures data is in a suitable format for the model, handles missing values, and scales features to prevent bias. This increases the model’s performance and accuracy.
8. Temporal difference learning is a method of learning by estimating the value function (Q-value) based on the difference between the current estimate and the new estimate of the Q-value, leveraging immediate rewards and the agent’s experience. The current reward which is observed from the environment in response to the current action.
9. ‘K’ represents the number of clusters to form in the data. Choosing the right value is crucial because it directly affects how the data is grouped and can significantly impact the interpretability and usefulness of the clusters.
10. The elbow method is a heuristic used to determine the optimal number of clusters (‘K’) by plotting the within-cluster sum of squares (WCSS) against different values of K. The “elbow” point on the graph, where the rate of decrease in WCSS slows down, suggests a good balance between cluster compactness and the number of clusters.
Essay Questions:
1. Discuss the importance of understanding the domain in which a machine learning model is being applied. How can domain knowledge influence data pre-processing, model selection, and interpretation of results, citing examples from the provided sources?
2. Compare and contrast Ridge and Lasso regression. Under what circumstances would you choose one over the other, and what are the key differences in their mathematical formulations and effects on model coefficients?
3. Explain the challenges associated with vanishing and exploding gradients in recurrent neural networks (RNNs). How do Long Short-Term Memory (LSTM) networks address the vanishing gradient problem, and what are the key components of an LSTM cell that enable it to learn long-term dependencies?
4. Describe the Q-learning algorithm in detail, including the roles of exploration vs. exploitation, the temporal difference update rule, and the Q-table. How can Q-learning be applied to solve reinforcement learning problems in various environments?
5. Explain the process of building and training a convolutional neural network (CNN) for image classification, including data augmentation techniques, the role of different layers (convolutional, pooling, dense), activation functions, and optimization algorithms.
Glossary of Key Terms:
- Classification: A type of supervised learning where the goal is to predict the category or class to which a data point belongs.
- Regression: A type of supervised learning where the goal is to predict a continuous numerical value.
- Anomaly Detection: Identifying data points or patterns that deviate significantly from the normal behavior of a dataset.
- Clustering: An unsupervised learning technique that groups similar data points together based on their inherent characteristics.
- Linear Regression: A statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
- Slope: The rate of change of a line, indicating how much the dependent variable changes for each unit change in the independent variable.
- Y-Intercept: The point where a line crosses the y-axis, representing the value of the dependent variable when the independent variable is zero.
- Hyperplane: A generalization of a line or plane to higher dimensions, used as a decision boundary to separate data points in different classes.
- Support Vector Machine (SVM): A supervised learning algorithm that finds the optimal hyperplane to separate data points into different classes, maximizing the margin between the classes.
- Kernel: A function that maps data into a higher-dimensional space to make it easier to separate using a linear classifier, even if the data is not linearly separable in its original space.
- Data Pre-processing: Preparing raw data for use in a machine learning model by cleaning, transforming, and scaling the data.
- Q-Learning: A reinforcement learning algorithm that learns an optimal policy by estimating the Q-value, which represents the expected reward for taking a specific action in a given state.
- Temporal Difference (TD) Learning: A method of learning by bootstrapping from the current estimate of the value function, updating it based on the difference between the current estimate and the new estimate.
- Exploration vs. Exploitation: The trade-off in reinforcement learning between exploring new actions to discover potentially better strategies and exploiting known actions to maximize immediate rewards.
- Q-Table: A table that stores the Q-values for all possible state-action pairs, used by the agent to make decisions in Q-learning.
- K-Means Clustering: An unsupervised learning algorithm that partitions data points into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).
- Elbow Method: A heuristic used to determine the optimal number of clusters (K) in K-means clustering by plotting the within-cluster sum of squares (WCSS) against different values of K.
- Ridge Regression: A linear regression technique that adds a penalty term to the loss function to prevent overfitting, shrinking the coefficients towards zero.
- Lasso Regression: A linear regression technique that adds a penalty term to the loss function to prevent overfitting, forcing some of the coefficients to be exactly zero, effectively performing feature selection.
- Recurrent Neural Network (RNN): A type of neural network designed to process sequential data, maintaining a hidden state that is updated at each time step based on the input and the previous hidden state.
- Vanishing Gradient Problem: A challenge in training RNNs where the gradients become too small, preventing the network from learning long-term dependencies.
- Exploding Gradient Problem: A challenge in training RNNs where the gradients become too large, causing the network to become unstable and diverge.
- Long Short-Term Memory (LSTM): A type of RNN architecture designed to address the vanishing gradient problem and learn long-term dependencies, using memory cells and gates to regulate the flow of information.
- Convolutional Neural Network (CNN): A type of neural network commonly used for image classification, using convolutional layers to extract features from images and pooling layers to reduce dimensionality.
- Data Augmentation: Techniques used to artificially increase the size of a training dataset by applying transformations such as rotations, flips, and translations to existing images.
- Activation Function: A function that introduces non-linearity into a neural network, enabling it to learn complex patterns in the data.
- Optimization Algorithm: An algorithm used to adjust the weights and biases of a neural network during training, minimizing the loss function and improving the model’s performance.
- Softmax: Output layer that gives probability distribution over all the output classes.
- ReLU (Rectified Linear Unit): A common activation function used in neural networks, defined as f(x) = max(0, x).
- Epoch: A complete pass through the entire training dataset during the training of a machine learning model.
- Overfitting: A phenomenon where a machine learning model learns the training data too well, resulting in poor performance on unseen data.
- Gradient Descent: Algorithm used to minimize the loss function to find the optimal parameters for the model.
- Stochastic Gradient Descent: A gradient descent optimization algorithm where the gradient is estimated based on a single random sample rather than the entire dataset.
- Cross-Entropy Loss: A loss function commonly used in classification problems, measuring the difference between the predicted probability distribution and the true distribution.
- Batch Normalization: A technique used to normalize the inputs to each layer in a neural network, improving training speed and stability.
- Tensor: The multi-dimensional array of data.
- TensorFlow: Open source machine learning platform.
- Keras: Machine learning API.
- Pandas: Library for data manipulation and analysis.
- NumPy: Library for numerical computing.
- Matplotlib: Library for data visualization.
- Seaborn: Library for data visualization based on matplotlib.
- Scikit-learn: Library for machine learning algorithms.
- CSV: Comma separated values file extension.
- API: Application programming interface that allows different systems to communicate.
- Python: General purpose programming language.
- Jupyter Notebook: Interactive coding web application.
- Anaconda: Collection of open source packages.
- Hugging Face: Natural language processing framework.
- Transformers: Deep learning models from Hugging Face.
- Librosa: Python package for music and audio analysis.
- OpenAI: Artificial intelligence research company.
- Embeddings: Representing words or phrases as numerical vectors that can be used in machine learning models.
- Normalization: Transform the values of numeric data to a standard range.
- RNN gates: LSTM models have input, forget, and output gates.
- Loss Function: Measures difference between actual and predicted results.
Machine Learning: Concepts, Algorithms, and Applications

Okay, I will synthesize the provided documents into a briefing document summarizing the main themes, important concepts, and notable quotes.

Briefing Document: Machine Learning Concepts and Applications

Overview:

This document synthesizes information from a variety of sources on machine learning (ML) concepts and their applications. The sources cover a range of topics from fundamental algorithms like linear regression and K-means clustering to more advanced neural networks such as Support Vector Machines (SVMs), Q-learning and Recurrent Neural Networks (RNNs) and the use of tools like Python, and libraries such as NumPy, Pandas, Scikit-learn, TensorFlow, and Keras to implement and evaluate these models.

1. Fundamental Machine Learning Concepts:
- Classification vs. Regression: The document highlights the core distinction between classification and regression tasks.
- Classification: Categorizes data into discrete classes (e.g., “whether the stock price will increase or decrease”). The desired output is a “yes no 01” answer.
- Regression: Predicts a continuous quantity (e.g., “predicting the age of a person based on the height weight health and other factors”).
- Anomaly Detection: Identifying unusual patterns or outliers in data. This is described as “very big in data science these days” with applications like detecting fraudulent money withdrawals or identifying unusual stock market behavior.
- Clustering: Discovering structure in unlabeled data by grouping similar data points together. Example: “finding groups of customers with similar Behavior given a large database of customer data containing their demographics and past buying records.”
2. Core Algorithms and Techniques:
- Linear Regression:The document explains how to calculate the “best fit line” by finding the slope (m) and y-intercept (c) of the equation y = mx + c.
- The formula for calculating the slope (m) is given as: “m equal the sum of x – x average * y – y average or y means and X means over the sum of x – x means squared”. The text emphasizes that “the linear regression model should go through that dot” referring to the mean of both the x and y values.
- Support Vector Machines (SVM):SVMs are used for classification by finding a hyperplane that best separates data points into different classes. The goal is to maximize the distance between the hyperplane and the nearest data points (the “maximum distance margin”).
- The document uses the example of classifying muffin and cupcake recipes based on ingredients like flour, milk, sugar, and butter. It notes that “muffins have more flour while cupcakes have more butter and sugar.” The tutorial uses Python’s scikit-learn library (sklearn) to implement an SVM classifier.
- The document points out that the “caborn sits on top of map plot Library just like pandas hits on numpy so it adds a lot more features and uses and control”.
- K-Means Clustering:An unsupervised learning algorithm used to group data points into K clusters based on their proximity to cluster centers.
- The “elbow method” is mentioned as a way to determine the optimal number of clusters (K) by plotting the within-cluster sum of squares (WCSS) and looking for the “elbow joint” in the graph.
- A use case is provided to “Cluster cars into Brands using parameters such as horsepower cubic inches make year Etc.”
- K-Nearest Neighbors (KNN):A classification algorithm that classifies a data point based on the majority class of its K nearest neighbors.
- The Euclidean distance formula is used to determine the distance between data points: “distance D equals the square root of x – a squared + y – b squared”
- The example provided is “predict whether a person will be diagnosed with diabetes or not”.
- Ridge and Lasso Regression:Regularization techniques used to prevent overfitting in linear models.
- Ridge Regression: Adds a penalty term proportional to the sum of the squares of the coefficients.
- Lasso Regression: Adds a penalty term proportional to the sum of the absolute values of the coefficients.
- The document notes: “Ridge regularization is useful when we have many variables with relatively smaller data samples… The Lasso regularization model is preferred when we are fitting a linear model with fewer variables.”
- Q-Learning:A reinforcement learning algorithm used to learn an optimal policy for an agent interacting with an environment.
- The core concept is the “Q-table,” which is a “repository of rewards basically which is associated with the optimal actions for each state in a given environment.”
- The “temporal difference” is mentioned as a way to calculate the Q values, comparing the “current state and action values with the previous one.”
- The “Belman Ford equation” is described as a “recursive equation” used to calculate the value of a given state and determine its optimal position.
- The algorithm involves balancing “exploration and exploitation” to find the best course of action.
- Alpha is “a step length basically which is here taken to estimate the update estimation of Q of s OFA”. Gamma is a discount factor where it “should be greater than or equal to zero or it can be less than equal to 1”.
- Recurrent Neural Networks (RNNs) and LSTMs:RNNs are designed to process sequential data by maintaining a hidden state that is passed from one time step to the next.
- The document discusses the “Vanishing gradient problem” and “exploding gradient problem” that can occur during RNN training.
- “When the slope is too small the problem is known as Vanishing gradient”
- “When the slope tends to grow exponentially instead of decaying this problem is called exploding gradient”
- Solutions for the exploding gradient problem include: identity initialization, truncate the back propagation, and gradient clipping.
- Solutions for the Vanishing gradient problem include: weight initialization, choosing the right activation function, and long short-term memory networks.
- Long Short-Term Memory (LSTM) networks are a special type of RNN capable of learning long-term dependencies.
- The document describes a use case of predicting stock prices using an LSTM network.
3. Software and Tools:
- Python: The primary programming language used for implementing machine learning models.
- NumPy: A library for numerical computing, providing support for arrays and mathematical functions. “Numpy is a python Library used for working with arrays”.
- Pandas: A library for data manipulation and analysis, providing data structures like DataFrames. “pandas is a software Library written for the Python programming language for the data manipulation and Analysis”.
- Scikit-learn (sklearn): A library providing machine learning algorithms and tools for tasks such as classification, regression, and clustering.
- TensorFlow: A deep learning framework developed by Google. “Tensor flow became the open source for it”.
- Keras: A high-level neural networks API that runs on top of TensorFlow.
4. Best Practices and Considerations:
- Data Preprocessing: The document emphasizes the importance of data preprocessing steps such as scaling features to a uniform range (e.g., between -1 and 1) to avoid biases due to large numbers.
- Model Evaluation: Various metrics are used to evaluate the performance of machine learning models, including:
- Confusion Matrix.
- F1 Score.
- Accuracy.
- Mean Squared Error (MSE).
- Importance of Domain Knowledge: The document highlights that the domain the model is working in is important. It might help the doctor know where to look just by understanding what kind of tumor it is, so it might help them or Aid them in something they missed from before.
5. Case Studies and Applications:
- Tumor Classification: Classifying tumors as malignant or benign.
- Diabetes Prediction: Predicting whether a person will be diagnosed with diabetes.
- Stock Price Prediction: Using LSTM networks to predict stock prices.
- Speech-to-Text Recognition: Mentioning “hugging face for this piece to text recognition”.
Conclusion:

The sources underscore the breadth of machine learning techniques and their applicability across diverse domains. A strong understanding of the fundamental concepts, algorithms, and the appropriate use of software tools are vital to successfully applying machine learning in solving real-world problems. The need for domain expertise when developing ML models is also emphasized.

Machine Learning and Neural Networks: Answering Common Questions

Machine Learning & Neural Network FAQ

1. What is the difference between classification and regression in data science?

Classification involves categorizing data into predefined classes (e.g., “yes/no” or “increase/decrease”), providing a discrete output. Regression, on the other hand, predicts a continuous quantity (e.g., age based on height and weight). They are two of the major divisions in machine learning.

2. What are some common applications of anomaly detection?

Anomaly detection identifies unusual patterns or outliers in data. Common applications include detecting fraudulent money withdrawals, identifying stock market irregularities to adjust trading strategies, and pinpointing unusual activity in network security.

3. How does clustering work, and what is its purpose?

Clustering is an unsupervised learning technique that discovers inherent structures in data by grouping similar data points together. This is useful for tasks like customer segmentation based on demographics and buying behavior, allowing for targeted marketing strategies.

4. How does linear regression work, and what are its key components?

Linear regression models the relationship between variables using a straight line. Key components include calculating the mean of the x and y values, determining the slope (m) and y-intercept (c) of the line using formulas involving sums of differences from the means (y = mx + c), and ensuring the regression line passes through the point representing the means of x and y.

5. What is a Support Vector Machine (SVM), and how does it classify data?

A Support Vector Machine (SVM) is a supervised learning algorithm used for classification. It finds the optimal hyperplane that maximizes the margin between different classes in a dataset. New data points are then classified based on which side of the hyperplane they fall. In higher dimensions, the hyperplane becomes a multi-dimensional cut to best separate the data.

6. How does the K-Nearest Neighbors (KNN) algorithm work?

KNN classifies a new data point based on the majority class of its ‘k’ nearest neighbors in the feature space. The distance between data points is often calculated using Euclidean distance. The choice of ‘k’ is crucial; a smaller ‘k’ can lead to overfitting, while a larger ‘k’ might smooth out important decision boundaries.

7. What is Q-learning, and what are the key elements of the Q-learning update rule?

Q-learning is a reinforcement learning algorithm where an agent learns to make optimal decisions in an environment by estimating the Q-value, which represents the expected reward for taking a specific action in a specific state. Key elements in the update rule include: the current state (s), the action taken (a), the immediate reward (R), a discount factor (gamma) for future rewards, and a learning rate (alpha) to determine the step size for updating the Q-value.

8. What is the “vanishing gradient” problem in recurrent neural networks (RNNs) and what are some solutions?

The vanishing gradient problem occurs during RNN training when gradients become extremely small as they are backpropagated through time. This makes it difficult for the network to learn long-term dependencies. Solutions include: identity initialization, truncating back propagation, gradient clipping, weight initialization, choosing the correct activation function, and using Long Short-Term Memory (LSTM) networks.

Machine Learning: Concepts, Types, Applications, and Algorithms

Machine learning is a universe where machines learn, adapt, and make decisions similar to humans. It involves training machines to learn from past data, enabling them to understand and reason, and to perform tasks much faster than humans.

Core Concepts and Types of Machine Learning:
- Supervised Learning: This involves training a model using labeled data, where the machine learns the association between features and labels. For example, a model can learn to predict the currency of a coin based on its weight, using weight as the feature and currency as the label. Common algorithms used include Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) for tasks like image classification and language translation.
- Unsupervised Learning: This type uses unlabeled data to identify patterns. The machine identifies patterns and groups data points into clusters without prior labels. An example includes clustering cricket players into batsmen and bowlers based on their scores and wickets taken, without pre-defined labels. Autoencoders and generative models are used for tasks like clustering and anomaly detection.
- Reinforcement Learning: A reward-based learning system based on feedback. The system learns from positive or negative feedback to correctly classify data. Deep Q-Networks are used for tasks like robotics and gameplay.
Key Steps in Machine Learning:
1. Define Objective: Determine what you want to predict.
2. Collect Data: Gather data relevant to the prediction objective.
3. Prepare Data: Clean the collected data to ensure its quality.
4. Select Algorithm: Choose the appropriate machine learning algorithm.
5. Train Algorithm: Train the selected algorithm with the prepared data.
6. Test Model: Validate the model to ensure it works.
7. Run Prediction: Apply the model to make predictions.
8. Deploy Model: Implement the model for real-world applications.
Applications of Machine Learning:
- Healthcare: Machine learning is used to predict diagnostics and analyze medical images for early disease detection.
- Finance: It is applied in fraud detection and analyzing bank data for suspicious transactions.
- E-commerce: Used to predict customer churn.
- Transportation: Machine learning powers real-time differential pricing based on demand and predictive modeling to predict high-demand areas. It is also used in self-driving cars to detect objects and make driving decisions.
- Natural Language Processing (NLP): Machine learning enables sentiment analysis, language translation, and text generation, which are used in virtual assistants and chatbots.
Example Algorithms
- Linear Regression: Assumes a linear relationship between input and output variables.
- Decision Tree: Uses a tree-like structure to make decisions based on data features.
- Support Vector Machine: Creates a separation line to divide classes in the best possible way.
- K-Nearest Neighbors (KNN): Classifies data based on feature similarity and the categories of its nearest neighbors.
- Deep Learning: Uses neural networks to automatically discover representations from raw data, ideal for image recognition and speech recognition.
Supervised vs. Unsupervised Learning:
- Supervised Learning: Uses labeled data with direct feedback and predicts outcomes.
- Unsupervised Learning: Uses unlabeled data, finds hidden structures, and groups data.
Divisions of Machine Learning
- Classification: Predicts a category, like whether a stock price will increase or decrease.
- Regression: Predicts a quantity, such as predicting the age of a person based on health factors.
- Anomaly Detection: Detects unusual patterns, such as detecting fraudulent money withdrawals.
- Clustering: Discovers structure in data, such as grouping customers with similar behavior.
Additional considerations:
- LLM Benchmarks: Standardized tools are used to evaluate the performance of large language models (LLMs).
- LLM Leaderboards: Rankings of LLMs are based on benchmark scores.
- Ethical Concerns: Deep learning techniques can be used to create deepfakes, raising ethical concerns regarding misinformation and digital manipulation.
Linear Regression: Concepts, Formula, and Implementation

Linear regression is a well-known and understood algorithm in statistics and machine learning. It models a linear relationship between input variables (X) and a single output variable (Y).

Core Concept
- Linear regression assumes a linear relationship between input variables (X) and a single output variable (Y).
- The goal is to find the line that best fits the data points and describes the relationship between the two variables.
Formula
- The linear regression model is represented by the equation y = mx + C.
- y = dependent variable
- x = independent variable
- m = coefficient, representing the slope of the line
- C = the Y-intercept
Positive and Negative Relationships
- Positive Relationship: As the input variable (x) increases, the output variable (y) also increases, resulting in a positive slope.
- Negative Relationship: As the input variable (x) increases, the output variable (y) decreases, resulting in a negative slope.
Mathematical Implementation To calculate the exact line for linear regression:
1. Calculate the Mean: Find the mean (average) of both the x values (x̄) and the y values (ȳ).
2. Regression Equation: Determine the slope (m) and the y-intercept (c) for the equation y = mx + c.
- m = Σ[(x – x̄) * (y – ȳ)] / Σ(x – x̄)²
1. Calculate the Value of c: c = ȳ – m * x̄. The linear regression line should pass through the mean value.
2. Plot the Regression Line: Use the equation y = mx + c to plot the regression line.
3. Compute New Values: Use the derived equation to compute predicted values of Y (ŷ).
Error Minimization
- Calculate the error, which is the difference between the predicted values and the actual values.
- Minimize this error to improve the model. Methods include Sum of Squared Errors, Sum of Absolute Errors, and Root Mean Square Error.
Fitting the Data
- Data fitting involves plotting data points and drawing the best-fit line to understand variable relationships.
- Mean Square Error (MSE), also known as the loss function, is used to calculate the average squared difference between the predicted and actual values.
Bias and Variance
- Bias occurs when the algorithm has limited flexibility and oversimplifies the model.
- Variance defines the algorithm’s sensitivity to specific data sets.
Regularization
- Regularization techniques calibrate linear regression models, minimize the adjusted loss function, and prevent overfitting or underfitting.
- Ridge Regression: Adds a penalty equivalent to the sum of the squares of the magnitude of coefficients to the loss function.
- Lasso Regression: Adds a penalty equivalent to the absolute value of the magnitude of coefficients to the loss function.
When to Use Ridge vs. Lasso
- Ridge Regularization: Useful with many variables and relatively smaller data samples. It does not force coefficients to zero but makes them closer to zero.
- Lasso Regularization: Preferred when fitting a linear model with fewer variables and encourages coefficients to go toward zero.
Reinforcement Learning: Concepts, Strategies, and Applications

Reinforcement learning is a subfield of machine learning focused on training a model to make a sequence of decisions in an environment to achieve an optimal solution for a problem. It enables machines to learn by themselves through trial and error, rather than relying solely on human instruction or labeled data.

Key Concepts and Components
- Agent: The model being trained to perform actions within the environment. The agent can be a neural network or use a Q table, or a combination of both.
- Environment: The training situation in which the agent operates and which the model must optimize.
- Action: A step taken by the model within the environment. The agent selects one action from the possible steps it can take.
- State: The current condition or position returned by the model, providing information about the environment.
- Reward: Points given to the model to reinforce desired actions and optimize behavior.
- Policy: Determines how an agent will behave at a given time, mapping actions to the present state and guiding decision-making.
Learning Strategies
- Trial and Error: The agent explores different actions and learns from the outcomes, adjusting its strategy to maximize rewards.
- Exploration vs. Exploitation: Balancing exploration of new actions with exploitation of known rewarding actions is crucial for effective learning. Exploration involves random actions to discover new possibilities, while exploitation uses existing knowledge to maximize rewards.
Types of Learning
- Unlike supervised learning, reinforcement learning does not rely on labeled data or pre-specified output values.
- It also differs from unsupervised learning, which focuses on finding patterns in unlabeled data without explicit rewards.
Markov Decision Process (MDP)
- Reinforcement learning uses the Markov Decision Process to map a current state to an action, with the agent continuously interacting with the environment to produce new solutions and receive rewards.
- The MDP involves interactions between the agent and the environment, where the environment provides a reward and state, and the agent takes an action based on a policy.
Q-Learning
- Q-learning is a type of reinforcement learning that enables a model to iteratively learn and improve over time by taking optimal action selection policies.
- It uses Q values, defined for states and actions, to estimate how good it is to take an action at a given state.
- Temporal Difference (TD) update rule is used to iteratively compute the estimation of Q values.
- A Q table serves as a repository of rewards associated with optimal actions for each state, guiding the agent in decision-making.
Applications
- Robotics: Reinforcement learning is used to train robots to perform tasks by learning from feedback and optimizing their actions.
- Game Playing: Reinforcement learning algorithms can learn to play games by trial and error, achieving high levels of performance.
- Resource Management: It is used for optimizing resource allocation and decision-making in complex systems.
- Autonomous Vehicles: Deep reinforcement learning contributes to autonomous vehicles by training them to make driving decisions based on sensor data and rewards.
Limitations and Considerations
- High Computational Requirements: Training reinforcement learning models can be computationally intensive and time-consuming, especially for complex problems.
- Infant Stage: Reinforcement learning is still in its early stages of development, particularly in solving complex, real-world problems.
- Reward System Design: Devising an effective reward system is critical for guiding the agent’s learning process and achieving desired outcomes.
- Exploration Challenges: Reinforcement learning models often explore many different directions, which can require significant processing time.
RNN
- Recurrent Neural Networks (RNNs) are designed to process sequential data, like time series, speech, and text, by using a hidden state that passes from one time step to the next.
- Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning long-term dependencies and remembering information over extended periods. LSTMs use gates (input, forget, and output) to control the flow of information and selectively retain or discard information.
Neural Networks and Deep Learning: An Overview

Neural networks are a cornerstone of deep learning, inspired by the structure and function of the human brain. They consist of interconnected artificial neurons that process information to solve complex problems.

Core Components and Structure
- Artificial Neurons: Neural networks simulate the human brain using artificial neurons, which receive inputs, process them, and produce an output. These neurons are interconnected and organized in layers.
- Layers:Input Layer: Receives data from external sources.
- Hidden Layers: Perform complex transformations on the input data. A network can have one or more hidden layers.
- Output Layer: Produces the final result or prediction.
- Connections and Weights: Each connection between neurons has a weight, which is adjusted during training to optimize the network’s performance.
- Activation Functions: Every neuron contains an activation function that determines whether it should be “fired” or activated, thereby influencing the output. Common activation functions include ReLU and Sigmoid.
- Perceptron: A basic unit of a neural network, consisting of at least one neuron, used for binary classification.
How Neural Networks Work
1. Input Processing: The input layer receives data, which is then passed through the hidden layers.
2. Weighted Sum: Each neuron computes a weighted sum of its inputs and applies an activation function to produce an output.
3. Training: The network adjusts the weights of the connections to optimize performance. This process involves feeding data through the network, comparing the output to the expected result, and updating the weights and biases based on the error.
4. Backpropagation: The error between the predicted and actual outputs is fed back through the network to adjust the weights and biases. This process continues iteratively until the error is minimized.
5. Minimizing Error: Neural network training involves iteratively updating weights and biases to minimize the error between predicted and actual outputs.
6. Gradient Descent: An optimization technique used to find the global minimum of the cost function, helping the network identify the optimal weights and biases.
Types of Neural Networks
- Feedforward Neural Networks (FNN): The simplest type, where information flows linearly from input to output. They are used for image classification, speech recognition, and natural language processing.
- Convolutional Neural Networks (CNN): Designed for image and video recognition, CNNs automatically learn features from images, making them ideal for object detection and image segmentation.
- Recurrent Neural Networks (RNN): Specialized for processing sequential data like time series and natural language. They maintain an internal state to capture information from previous inputs, making them suitable for speech recognition and language translation.
- Deep Neural Networks: Neural networks with multiple layers that can automatically learn features from data, making them suitable for image recognition, speech recognition, and natural language processing.
- Deep Belief Networks
- Generative Adversarial Networks (GANs): Used for synthesizing images, music, or text.
Applications of Deep Learning
- Autonomous Vehicles: Deep learning algorithms process data from sensors and cameras to detect objects, recognize traffic signs, and make driving decisions in real-time.
- Healthcare Diagnostics: Deep learning models analyze medical images such as X-rays, MRIs, and CT scans to help in the early detection and diagnosis of diseases like cancer.
- Natural Language Processing (NLP): Deep learning models like Transformer architectures have led to more sophisticated text generation, translation, and sentiment analysis.
- Robotics: Neural networks are used to develop human-like robots.
- Predictive Maintenance: Deep learning models predict equipment failures in industries like manufacturing and aviation by analyzing sensor data.
Advantages and Disadvantages
- Advantages:High Accuracy: Achieve state-of-the-art performance in tasks like image recognition and natural language processing.
- Automated Feature Engineering: Automatically discover and learn relevant features from data without manual intervention.
- Scalability: Can handle large, complex datasets and learn from massive amounts of data.
- Disadvantages:High Computational Requirements: Require significant data and computational resources for training.
- Large Labeled Datasets: Often require extensive labeled datasets for training, which can be costly and time-consuming.
- Overfitting: Can overfit to training data, leading to poor performance on new, unseen data.
Tools and Platforms
- TensorFlow: An open-source platform created by Google, widely used for developing deep learning applications. It supports multiple languages, with Python being the most common.
- Keras: A high-level API written in Python that simplifies the implementation of neural networks. It uses deep learning frameworks like TensorFlow as a backend to make computation faster and provides a user-friendly front end.
- PyTorch: Another deep learning framework.
Key Considerations
- Data Preprocessing: Essential for ensuring that the data is properly scaled and formatted for training.
- Hyperparameter Tuning: Optimizing model parameters to improve performance.
- Confusion Matrices: Useful tools for measuring the performance of a classifier in detail, showing where the model is making mistakes.
Data Analysis: Process, Tools, and Applications

Data analysis involves a process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.

Here’s a breakdown of key aspects of data analysis, drawing from the sources:
- Objective Definition: A crucial initial step is defining the objective to guide the subsequent steps. Knowing what needs to be predicted is very important.
- Data Collection: This involves gathering relevant data that matches the defined objectives. A significant amount of time in data science is spent collecting data.
- Data Preprocessing: Preparing the data to ensure its quality is very important.
- Cleaning involves handling missing values and outliers, as well as removing special characters, links, mentions, hashtags, and stop words from text.
- It may also be important to address biases in the data. Scaling data, for instance, can help eliminate bias by normalizing values.
- Tokenization and lemmatization reduce words to their base form.
- Algorithm Selection: This step includes selecting the appropriate algorithm, and training it with the prepared data.
- Model Testing: Testing the model to validate its performance and determine its effectiveness for the task at hand.
- Prediction and Deployment: Once the model is tested and validated, it is deployed to make predictions on new data.
- Types of Prediction:
- Classification: Categorizing data, like predicting if a stock price will increase or decrease.
- Regression: Predicting a quantity, such as predicting a person’s age based on various factors.
- Anomaly Detection: Identifying unusual patterns or outliers, for example, detecting fraudulent money withdrawals.
- Clustering: Discovering structure in unexplored data by grouping similar data points together, such as finding customer segments with similar behavior.
- Tools and Techniques:
- Python: A popular programming language for data science.
- Libraries: NumPy, pandas, scikit-learn, matplotlib, and Seaborn are commonly used libraries.
- NumPy is used for numerical computations and array manipulation.
- Pandas provides data structures like DataFrames for easy data manipulation and analysis.
- Scikit-learn (sklearn) offers various machine learning algorithms and tools for model selection, training, and evaluation.
- Matplotlib and Seaborn are used for data visualization and creating plots.
- Jupyter Notebooks: Interactive environments for coding, documentation, and visualization.
- Confusion Matrix: A tool to evaluate the performance of a classification model by breaking down correct and incorrect classifications.
- Heat Maps: Use color-coding to visualize data, offering a quick way to identify patterns and correlations between variables.
- Key Considerations:
- Data Quality: Ensuring data is accurate, complete, and relevant to avoid misleading results. “Good data in, good answers out; bad data in, bad answers out”.
- Overfitting: Models that are too closely fit to the training data may perform poorly on new data.
- Underfitting: Models that are too simple fail to capture the underlying patterns in the data.
- Applications:
- Marketing: Grouping customers based on behavior to improve targeting.
- Finance: Detecting anomalies in financial transactions.
- Healthcare: Predicting disease diagnoses based on patient data.
- Business: Optimizing operations, forecasting sales, and understanding customer behavior.
- Customer Segmentation: Identifying distinct groups based on purchasing behavior and demographics.
- Sentiment Analysis: Determining the sentiment expressed in text data, such as social media posts.
- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) can simplify data sets, reduce computation time, remove redundancy, and improve data visualization. PCA combines variables, determines the best perspective, and reduces the number of features needed for analysis.
Data analysis is an iterative process. It may be necessary to revisit earlier steps as new insights emerge or as the data reveals unexpected patterns.

Machine Learning Full Course 2025 | Machine Learning Tutorial For Beginners | Simplilearn

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 18, 2025
AI, Machine Learning, and Deep Learning Essentials
The provided document serves as a comprehensive educational resource on artificial intelligence, machine learning, and deep learning. It starts with basic definitions and progresses to cover advanced topics like neural networks, language processing, and computer vision. The material discusses algorithms, techniques, and tools used in AI development, highlighting real-world applications across various industries such as healthcare, finance, and retail. It emphasizes the importance of ethical considerations, responsible AI practices, and the skills needed to pursue a career in this evolving field. Practical examples and code snippets are included, with a strong focus on using Python and popular libraries like TensorFlow. The document also compares different learning methods, such as supervised, unsupervised, and reinforcement learning.

Artificial Intelligence and Deep Learning Study Guide

Quiz

Instructions: Answer the following questions in 2-3 sentences each.
1. What is NumPy, and why is it essential in machine learning?
2. Explain the difference between stemming and lemmatization in natural language processing.
3. What is an activation function in the context of artificial neural networks, and what role does it play?
4. Describe the purpose and function of a “dense layer” in a neural network.
5. What are stop words, and why are they removed in NLP tasks?
6. Explain the purpose of a document term matrix (DTM) in natural language processing.
7. Describe the basic structure and function of a single artificial neuron (perceptron).
8. What are exploding and vanishing gradients, and how can they affect the training of recurrent neural networks (RNNs)?
9. What are LSTMs and how do they address the limitations of traditional RNNs?
10. Explain the roles of the generator and discriminator in generative adversarial networks (GANs).
Quiz Answer Key
1. NumPy is a Python library primarily used for numerical computations, providing support for multi-dimensional arrays and mathematical functions. It is crucial in machine learning for efficient data manipulation and mathematical operations necessary for training models.
2. Stemming and lemmatization are techniques in NLP to reduce words to their root form. Stemming uses heuristics to chop off prefixes or suffixes, while lemmatization considers the word’s meaning and morphological analysis to return a valid word (lemma).
3. An activation function in neural networks introduces non-linearity, allowing the network to learn complex patterns. It determines whether a neuron should “fire” based on a threshold, transforming the weighted sum of inputs into an output signal.
4. A dense layer is a standard layer type in neural networks where each neuron is connected to every neuron in the preceding layer. These layers learn complex relationships between features by adjusting the weights of these connections.
5. Stop words are common words in a language (e.g., “the,” “is,” “a”) that are often removed from text during NLP tasks. Removing them helps to focus on more meaningful words and reduce noise in the data.
6. A document term matrix (DTM) in NLP is a matrix that represents the frequency of words in a collection of documents. It is used to quantify and compare documents based on their word content, enabling various text analysis tasks.
7. A perceptron consists of inputs, weights, a summation function, and an activation function. It calculates a weighted sum of inputs, applies the activation function to determine the output, and is the basic building block of neural networks.
8. Exploding gradients cause instability due to extremely large weight updates, while vanishing gradients hinder learning due to minuscule weight updates. Techniques like gradient clipping, truncated BPTT, and ReLU activation functions are used to mitigate these problems.
9. LSTMs (Long Short-Term Memory networks) are a type of RNN architecture designed to handle long-term dependencies by incorporating a cell state and gates (forget, input, output) to regulate information flow, thus addressing vanishing gradient problems.
10. In GANs (Generative Adversarial Networks), the generator creates synthetic data (e.g., images), while the discriminator evaluates whether the data is real or fake. They compete in a zero-sum game, improving each other until the generator produces highly realistic data.
Essay Questions
1. Discuss the role of transfer learning in deep learning. How does it improve efficiency and performance, and what are some of its limitations?
2. Explain the process of training a deep neural network, including the concepts of forward propagation, backpropagation, loss functions, and optimization algorithms.
3. Compare and contrast different types of neural network architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers.
4. Discuss the ethical considerations surrounding the development and deployment of AI technologies, including bias, privacy, and job displacement.
5. Describe the application of AI in a specific industry (e.g., healthcare, finance, transportation) and discuss the potential benefits and challenges associated with its adoption.
Glossary of Key Terms
- Activation Function: A function in a neural network that introduces non-linearity, determining whether a neuron should “fire” or not.
- Adam Optimizer: An optimization algorithm used to update the weights of a neural network during training, combining the benefits of AdaGrad and RMSProp.
- Artificial Neural Network (ANN): A computational model inspired by the structure and function of biological neural networks, used for machine learning and deep learning.
- Backpropagation: An algorithm used to train neural networks by calculating the gradient of the loss function with respect to the network’s weights and biases.
- Convolutional Neural Network (CNN): A type of neural network designed for processing grid-like data, such as images, using convolutional layers.
- Dense Layer: A fully connected layer in a neural network where each neuron is connected to every neuron in the preceding layer.
- Document Term Matrix (DTM): A matrix representing the frequency of words in a collection of documents, used for text analysis.
- Epoch: One complete pass through the entire training dataset during the training of a neural network.
- Generative Adversarial Network (GAN): A type of neural network architecture consisting of two networks (generator and discriminator) that compete against each other.
- Lemmatization: The process of reducing words to their base or dictionary form (lemma) using morphological analysis.
- Long Short-Term Memory (LSTM): A type of recurrent neural network architecture designed to handle long-term dependencies in sequential data.
- Natural Language Processing (NLP): A field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language.
- NumPy: A Python library used for numerical computations, providing support for multi-dimensional arrays and mathematical functions.
- Optimizer: An algorithm used to adjust the parameters of a machine learning model to minimize the loss function.
- Perceptron: A single-layer neural network that performs binary classification by learning a linear decision boundary.
- ReLU (Rectified Linear Unit): A commonly used activation function in neural networks, defined as f(x) = max(0, x).
- Recurrent Neural Network (RNN): A type of neural network designed for processing sequential data, such as text or time series.
- Stemming: The process of reducing words to their root form by chopping off prefixes or suffixes.
- Stop Words: Common words in a language (e.g., “the,” “is,” “a”) that are often removed from text during NLP tasks.
- TensorFlow: An open-source software library for machine learning and deep learning, developed by Google.
- Tokenization: A process in natural language processing that involves breaking down a text into smaller units called tokens (words, phrases, symbols).
- Truncated Backpropagation Through Time (TBPTT): A variant of backpropagation through time used to train recurrent neural networks by limiting the number of time steps considered during backpropagation.
AI, ML, and NLP: Concepts and Applications

Okay, here’s a briefing document summarizing the main themes and important ideas from the provided document excerpts:

Briefing Document: Analysis of AI and Machine Learning Concepts

Overview: The document excerpts cover a wide range of topics within the fields of Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP). It provides introductions to fundamental concepts, tools, techniques, and use cases within these domains. The material seems designed for instructional purposes, offering practical examples and code snippets to illustrate the concepts.

Key Themes and Ideas:
1. Introduction to Python and Essential Libraries:
- The document begins with the basics of Python setup and introduces key libraries for data science and ML:
- NumPy: “It makes complex mathematical implementations very simple right it’s mainly known for computing mathematical data so numai is a package that you should be using for any sort of statistical analysis or data analysis that involves a lot of math.” NumPy is essential for numerical computation and array manipulation.
- Pandas: Used for data processing and working with data in CSV format. Example provided of loading a CSV file for weather prediction.
- TensorFlow: “Tensorflow is nothing but a python library for implementing deep learning models.” A core library for building and training deep learning models.
- Matplotlib: “mat plot lab is used for visualization.” Used for creating plots and visualizations of data.
1. Machine Learning Fundamentals:
- Classification: Described with an example of predicting rain: “learning model has to classify the output into two classes that is either yes or no yes will stand for it will rain tomorrow and no will basically denot that it will not rain tomorrow right this is a classification problem.”
- Neural Networks:Dense Layers: “a dense layer is standard layer type that works for most cases right in a dense layer all the neurons in the layer will receive input from all the neurons in the previous layer.”
- Activation Functions: “activation function it is nothing but in order to provide a threshold so if your output is above the threshold then only this neuron will fire otherwise it won’t fire.” Examples mentioned: step function, sigmoid function, ReLU (Rectified Linear Unit) function (“we want to remove all the negative values from our output that we got through the convolution layer”).
- Model Training: Includes steps for defining a model, compiling it with an optimizer (Adam is mentioned: “The optimizer equal to Adam uses the Adam Optimizer an efficient and widely used algorithm for optimizing neural networks”), a loss function (e.g., “pass underscore categorical underscore cross entrophy cross entropy”), and metrics (e.g., “metrix equal to accuracy”). The fit function is used for training.
- Evaluation: The evaluate function is used to evaluate the model’s performance on test data.
- Convolutional Neural Networks (CNNs):Image Processing: Explains how computers interpret images using pixel values and color channels (RGB). Image size is represented as “B cross a cross 3” where B is rows, A is columns, and 3 is the color channels.
- Feature Extraction: Discusses using filters to extract features from images: “we are going to put this particular feature on our image of X all right and we are going to multiply the corresponding pixel values.”
- Max Pooling: A method for reducing the size of the image and retaining the most important information. “we are reducing the size of our image…we have taken a window size of 2 cross2 so when we keep this window at this particular position we see that one is the highest value so we going to keep one here.”
- Object Detection: Mentions YOLO (You Only Look Once) and SSD (Single Shot Detector) as algorithms used for object detection in applications like self-driving cars and security systems.
- Artificial Neurons (Perceptrons): Explanation of how a single artificial neuron, or perceptron, works, including inputs (X1, X2,…Xn), corresponding weights (W1, W2,…Wn), a weighted sum, and an activation function to determine if the neuron “fires.” The importance of assigning weights to different factors or inputs in a neuron and how a computer decides whether to increase or decrease a weight.
- Back Propagation: Involves calculating the change in error with respect to variables like weight to adjust the weights and reduce the error. “we are trying to reduce the error so for that we need to figure out what will be the change in error if my variables are changed.” A graph of square error versus weight is used to determine the correct weight value.
1. Natural Language Processing (NLP):
- Applications: NLP is used by Netflix to “understand the type of movies that a person likes by the way a person has rated the movie or by the way the person has reviewed a movie so by understanding what type of review a person is giving to a movie Netflix will recommend more movies that you like.”
- Tokenization: Breaking down sentences into individual words or tokens.
- Stemming and Lemmatization: Techniques for reducing words to their root form.
- Stemming: “stemming algorithm basically does that it works by cutting off the end or the beginning of the word and taking into account a list of common prefixes and suffixes that can be found in an inflicted word.” Limitations of stemming are mentioned, as it can sometimes result in inaccurate root words.
- Lemmatization: “lemmatization on the other hand takes into consideration the morphological analysis of the words it does not randomly cut the word in the beginning and the ending it understands what the word means and only then it cuts the word.”
- Stop Words: Commonly used words that are often removed from text for analysis.
- Document Term Matrix (DTM): A matrix showing the frequency of words in a particular document.
- Natural Language Generation: Includes having a brief plan about the text, sentence planning, and text realization.
1. Recurrent Neural Networks (RNNs) and LSTMs:
- Recurrent Neural Networks (RNNs): Explains the concept of recurrent neural networks, where the output at a given time step (t) depends on the input at that time and the information from the previous time step (t-1).
- Long Short-Term Memory (LSTM) Networks: LSTMs address the limitations of traditional RNNs, such as vanishing and exploding gradients. The key to LSTM is the cell state, which is a horizontal line running through the top of the diagram. Discusses the forget gate layer, sigmoid layer, and tan layer. The four steps of LSTM are:
- Deciding what information to throw away from the cell state.
- Deciding what new information to store in the cell state.
- Combining the information to update the cell state.
- Getting the new output.
- Use Case: LSTM is used to predict the next word in a sentence.
1. Generative AI and Tools:
- Generative Adversarial Networks (GANs): Discusses the generator and discriminator components of GANs. “from random noise the generator generates an image which is evaluated by the discriminator that whether it’s a real or a fake image after evaluating the discriminator will send a feedback to the generator.”
- Google AI and Gemini: Describes how to set up a Conda environment and configure the Google AI API key using Python code, specifically working with the Gemini model. Code snippets are provided.
- Text Prediction: Describes how language models predict text by calculating probabilities for each possible word based on their likelihood in context, stating that language models trained on massive amounts of text gain a wider vocabulary and more nuanced understanding of language patterns.
- Image Generation with Parameters: Explores parameters in image generation:
- Aspect Ratio: Modifying the height and width ratio of an image (e.g., “16 by9”).
- Negative Prompting: Removing specific objects from an image (e.g., “clouds”).
- Stylize: Controls the imagination of the image.
- Chaos: A higher value of this parameter leads to unexpected and unique outcomes.
1. AI-Assisted Coding and Development:
- GitHub Copilot: Describes GitHub Copilot and its capabilities, including code completion, error fixing (“fix this option”), and answering questions about code.
- ChatGPT: One of the best things about Chip is that it gives free access to AI content development.
- Grammarly: A great tool for improving product description.
1. Other AI Concepts:
- Expert Systems: A computer system that mimics the decision-making ability of a human being.
- Fuzzy Logic Systems: Unlike traditional systems that give binary outputs, fuzzy logic systems can provide outputs with degrees of truth or certainty.
- Markov Decision Process: Discusses the components of a Markov decision process, including states, actions, rewards, policy, and value, and explains how an agent takes actions to transition between states while receiving rewards.
- Relationship to Human Brain: Neural Networks are similar to the human brain as just like how our brain contains billions of neurons similarly artificial neural networks contain multiple perceptrons. Dendrites, which receive input signals in the brain, are analogous to the input layer in artificial neural networks.
Quotes Demonstrating Practical Application:
- Example of setting up a Conda environment: “cond create hyphen P virtual environment which is V EnV Python and equal equal to we are using 3.10 which is the python version and give hyphen y”
- Example of installing TensorFlow: “pip install tensorflow”
- Example of code to load the MNIST dataset using TensorFlow: “train underscore images comma train uncore labels and give comma and again inside the bracket let us type testore images comma testore labels and equal to TF dot caras dot data sets dot mist. loore data”
- Example of defining a neural network model: “my model equal to TF dokas do models do sequential function”
Overall Impression:

The excerpts provide a valuable introduction to core AI, ML, and NLP concepts, offering a blend of theoretical explanations and practical examples, making it suitable for individuals learning or exploring these fields. The inclusion of code snippets and tool demonstrations enhances the material’s utility for hands-on learning.

AI, ML, and NLP: Concepts Explained

FAQ on Artificial Intelligence and Machine Learning Concepts

Here’s an 8-question FAQ based on the provided source material, covering key concepts in AI, machine learning, and natural language processing.

Question 1: What is NumPy and why is it important in machine learning?

NumPy is a Python library primarily used for numerical computations. Its most important feature is its support for multi-dimensional arrays. It simplifies complex mathematical implementations and is commonly used for statistical and data analysis, especially when handling large datasets. In machine learning, it is critical for handling data inputs and performing operations on tensors.

Question 2: How do classification models work, and what is a “target variable”?

Classification models categorize output into distinct classes (e.g., yes/no, cat/dog). The model learns from input variables (features) to predict the “target variable,” which is the variable we are trying to predict. An example is predicting whether it will rain tomorrow (target variable: “rain tomorrow”) based on various weather conditions (features: temperature, humidity, wind speed, etc.).

Question 3: Explain the process of building a neural network model, including layers and activation functions.

Building a neural network model involves creating layers, each with weights corresponding to the following layer. Dense layers are standard layer types for most cases, where all neurons are connected to each other. Activation functions (e.g., step function, sigmoid function, ReLU) introduce thresholds; a neuron “fires” only if its output exceeds this threshold. Training involves comparing the model’s output with the desired output and adjusting weights through a process like backpropagation to minimize the error.

Question 4: What is Natural Language Processing (NLP), and what are techniques like stemming and lemmatization used for?

NLP is a field focused on enabling computers to understand and process human language. Stemming simplifies word analysis by removing prefixes and suffixes to find the root form (e.g., “detecting,” “detected” become “detect”). Lemmatization, on the other hand, takes the morphological analysis of words into account, grouping together inflected forms of a word (e.g., “gone,” “going,” “went” become “go”). Lemmatization produces a proper word, while stemming may not. Stop words are common words removed to focus on more significant terms.

Question 5: What is a Document Term Matrix (DTM) and how is it used in NLP?

A Document Term Matrix (DTM) is a matrix that shows the frequency of words in a particular document. It helps understand if specific words are present in documents by assigning a numerical value that corresponds to the frequency of each word in each document.

Question 6: What are Convolutional Neural Networks (CNNs) and what are some of their applications?

CNNs are a type of neural network commonly used for image recognition and processing. They use filters to detect specific features in an image and ReLU functions to remove negative values from the output. Pooling reduces the size of the image while preserving important information. CNNs have applications in self-driving cars (detecting pedestrians), security systems (facial recognition), medical imaging (detecting anomalies), and satellite imagery (monitoring deforestation).

Question 7: Explain the concept of backpropagation and how it’s used to train neural networks.

Backpropagation is a process used to train neural networks by calculating the gradient of the loss function with respect to the network’s weights and biases. It involves computing the error between the predicted output and the actual output, then adjusting the weights to minimize this error. The process iteratively adjusts the weights until the network’s performance improves and the error is minimized.

Question 8: What are Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), and how do they address limitations of standard neural networks in processing sequential data?

RNNs are designed for processing sequential data (e.g., text, time series). They have feedback loops that allow information to persist across time steps. LSTMs are a special type of RNN that addresses the vanishing gradient problem, which can occur in standard RNNs when dealing with long sequences. LSTMs have memory cells that can store information over extended periods, making them suitable for tasks like natural language processing where long-term dependencies are important.

Artificial Intelligence: Foundations, Applications, and Ethics

AI, or artificial intelligence, uses advanced computer programs to mimic human thinking, enabling learning from data, complex problem-solving, and decision-making. Ongoing research aims to improve AI abilities and ensure its responsible use.

Key aspects of AI:
- Definition: AI is a branch of computer science focused on creating systems that perform tasks normally requiring human intelligence.
- Capabilities: These tasks include understanding natural language, recognizing patterns, making decisions, and learning from experience.
- Methods: AI employs methods like machine learning, deep learning, and natural language processing.
- Impact: It is revolutionizing fields like healthcare, finance, transportation, and entertainment.
- Challenges: AI development brings challenges such as data biases, ethical issues, and transparency concerns.
- Real-world applications:Cybersecurity: AI plays a vital role in cyber security by detecting threats.
- Content recommendation: AI enhances personalized entertainment experiences on platforms like Netflix and Spotify.
- Healthcare: AI is used for analyzing medical images and predicting health risks.
- Marketing: AI improves marketing strategies and customer experiences.
- Retail: AI personalizes shopping experiences and optimizes inventory management.
- Automotive Industry: AI is integral to design, development, and operation of vehicles.
AI is a broad field with different domains and branches, including machine learning, deep learning, natural language processing, robotics, expert systems, and fuzzy logic.
- Machine learning is a subset of AI that enables computers to make data-driven decisions and improve over time when exposed to new data.
- Deep learning and neural networks are also domains of AI.
Stages and Types of AI AI is structured along three evolutionary stages:
- Artificial Narrow Intelligence (ANI): Also known as weak AI, it focuses on specific tasks. Examples include Alexa and self-driving cars.
- Artificial General Intelligence (AGI): Also known as strong AI, it involves machines possessing the ability to think and make decisions like human beings.
- Artificial Super Intelligence (ASI): This is a hypothetical stage where computers’ capabilities surpass human intelligence.
AI can also be categorized into four types based on functionality:
- Reactive Machines AI: Operates based on present data without forming inferences.
- Limited Memory AI: Can make decisions based on past data.
- Theory of Mind AI: Focuses on emotional intelligence and understanding human thoughts, but is not yet fully developed.
- Self-Aware AI: Machines possess their own consciousness, which is a currently far-fetched concept.
History of AI The concept of AI dates back to classical ages with machines and mechanical men in Greek mythology.
- 1950: Alan Turing proposed the Turing Test to determine if a computer can think intelligently like a human.
- 1951: The era of game AI began with computer scientists developing programs for checkers and chess.
- 1956: John McCarthy coined the term “artificial intelligence”.
- 1959: The first AI laboratory was established at MIT.
Generative AI Generative AI is a type of AI that can produce new content, such as text, images, and audio.
- Applications: Generative AI has various applications across industries including text generation, language translation, business insights, music composition.
- Prompt Engineering: Prompt engineering involves creating effective prompts or instructions to guide AI systems to produce the expected outcome.
- It improves model performance, customization, and reliability.
- Clear and tailored prompts help AI models produce accurate and relevant content.
- Effective prompts should be clear, provide context, show examples, and be concise.
- Large Language Models (LLMs): Models like Google’s Sparm and Meta’s Llama drive applications such as chatbots and language translation by learning from data to predict and generate text sequences.
AI Ethics AI ethics refers to the principles and practices that ensure AI systems are developed and used ethically, without bias, and with transparency and accountability.
- Core Principles: Fairness, reliability and safety, privacy and security, accountability, and transparency.
- Implementation:
- Define goals and expectations for the AI.
- Collect necessary data and information.
- Select appropriate tools to enhance AI capabilities.
- Create fair and ethical models.
- Train the system to make ethical decisions.
- Evaluate the AI system to ensure fairness.
- Deploy the AI solution ethically.
AI in Business AI is transforming businesses by automating tasks, analyzing data, and predicting customer needs and market trends.
- Benefits: Efficiency, cost savings, personalization, and better decision-making.
- Use Cases:
- Marketing and Sales: AI personalizes marketing campaigns, recommends products, and generates content.
- Human Resources and Finance: AI streamlines recruitment, improves employee onboarding, detects fraud, and manages risk.
AI in Web Development AI is also transforming web development by simplifying workflows and boosting efficiency.
- AI Tools: Conversational AI (ChatGPT), AI-powered code suggestions (GitHub Copilot), AI website builders (Wix ADI), UI design tools (Galileo AI).
- Advantages: Automated testing, improved SEO, better user experience, and faster development.
AI in Manufacturing AI is transforming production processes in the manufacturing sector.
- Key Segments: Predictive maintenance, quality control and inspection, and supply chain management.
- Benefits: Energy efficiency, customization, and cost reduction.
Machine Learning: Definitions, Process, Types, Problems, and Tools

Machine learning (ML) is a subset of AI that enables computers to act and make data-driven decisions to carry out certain tasks. These programs or algorithms are designed to learn and improve over time when exposed to new data. The term “machine learning” was coined by Arthur Samuel in 1959.

Key aspects of machine learning:
- Definition: Machine learning provides machines with the ability to learn automatically and improve from experience without being explicitly programmed.
- Data-driven Decisions: ML enables computers to act and make decisions based on data.
- Algorithms: ML employs algorithms that learn and improve with exposure to new data.
- Relationship to AI: Machine learning is a subset of AI, focusing on algorithms that allow machines to learn from data.
- Solving Problems: The basic aim of machine learning is to solve problems or find solutions by using data.
The Machine Learning Process The machine learning process involves building a predictive model to find a solution for a particular problem. A well-defined machine learning process has around seven steps:
1. Defining the Objective: Understand what needs to be predicted.
2. Data Gathering/Collection: Collect data relevant to the problem.
3. Data Preparation: Prepare and preprocess the data.
4. Data Exploration/Exploratory Data Analysis (EDA): Understand patterns and correlations in the data.
5. Building a Machine Learning Model: Use insights from data exploration to build the model. Split the data set into training and testing data.
6. Model Evaluation and Optimization: Test the model’s efficiency using the testing data set.
7. Predictions: Use the model to make predictions.
Types of Machine Learning There are three main approaches to machine learning:
- Supervised Learning: Machines are trained using labeled data. Algorithms include linear regression, logistic regression, and support vector machines.
- Unsupervised Learning: Machines are trained on unlabeled data without guidance. K-means clustering is a common algorithm.
- Reinforcement Learning: An agent interacts with an environment to learn through trial and error, producing actions and receiving rewards. Q-learning is a key algorithm.
Types of Problems Solved by Machine Learning:
- Regression: The output is a continuous quantity (e.g., predicting the speed of a car).
- Classification: The output is a categorical variable (e.g., predicting rain occurrence).
- Clustering: Used in unsupervised learning to solve clustering problems.
Limitations of Machine Learning
- High Dimensional Data: ML algorithms struggle with high-dimensional data.
- Feature Extraction: ML requires manual feature extraction, which can be tedious.
Machine Learning Tools in Python
- TensorFlow: An open-source library developed by Google, used in machine learning applications. It helps visualize each part of a graph.
- Scikit-learn: A Python library associated with NumPy and SciPy, is useful for complex data analysis and feature extraction. It is used for implementing standard machine learning and data mining tasks like reducing dimensionality, classification, regression, clustering, and model selection.
- NumPy: Popular for machine learning tasks in Python.
- Keras: Runs smoothly on both CPU and GPU and supports neural network models, is completely Python-based.
- Natural Language Toolkit (NLTK): An open-source Python library mainly used for natural language processing, text analysis, and text mining.
Deep Learning: Definition, Functionality, Applications, and Tools

Deep learning is a particular kind of machine learning that is inspired by the functionality of brain cells called neurons, which led to the concept of artificial neural networks. It is based on the concept of neural networks. Deep learning models are capable of learning to focus on the right features by themselves requiring minimal human intervention, meaning that feature extraction will be performed by the deep learning model itself.

Key aspects of deep learning:
- Definition: Deep learning is a collection of statistical machine learning techniques used to learn feature hierarchies based on the concept of artificial neural networks.
- Neural Networks: Deep learning is based on neural networks with multiple layers.
- Feature extraction: Deep learning models are capable of learning to focus on the right features by themselves requiring minimal human intervention.
- Relationship to AI and ML: AI is a broader umbrella under which machine learning and deep learning come. Deep learning is a subset of machine learning and the next evolution of machine learning.
- Functionality: Deep learning mimics the basic component of the human brain called the brain cell, also known as a neuron. Inspired by a neuron, an artificial neuron was developed.
How Deep Learning Works Deep learning is implemented with the help of neural networks, and the motivation behind neural networks are neurons, which are brain cells. A deep neural network will have three layers:
- Input layer: Receives all the inputs.
- Hidden layers: Layers between the input and output layers.
- Output Layer: Provides the desired output.
The number of hidden layers in a deep learning network will depend on the type of problem and the available data.

Advantages of deep learning:
- Feature extraction: Deep learning models are capable of learning to focus on the right features by themselves requiring minimal human intervention. The model itself will learn which features are most significant in predicting the output.
- High dimensional data: Deep learning is mainly used to deal with high dimensional data and is often used in object detection and image processing.
Applications of Deep Learning
- Fraud detection: Deep learning is used to identify any possible fraudulent activities.
- Face verification: Facebook makes use of deep learning technology for face verification.
- Self-driving cars: Deep learning is used in self-driving cars.
- Object detection: Deep learning is used for object detection systems, enabling safe navigation and supports decision making models.
- Image Creation: Deep learning advances image creation, text generation, and audio synthesis within the field of generative AI.
- Medical field: Deep learning has applications for disease diagnosis by analyzing medical images and patient data.
Deep Learning Tools
- TensorFlow: A popular open source framework developed by Google for building and training machine learning models.
- Keras: The simplest package to implement neural networks. Keras runs smoothly on both CPU and GPU, supports neural network models, and is Python-based, making it easy to debug.
- PyTorch: Is more research-focused, favored for its dynamic computational graphs and ease of experimentations.
- Theano: Designed to handle computations required for large neural network algorithms.
Limitations of Machine Learning That Deep Learning Addresses
- High dimensionality of data: Deep learning models can generate the features on which the outcome will depend on.
- Manual feature extraction: Deep learning models are capable of learning to focus on the right features by themselves requiring little guidance from the programmer.
Python for AI, ML, and Data Science

Python is a popular programming language often used in the fields of AI, machine learning, and data science. It is considered the most popular and most used language for data science, AI, machine learning, and deep learning.

Key aspects of Python:
- Readability and Simplicity: Python’s syntax is similar to the English language, making it easy to learn and understand. Its simple syntax can be used to solve both simple and complex problems.
- Less Coding: Python requires less coding compared to other languages. Python uses something known as “check as you code” methodology, which eases the process of testing.
- Pre-built Libraries: Python has pre-defined libraries for machine learning and deep learning algorithms, making it convenient for AI developers because the algorithms are already prebuilt in libraries. Instead of coding each algorithm, you can call the function and load the library.
- Platform Independence: Python allows projects to run on different operating systems, with packages like Pi installer addressing dependency issues when transferring code between platforms.
- Massive Community Support: Python has many online communities, forums, and Facebook groups that can help with errors or problems in the code.
Python Packages for AI, ML, and NLP:
- TensorFlow: An open-source library developed by Google, commonly used for machine learning projects. It allows easy visualization of each part of the graph.
- Scikit-learn: A Python library associated with NumPy and SciPy, useful for complex data analysis and feature extraction. It is used for implementing standard machine learning and data mining tasks like reducing dimensionality, classification, regression, clustering, and model selection.
- NumPy: A popular library for machine learning in Python, used internally by TensorFlow and other libraries for performing multiple operations on tensors. Its array interface supports multi-dimensional arrays. NumPy makes complex mathematical implementations simple and is known for computing mathematical data.
- Theano: A computational framework used for computing multi-dimensional arrays that works similarly to TensorFlow. It was designed to handle the types of computations required for large neural network algorithms and is considered an industry standard for deep learning research and development.
- Keras: A popular Python package with functionalities for compiling models, processing data sets, and visualizing graphs. It is simple to implement neural networks with Keras, which runs smoothly on both CPU and GPU.
- Natural Language Toolkit (NLTK): An open-source Python library mainly used for natural language processing, text analysis, and text mining.
To set up Python for AI development:
1. Install Python: Download the latest version of Python from the official website and follow the installation instructions. Make sure to add Python to the system path during installation.
2. Install PyCharm: Download and install PyCharm, an IDE (Integrated Development Environment), from JetBrains. Choose the Community Edition, which is open source.
3. Configure PyCharm: During the PyCharm setup, create a desktop shortcut, update the content menu, and update the path version.
4. Connect Python with PyCharm: Open PyCharm and create a new project. Set the environment to a virtual environment and select the Python version.
5. Write your first Python program: Right-click on the new project, select “New,” and choose “Python File”. Give the file a name (e.g., “demo.py”) and press Enter. Then, type print(“Hello, World!”) and run the code.
TensorFlow: An Overview of Google’s Machine Learning Framework

TensorFlow is a powerful open-source machine learning framework developed by Google and is a toolkit for creating artificial intelligence systems. TensorFlow is a versatile platform that empowers developers to seamlessly transform AI and ML ideas into scalable solutions.

Key aspects of TensorFlow:
- Versatility and Flexibility: TensorFlow enables developers to build a wide range of models with customizable implementations. It offers APIs ranging from high-level Keras for simplicity to low-level APIs for advanced customization, catering to diverse developer needs.
- Scalability: TensorFlow’s scalability allows it to handle massive datasets and complex models efficiently, making it ideal for large-scale AI systems in applications like image recognition and natural language processing.
- Ecosystem: TensorFlow has a large and established ecosystem with an active community, extensive documentation, and a proven track record. Its rich ecosystem includes pre-trained models and numerous resources that simplify its adaptation and usage.
- Cross-platform support: TensorFlow enables seamless deployment across different operating systems and hardware platforms.
- Optimized Performance: TensorFlow runs efficiently on CPUs, GPUs, and TPUs, ensuring faster training and inferences times.
- Tensors and Computational Graphs: TensorFlow utilizes tensors (multi-dimensional arrays) and computational graphs to perform operations, making it adaptable and scalable for various machine learning tasks.
- Visualization and Debugging Tools: TensorFlow features visualizations and debugging tools that enhance model understanding and troubleshooting.
Key Capabilities of TensorFlow:
- Open Source and Community-Driven: TensorFlow is an open-source community-driven framework that evolves through contributions.
- Tensors: TensorFlow utilizes tensors, multi-dimensional arrays, for efficient data representation and manipulation.
- Flexible Architecture: Its flexible architecture allows developers to choose between static graphs for optimized performance and eager execution for an interactive development experience.
- Versatility: TensorFlow supports a wide range of applications including natural language processing, generative AI, computer vision, and more.
- Cross-Platform Compatibility: It offers cross-platform compatibility, running efficiently on CPUs, GPUs, and TPUs, enabling developers to leverage the best hardware for their needs.
Real-World Applications of TensorFlow:
- Computer Vision: TensorFlow is used for identifying objects in images with algorithms like YOLO and SSD, enabling tasks such as detecting pedestrians and obstacles in self-driving cars or identifying suspicious objects in security systems. It aids in analyzing X-rays or MRIs to detect anomalies and assist in monitoring deforestations, identifying land use patterns, and predicting natural disasters. Additionally, TensorFlow powers security systems and user authentication, enabling facial recognition for tasks like facial deduction and identifications.
- Natural Language Processing: TensorFlow is instrumental in tasks like spam detection and sentiment analysis, where it helps identify spam emails and determine the emotional tone of text, such as customer reviews or social media posts. It provides services like Google Translate, enabling accurate translations between numerous languages and facilitating global communications.
- Generative AI: TensorFlow powers GANs and similar models, enabling the creation of realistic images, art, and even the manipulation of existing visuals. It facilitates the creation of deep fake audio and speech generation, producing synthetic media where a person’s likeness or voice can be convincingly replicated.
- Healthcare: TensorFlow is used for predicting analytics to forecast disease outbreaks, identify high-risk patients, and optimize treatment plans, as well as for medical image analysis to detect anomalies in X-rays, MRIs, and CT scans.
- Autonomous Vehicles: TensorFlow powers object detection systems, enabling safe navigation and supports decision-making models.
- Finance: TensorFlow is utilized for algorithmic trading, analyzing market trends, and detecting fraudulent transactions.
- Retail: Retail applications include inventory management to predict demand and reduce stockouts along with personalized recommendations to enhance customer experience and boost sales.
- Entertainment: TensorFlow facilitates content creation, such as generating music or art, and it is used in video and audio processing tasks like noise reduction and voice stabilization.
Comparison with Other Frameworks: TensorFlow is known for its flexibility and versatility, enabling developers to build a wide range of models with customizable implementations, while PyTorch is recognized as intuitive and Pythonic, offering a user-friendly approach. TensorFlow excels with robust tools for deploying models in real-world environments, whereas PyTorch is more research-focused, favored for its dynamic computational graphs and ease of experimentation.

Installing TensorFlow: To get started with TensorFlow, you first need to install the necessary prerequisites, including Python 3.5 or a higher version. You can use a package manager like pip. To install TensorFlow, you can run pip install tensorflow. To ensure that TensorFlow has been installed successfully, you can verify the installation by running the following command: python -c “import tensorflow as tf; print(tf.__version__)”. This will display the installed version of TensorFlow, confirming that the installation was successful.

TensorFlow Ecosystem: The TensorFlow ecosystem provides a comprehensive set of tools for building, training, and deploying machine learning models. At its core is TensorFlow, the foundation of the ecosystem. TensorFlow Lite enables running models on mobile and embedded devices, while TensorFlow Extended supports building production-grade ML pipelines, including data validation and model serving. The TensorFlow Model Garden offers pre-trained models and examples for tasks like image classification and NLP. TensorFlow.js allows running ML models in web browsers, and TensorFlow Hub provides a library of pre-trained models for easy integration into projects.

Building a Churn Prediction Model Using TensorFlow: There are three main steps in building a churn prediction model using TensorFlow:
- Model Creation: Create a model by defining its architecture, including layers and parameters tailored to the specific problem of prediction customer churn.
- Model Training: Train the model using historical data, where it learns patterns and relationships that help predict customer behavior.
- Prediction: Use the trained model to make predictions, identifying customers likely to churn based on input data.
Artificial Intelligence Full Course – 10 Hours | Artificial Intelligence Tutorial 2025 | Edureka

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 17, 2025
CES 2025: Robotics, EVs, and Beyond
CES 2025 showcased numerous technological advancements, primarily in robotics, with companies unveiling humanoid robots for various applications, including healthcare, education, and entertainment. Electric vehicles and sustainable transportation were also highlighted, with several companies presenting innovative concepts and prototypes. Additionally, the exhibition featured advancements in artificial intelligence, demonstrated through robots capable of complex tasks and natural interactions. Finally, innovative gadgets like robotic baristas, a flying scooter, and a unique unicycle further highlighted the breadth of technological innovation present.

CES 2025 Study Guide

Short Answer Quiz
1. What were the key themes of the CES 2025 exhibition?
2. Describe the primary features of RealBtics’ robot “Melody,” including its price.
3. How does RealBtics’ robot “Arya” use its eyes and what is unique about its design?
4. What is the purpose of the GR1 robot, developed by the Chinese company Forier Intelligence?
5. How does the Mage robot pianist create music and where might it find applications?
6. What is unique about the robotic surgeon Alfred from Engineered Arts and what roles could it fill in the future?
7. What are the key features of the Honda Zero SUV and when is its expected release date?
8. Describe the Aptera solar electric vehicle, including its range and charging capabilities.
9. What makes the Sense Robot Chess unique?
10. What are some of the key features of the Open Droids home robot R2D3?
Answer Key
1. The key themes of the CES 2025 exhibition included artificial intelligence, digital healthcare, energy transition, mobility, quantum technology, and sustainability.
2. Melody is a humanoid robot with enhanced functionality, adaptability, and an improved user experience, featuring patented skin technology for a realistic appearance. It is priced at approximately $150,000.
3. Arya’s eyes are hidden cameras used for visual recognition to identify people and objects and its face is magnetically detachable for easy replacement allowing for different appearances.
4. The GR1 robot is designed to assist in medical care, rehabilitation, and provide support for elderly individuals and people with disabilities.
5. The Mage robot pianist uses advanced robotics and AI to analyze and reproduce music, and it could be used in educational institutions, concert halls, and museums.
6. Alfred is an ultra-realistic robot that can engage in free conversations, understand many languages and accents, and could replace humans in fields like tourism and hotel administration.
7. The Honda Zero SUV is a midsize electric crossover built on a new EV architecture with a minimalist design and level-three autonomous driving system. Its expected release is in the first half of 2026.
8. The Aptera is a solar electric vehicle that can travel up to 65 km per day solely on solar energy and up to 650 km when fully charged.
9. Sense Robot Chess adapts to the player’s skill level, offers multiple game modes, and can be used remotely with other users.
10. The R2D3 is designed to manage household appliances, prepare meals, and learn from user preferences, integrating with smart home systems and providing voice control.
Essay Questions
1. Discuss the ethical implications of creating human-like robots, drawing on examples from the CES 2025 exhibits.
2. Analyze the role of artificial intelligence in the robotic innovations showcased at CES 2025, and how AI impacts the capabilities of these machines.
3. Compare and contrast the different types of robots presented at CES 2025, considering their purpose, design, and potential impact on society.
4. Evaluate the significance of electric vehicle innovations presented at CES 2025 for the future of transportation and their environmental impacts.
5. Examine how the advancements displayed at CES 2025 could address current societal challenges like healthcare, labor shortages, and mental health.
Glossary of Key Terms
- Artificial Intelligence (AI): The simulation of human intelligence in machines, enabling them to learn, problem-solve, and make decisions.
- Humanoid Robot: A robot with a body shape that resembles a human, often designed to interact with human environments and perform human-like tasks.
- Level Three Automated Driving System: A system that enables a vehicle to perform driving tasks autonomously under certain conditions, but requires the driver to be ready to take control.
- Electric Vehicle (EV): A vehicle that is powered by electricity rather than an internal combustion engine, usually using batteries.
- Solar EV: A vehicle that uses solar panels to recharge its batteries.
- Bionic Prosthesis: An artificial limb or body part that incorporates electronic or mechanical components to mimic natural movement and function.
- Tactile Feedback: The sensation of touch or pressure that is relayed back to the user through sensors, especially in robotic or prosthetic devices.
- Generative AI: Artificial intelligence that is able to generate new content such as text, images, music, or code.
- Mobility: The ability to move easily and freely, which is related to both transportation and the physical capabilities of robots.
- Autonomous Robot: A robot that can operate independently without human control or intervention after initial programming.
- Augmented Reality (AR): A technology that overlays digital information onto a user’s view of the real world, enhancing or modifying it.
- Omnidirectional Wheels: Wheels designed to move in any direction, enabling a vehicle or robot to move sideways, forward, or at an angle.
- Large Language Models (LLMs): A type of artificial intelligence that can learn to generate and understand language, which can improve a machine’s ability to interact with people.
CES 2025: Future Technologies Unveiled

Okay, here’s a detailed briefing document summarizing the key themes, ideas, and facts presented in the provided text about CES 2025:

CES 2025: A Deep Dive into Future Technologies

Executive Summary:

CES 2025 showcased a plethora of innovations across various tech sectors, highlighting significant advancements in robotics, artificial intelligence (AI), sustainable energy, electric vehicles (EVs), and personal mobility. The event emphasized a move toward more human-like robots, AI-powered personal assistants, and eco-conscious transportation. The sheer scale of the event, with over 4,500 companies from 160+ countries, underscores the rapid pace of technological development and the global collaboration driving it.

Key Themes:
1. Robotics & Artificial Intelligence:
- Humanoid Robots Taking Center Stage: A significant focus was on creating robots with human-like appearance, interaction capabilities, and emotional intelligence.
- Real bics’ Melody: “The robot is designed to closely resemble humans in appearance and social interaction… a highlight of Melody is its patented skin technology… which offers an exceptionally realistic appearance and tactile feel.”
- Real bics’ Arya: Designed to “engage in… more intimate conversations” and can be customized with interchangeable faces.
- Engineered Arts’ Ameca: Renowned for “realistic facial expressions and advanced communication skills.”
- Intbot’s Nyo: Designed for “personal use and interaction… with sarcasm, making Its Behavior resemble that of a teenager.”
- AI-Powered Interaction: Robots are increasingly using AI for natural language processing, emotion recognition, and adaptive behavior.
- NTT’s humanoid controlled by “Suzumi”: Demonstrates “advanced capabilities of generative artificial intelligence which not only understands and processes commands but also performs physical tasks.”
- Alfred the Robotic Surgeon: Can “engage in free conversations… answer any question… and is capable of understanding almost all languages of the world.”
- Robo Rock’s Saros z70: “Equipped with an updated navigation system… which allows it to efficiently avoid obstacles and learn to recognize new objects”.
- Agility Robotics’ Digit: Actively exploring the integration of large language models to program itself in response to verbal commands.
- Diverse Applications of Robots: Robots are being developed for a wide range of uses including healthcare, education, entertainment, personal assistance, industrial tasks, and even artistic performance.
- Forier Intelligence’s GR1: “Its primary purpose is to assist in Medical Care, Rehabilitation and provide support for elderly individuals.”
- Mage’s Robot Pianist: Capable of “Performing complex musical pieces with exceptional Precision.”
- Tombot’s Jenny: Designed “to support individuals with dementia, depression and other mental health disorders.”
- Enchanted Tools’ Moroi: “Moves on a spherical base, navigates spaces with precision, and can grasp objects with a success rate of 97%” and performing tasks such as delivering water and PPE at a care home.
- Stardust Intelligence’s Robotic Baristas: Equipped with “high Precision manipulators that amazed visitors with their coffee making abilities”.
- Advanced Robotic Capabilities: Robots showcased include enhanced movement, dexterity, emotional responsiveness, and the ability to adapt to varied environments.
- Ethical Considerations: The development of robots with advanced human-like traits raises questions regarding their purpose, potential impact on human society, and the nature of human-robot relationships.
1. Sustainable Energy & Transportation:
- Electric Vehicles (EVs) Dominate: The future of transportation is clearly electric, with numerous new EV concepts and production models.
- Honda Zero SUV and Saloon: Futuristic EV concepts with “level three automated Driving Systems.”
- LG’s Alpha-able Concept Car: A “Digital Cave that adapts to the needs of the passengers” with customizable interior.
- Mercedes-Benz Concept CLA Class: An “electric sedan… built on the new Mercedes-Benz modular architecture platform” with ultra fast charging capabilities.
- Aptera Solar EV: “The ability to travel up to 65 km a day entirely on solar energy.”
- Kia EV4: a compact electric car with up to 600km range.
- Focus on Efficiency & Range: Emphasis on maximizing range, charging speed, and energy efficiency in EVs.
- Mercedes-Benz Concept CLA Class: Achieves “exceptional efficiency through a 235 horsepower electric motor and an Innovative silicon oxide anode battery”.
- Aptera Solar EV: Solar panels enable daily range while the small battery extends total range to 650km.
- Alternative Transportation Modes: The exhibition showcased unique concepts like flying cars and electric scooters.
- Xong Arrow’s Xun X3: “Flying car with vertical takeoff and landing capabilities… can operate both on roads and in the air”.
- RoR’s Sky Rider X1: A “Flying electric scooter… can transform into a quadcopter.”
- Advancements in Battery Technology: With many vehicles noting “800 volt electric system enabling Ultra fast charging,” indicating improvements in battery technology are crucial for the advancement of EVs.
- Mercedes-Benz Concept CLA Class: “800 volt electric system enabling Ultra fast charging at up to 250 Kow adding 400 km of range in just 15 minutes.”
1. Personal Mobility & Smart Living:
- Home Robots & Assistants: Robots are becoming more integrated into daily life, providing convenience and assistance.
- Open Droids’ R2D3: “Designed to simplify everyday life… managing household appliances and even preparing meals.”
- Gai’s Mimo: “Visually it resembles a walk-in coffee table with a lamp… meant to ensure autonomy and the ability to move independently.”
- Robo Rock Saros Z70: a robot vacuum equipped with a robotic arm to handle small objects around the home.
- Personalized Experiences: Emphasis on creating technology that adapts to individual needs and preferences.
- Smart Home Integration: Robots are designed to seamlessly integrate with existing smart home ecosystems.
- Open Droids’ R2D3: integrates “seamlessly with existing smart home systems including voice assistants like Alexa and Google Assistant.”
- Advanced Prosthetics: Progress in bionic prosthetics offer those with disabilities a high level of functionality and natural movement.
- Psionic’s Robotic Prosthetic Hand: “Combines Advanced engineering Solutions and Innovative materials providing the user with a unique level of functionality and natural movement.”
1. Advanced Graphics & Gaming:
- Nvidia’s GeForce RTX 50 Series showcased the capabilities of next gen graphics cards.
- Sony’s Exo Suit: Inspired by the video game Horizon Zero Dawn, shows “realistic movement mechanics, intricate structure and striking design”.
1. Other Notable Innovations:
- Sense Robot Chess: A chess playing robot with “ability to adapt to the skill level of the player.”
- Species Corporation’s Kosaka Kakona: An “emotional robotic mannequin” that “demonstrates incredibly smooth and graceful movements”
- In Motion’s unicycle model with a powerful motor and extended range
- Booster Robotics’ T1: A “flexible and Agility robot… demonstrating dance movements and Kung Fu moves”.
Key Takeaways:
- Convergence of Technologies: The event highlights the convergence of robotics, AI, sustainable energy, and personal mobility, indicating a future where these technologies are deeply intertwined.
- Human-Centered Design: There is a strong emphasis on creating technologies that enhance human experiences, whether through personal robots, customizable transportation, or advanced prosthetics.
- Global Innovation: The large number of international companies participating showcases the truly global nature of technological advancement.
- Ethical Considerations: As technologies become more powerful, ethical considerations surrounding their development and application are increasingly important.
- The Pace of Progress: The sheer number of innovative products and technologies on display underscore the ever increasing rate of technological progress across multiple fields.
Conclusion:

CES 2025 provided a glimpse into the future, showcasing groundbreaking innovations poised to transform our lives in the coming years. The event highlighted the rapid advancements in robotics and AI, the push for sustainable mobility, and the growing integration of technology into our homes and personal lives. It’s clear that the tech industry is rapidly moving towards a more connected, personalized, and sustainable future. The ethical implications of these innovations remain a crucial area of discussion.

CES 2025: Technological Innovations

CES 2025: Frequently Asked Questions
1. What were the major technological themes and innovations showcased at CES 2025? CES 2025 featured a wide range of innovations, with key themes including artificial intelligence, digital healthcare, energy transition, mobility, quantum technology, and sustainability. The exhibition highlighted advancements in robotics, electric vehicles, personal transportation, and AI-driven tools across various sectors, from healthcare to entertainment.
2. Several humanoid robots were unveiled. What are some key features and differences between robots like Melody, Arya, Ameca, and GR1? Several humanoid robots made their debut, each with distinct features. Melody, by real bics, focused on realistic appearance and tactile feel with its patented skin technology and is designed for industries needing close human interaction. Arya, also by real bics, can have its face swapped, features visual recognition and is intended for more intimate conversations. Ameca by Engineered Arts is known for its realistic facial expressions and conversation skills, designed for scientific research. GR1, by Forier Intelligence, is a Chinese-developed robot focused on medical care and support for the elderly, boasting mobility and object recognition capabilities.
3. Beyond traditional humanoid robots, what other types of robots were highlighted and what were their intended purposes? CES 2025 showcased diverse robots beyond humanoids. Examples include: a robot pianist by Mage that can perform complex musical pieces, robot baristas, and a robotic prosthetic hand by psionic with tactile feedback capabilities. There were also robots focused on home assistance such as the R2D3, and support for those with mental health issues such as Jenny, the robotic puppy. Specialized robots, such as digit, were intended for industrial and logistical uses.
4. What were some of the most interesting innovations in personal and public transportation at CES 2025? Transportation at CES 2025 saw several groundbreaking concepts. Honda showcased prototypes of its upcoming electric crossover and sedan with advanced driver-assistance systems and a new OS, while LG presented a concept car focused on personalization and comfort. Aptera revealed a solar electric vehicle capable of traveling 65 km per day on solar power. There were flying car concepts, including the Xong X3 which has vertical takeoff and landing capabilities, and a flying electric scooter, the Sky Rider X1, which has a closed cabin that transforms into a quadcopter.
5. How is Artificial Intelligence being integrated into the showcased technologies? AI was a core component across various CES innovations. It was used in robots for facial recognition, emotional responses, advanced language processing, and dynamic decision making. AI also powered driving systems, assisted in personalized in-car experiences, enhanced musical performances, and provided assistance for everyday tasks, emphasizing its pervasive influence across sectors.
6. Besides the robots, what other unusual or innovative technologies were featured at CES 2025? Beyond robots and transportation, CES 2025 featured a range of unusual innovations. These included a customizable robot model named Mimo, that resembles a walk-in coffee table, an Exo Suit inspired by a game, and a chess-playing robot called Sense Robot Chess, that adapts to the skill level of its opponent. A robotic arm called Omni Grip, was included in a robot vacuum cleaner and there was also a robotic system with manipulators that can make coffee, and robotic dogs capable of jumping and performing acrobatic tricks.
7. What advancements were made in electric vehicle technology and how are manufacturers addressing range and charging concerns? Several manufacturers, including Mercedes-Benz, Honda, and Kia presented new electric vehicle concepts. Mercedes-Benz’s Concept CLA Class included an 800 volt system allowing for 400 km range in 15 minutes, with a 750km total range. Kia showcased their EV4, emphasizing quick charging and long range options. There was an emphasis on battery technology, charging efficiency, and integration of technology, and companies demonstrated that they are increasingly focusing on improving performance and convenience for users.
8. What role did sustainability play at CES 2025 and how was this theme integrated into various technologies? Sustainability was a notable theme at CES 2025. Solar electric vehicles, like the Aptera, showcased direct energy generation. Electric vehicles from Honda, LG, and Mercedes-Benz demonstrated an increasing shift towards environmentally friendly options. Companies also highlighted their efforts to incorporate sustainable materials, energy efficiency, and eco-conscious design principles into various technologies, emphasizing a commitment towards reducing environmental impact.
CES 2025: Innovations in Robotics, AI, and Transportation

CES 2025 showcased a wide array of innovations across various technology sectors. Here are some of the highlights:

Robotics and AI:
- Humanoid Robots: Several companies unveiled advanced humanoid robots, including RealBiotics with Melody, priced at $150,000, which has realistic skin and can be customized for different uses. They also presented Arya, a robot with swappable faces. Engineered Arts showcased Amecca, known for its realistic facial expressions, and also Alfred, a robot doctor capable of engaging in conversations and understanding multiple languages. Other notable humanoid robots include GR1 from Forier Intelligence, designed for medical care and assistance, and Nyo from intBot, a robot designed for personal interaction with emotional intelligence.
- Other Robots: There were many other types of robots demonstrated at CES 2025. A robotic pianist from Mage, capable of playing complex musical pieces. A robotic dog from Deep Robotics that can perform acrobatic tricks and navigate complex terrain. There were also robotic baristas from Stardust Intelligence, and robotic arms for vacuum cleaners from Robo Rock. There were also robots for home and personal use, such as r2d3, from Open Droids, and moroi from Enchanted Tools. Tombot presented Jenny, a robotic puppy intended to support those with dementia, depression and other mental health disorders.
- AI Integration: Many robots at CES 2025 utilized advanced AI. Some were integrated with language models like Suzumi, and could perform physical tasks simulating physical sensations. Agility Robotics is also exploring integrating large language models into their robot, Digit.
Transportation:
- Electric Vehicles: Honda presented prototypes of its electric crossover, the Honda Z SUV, and sedan, the Honda zero Saloon. LG showcased a concept car, Alpha able, with a customizable interior. Mercedes-Benz unveiled its Concept CLA Class, an electric sedan with a focus on efficiency and fast charging. There were new electric vehicle models announced from Kia, the EV4 and several concept vehicles, including the quintessenza electric pickup from eedle design .
- Solar Electric Vehicles: Aptera Motors displayed its solar electric vehicle, the Aptera, which can travel up to 65 km per day using solar energy.
- Flying Vehicles: Several flying vehicles were on display, including the xun X3 flying car from Xong Arrow, the AOH HT flying car, also from Xong, and the Sky Rider X1, a flying electric scooter from RoR.
Other Notable Technologies:
- Prosthetics: Psionic showcased an advanced robotic prosthetic hand with tactile feedback.
- Exosuits: Sony showed off a full-size exo-suit inspired by the game Horizon Zero Dawn.
- Home Robots: Many home robots were introduced with varied functionality, such as R2d3 from Open Droids that can manage household appliances, and mimo from gai, a customizable universal robot.
- Gaming Robots: Sense Robot Chess was presented with an AI that adapts to the skill level of the player.
- Graphics: Nvidia showed the capabilities of its GeForce RTX 50 Series graphics cards.
Key Themes:
- Artificial Intelligence: AI was a central theme, with applications ranging from humanoid robots to vehicle operating systems and even robot chess.
- Sustainability: Many companies focused on sustainable solutions, particularly in transportation and energy.
- Human-Robot Interaction: There was an emphasis on creating robots that can interact with humans naturally and provide support in various settings.
This comprehensive overview should give you a solid understanding of the key innovations and trends highlighted at CES 2025.

Humanoid Robots at CES 2025

CES 2025 featured a diverse range of humanoid robots, each with unique capabilities and intended applications.

Key Highlights of Humanoid Robots at CES 2025:
- RealBiotics’ Melody: This robot is designed to closely resemble humans in appearance and social interaction. It features patented skin technology, compatibility with various AI platforms (including ChatGPT), and is priced at approximately $150,000. Melody is intended for industries requiring close human interaction, such as healthcare, education, and entertainment. RealBiotics also presented Arya, a less expensive model ($75,000) that can move and has a swappable face. Arya’s eyes are hidden cameras for visual recognition, and it is capable of more intimate conversations.
- Engineered Arts’ Amecca: This robot is known for its realistic facial expressions and advanced communication skills. It can display a range of human emotions and engage in conversations, making it suitable for scientific research and showcasing AI potential. Amecca is priced at approximately $175,000. Engineered Arts also presented Alfred, a robot doctor, capable of free conversation and understanding multiple languages.
- Forier Intelligence’s GR1: This Chinese-developed robot is designed for medical care, rehabilitation, and support for the elderly and people with disabilities. It stands 1.65m tall, weighs 55kg, and can move at 5 kmph, ascend stairs, and carry loads. GR1 is equipped with AI for face, speech, and object recognition.
- IntBot’s Nyo: This robot is designed for personal use and interaction in a home environment. Nyo is notable for its ability to express emotions and communicate informally, sometimes with sarcasm, using multimodal language learning based on AI. Nyo is intended as a companion for people, providing emotional support and social interaction.
- NTT’s Humanoid Robot: This robot is controlled by the next-generation language model Suzumi and can perform physical tasks while simulating physical sensations. It can sense its environment, bringing it closer to mimicking human movements and decision-making.
- Agility Robotics’ Digit: This humanoid robot is designed to work alongside humans in tasks such as lifting and transporting objects. It can carry loads up to 16kg and operate autonomously for up to 16 hours. Agility Robotics is exploring the integration of large language models to allow Digit to respond to verbal commands.
- Engine AI’s SE01, SA01, and PM01: These robots are designed for diverse tasks and markets. SE01 is a multi-purpose robot for industrial tasks, SA01 is for research and education, and PM01 is a lightweight and highly dynamic robot with an open architecture.
- Booster Robotics’ T1: This robot demonstrated impressive flexibility and agility by performing various physical exercises such as push-ups, dance movements, and Kung Fu moves. It is equipped with high-tech sensors and a motion control system, and it can also walk, bend and kick a soccer ball. Booster Robotics suggests that this robot could replace athletes.
- Pollen Robotics’ I2: This open source humanoid robot, also called Richi2, is designed for interaction with the environment and people. It can manipulate objects with 7-degree of freedom manipulators and can move freely using omnidirectional wheels.
Common Themes and Features:
- Advanced AI: Many of these robots incorporate advanced artificial intelligence for natural language processing, visual recognition, and adaptive learning.
- Human-Like Interaction: A key focus is on creating robots that can interact with humans naturally, including the ability to understand and express emotions.
- Diverse Applications: The robots are designed for a wide range of uses, including healthcare, education, entertainment, industrial tasks, personal assistance, and companionship.
- Realistic Design: Many companies are focusing on making robots that look and feel more human.
- Mobility: The robots also feature different methods of mobility, from walking and running to wheels, and a spherical base.
These humanoid robots at CES 2025 represent significant advancements in robotics and AI, highlighting the potential for these technologies to impact various aspects of daily life and industries.

CES 2025: Electric Vehicle Innovations

CES 2025 showcased a variety of advancements in electric vehicle technology, with several manufacturers presenting new models and concepts.

Key Electric Vehicles and Concepts at CES 2025:
- Honda: Honda revealed prototypes of its upcoming electric vehicles, the Honda Z SUV and the Honda Zero Saloon. The Honda Zero SUV is a midsize crossover built on a new electric vehicle architecture and has a minimalist design. The Honda Zero Saloon is a flagship sedan characterized by a low height and wedge-shaped design. Both models will have Level Three automated driving systems and a new operating system called Asimo OS. Honda plans to begin mass production in North America in the first half of 2026, followed by Japan and Europe.
- LG: LG presented the Alpha able concept car, a vision for future mobility that combines advanced technologies with personalization, turning the car into a “digital cave” that adapts to passenger needs. The interior of the vehicle can be reconfigured using flexible, foldable, and transparent screens. LG also introduced new solutions for electric vehicle charging, such as the Ecentric system.
- Mercedes-Benz: Mercedes-Benz unveiled its Concept CLA Class, an electric sedan that marks a new era for the brand. This concept is built on the new Mercedes-Benz modular architecture platform and features an 800-volt electric system for fast charging, adding 400 km of range in 15 minutes. It has a total range exceeding 750 km and achieves efficiency through a 235-horsepower electric motor and a silicon oxide anode battery.
- Aptera Motors: Aptera Motors showcased the Aptera, a solar electric vehicle covered with solar panels. It can travel up to 65 km per day on solar energy and has a total range of 650 km with a fully charged 45 kWh battery. The company is taking pre-orders for the model, which is priced at $40,000.
- Kia: Kia introduced the EV4, a compact electric vehicle equipped with batteries that provide a range of up to 600 km. The EV4 supports fast charging, allowing the battery to recharge from 10% to 80% in 30 minutes. Production is set to begin in March 2025.
- Eedle Design: Eedle Design unveiled the electric pickup concept Quintessenza, which is not intended for mass production but demonstrates design capabilities for potential partners. The pickup features three motors, a 150 kWh battery and can accelerate to 100 kmph in less than 3 seconds.
- BMW: BMW showcased its Innovative Concepts from the Noya classa lineup which included a fully electric sedan and SUV. These vehicles are built on the new Noya classa platform and feature the BMW panoramic Vision system, projecting data across the entire width of the windshield. The vehicles are equipped with the sixth generation of BMW’s eDrive system with new battery architecture with round cells providing up to 30% more range and faster charging.
Key Themes and Trends:
- Sustainability: A significant emphasis was placed on sustainable solutions, with several companies focusing on electric and solar-powered vehicles.
- Advanced Charging Technology: Many new electric vehicles featured fast-charging capabilities and innovative charging systems.
- Innovative Design: Automakers showcased unique and futuristic designs that blended elegance with aerodynamic optimization.
- Integration of Technology: Electric vehicles at CES 2025 also featured integration of technology, such as advanced operating systems, interactive displays, and automated driving systems.
- Increased Range: Many of the new electric vehicle models have batteries that provide increased range.
These developments highlight the continued progress in electric vehicle technology, with a focus on increased efficiency, range, and sustainability. The presentations at CES 2025 point towards a future where electric vehicles will be more practical, technologically advanced, and integrated into daily life.

CES 2025 Robotics Showcase

CES 2025 featured a wide array of advancements in robotics technology, showcasing innovations across various applications, from personal assistance to industrial automation.

Humanoid Robots
- RealBiotics unveiled Melody, a robot designed to resemble humans with advanced skin technology and AI integration, priced at $150,000. They also showcased Arya, a more affordable model at $75,000, with visual recognition and swappable faces.
- Engineered Arts presented Amecca, known for its realistic facial expressions and communication skills, priced at $175,000, as well as Alfred, a robot doctor capable of free conversation and understanding multiple languages.
- Forier Intelligence introduced the GR1, a Chinese robot designed for medical care and support, capable of moving, carrying loads, and interacting with people using AI.
- IntBot’s Nyo, designed for personal use, can express emotions and communicate in an informal style, using multimodal AI.
- NTT’s humanoid robot demonstrated the ability to perform physical tasks while simulating physical sensations, controlled by the language model Suzumi.
- Agility Robotics’ Digit is designed to work alongside humans for tasks such as lifting and transporting objects, and it can operate autonomously for up to 16 hours.
- Engine AI showcased three models: SE01 for industrial tasks, SA01 for research, and PM01, a dynamic robot with an open architecture.
- Booster Robotics’ T1 demonstrated impressive flexibility and agility, performing exercises and Kung Fu, and has the potential to replace athletes.
- Pollen Robotics’ I2, also called Richi2, is an open source humanoid robot designed for interaction with people and the environment and is capable of manipulating objects using advanced manipulators.
Other Notable Robots
- Mage presented a robot pianist capable of playing complex musical pieces with exceptional precision.
- Species Corporation introduced Kosaka Kakona, an emotional robotic mannequin with 37 movable joints, designed to add a kinetic dimension to displayed clothing.
- Tombot presented Jenny, a robotic puppy designed to support individuals with mental health disorders, acting like a real Labrador Retriever.
- Gai displayed the customizable universal robot model Mimo featuring AI, resembling a walk-in coffee table with a lamp.
- Open Droids unveiled R2D3, a home robot capable of managing appliances and preparing meals, adapting to the habits of its owners.
- Enchanted Tools’ Moroi is designed for homes, hospitals, and nursing homes, capable of navigating spaces and grasping objects, and is being tested at experimental sites and research labs.
- Stardust Intelligence showcased robotic baristas with high-precision manipulators.
- Sense Robot Chess is a chess-playing robot that adapts to the skill level of the player, offering various game modes.
- Kawasaki Heavy Industries presented Nyaki, an autonomous robot capable of opening doors and delivering drinks in hotels.
- Unry Robotics displayed the quadraped robot Unitree Go2 and the humanoid robot Unitree G1, highlighting their capabilities in movement and acrobatics.
- Deep Robotics showed a high-performance robotic dog with exceptional agility and terrain adaptability.
- Robo Rock announced the Saros Z70 robot vacuum, equipped with a robotic arm to pick up small objects, along with an updated navigation system.
Key Trends in Robotics Technology:
- Advanced AI Integration: Many robots incorporate AI for natural language processing, visual recognition, and adaptive learning.
- Human-Robot Interaction: There is a strong focus on creating robots that can interact naturally with humans, including understanding and expressing emotions.
- Diverse Applications: Robots are being developed for a wide range of uses, including healthcare, education, entertainment, industrial tasks, personal assistance, and companionship.
- Increased Realism: Many companies are focusing on making robots that look and feel more human.
- Enhanced Mobility: Robots are featuring various methods of mobility, from walking and running to wheels and spherical bases.
- Improved Functionality: Robots are being designed with an increasing number of sensors and better ways of manipulating their environments and interacting with humans.
These advancements in robotics technology at CES 2025 demonstrate the rapid progress in the field and the potential for robots to significantly impact various aspects of daily life and industry.

AI at CES 2025

CES 2025 showcased significant advancements in artificial intelligence (AI), which were integrated into various technologies, from robotics to vehicles.

Key AI Applications and Technologies at CES 2025:
- Robotics: Many robots at CES 2025 utilized advanced AI for diverse functions.
- RealBiotics’ Melody is integrated with AI to enhance human experiences through interaction, learning, and entertainment. It is also compatible with various AI platforms such as ChatGPT by OpenAI.
- Engineered Arts’ Amecca uses cutting-edge AI algorithms enabling it to engage in conversations and display a wide range of human emotions through meticulously crafted facial expressions. Alfred, also by Engineered Arts, can engage in free conversations thanks to a new AI-augmented system.
- Forier Intelligence’s GR1 uses AI to recognize faces, speech, and objects, enabling it to analyze its surroundings and interact with people in real time.
- IntBot’s Nyo uses multimodal language learning based on AI to express emotions and communicate in an informal style.
- NTT’s humanoid robot is controlled by the Next Generation language model Suzumi, which not only understands and processes commands but also performs physical tasks.
- Agility Robotics’ Digit is exploring the integration of large language models in AI, enabling it to program itself in response to verbal commands in natural language.
- Sense Robot Chess uses AI to analyze an opponent’s moves, adjust the difficulty, and provide helpful tips.
- Gai’s Mimo uses AI to ensure autonomy and the ability to move independently within a space.
- Language Models:
- NTT’s humanoid robot demonstrated the capabilities of the generative language model Suzumi.
- Agility Robotics is exploring the use of large language models to enable its robot Digit to program itself in response to verbal commands.
- Automated Driving Systems: Several electric vehicles featured Level Three automated driving systems that are capable of autonomous driving, braking and accelerating.
- Honda’s Z SUV and Zero Saloon will have level three automated driving systems.
- Personalization and Adaptation:
- Open Droids’ R2D3 adapts to the habits of its owners, learning from user preferences to perform tasks more efficiently over time.
- BMW introduced the updated panoramic ey Drive control system that displays interactive 3D content on the windshield creating a personalized interface.
- Emotional AI:
- IntBot’s Nyo uses AI to express emotions and communicate informally, even with sarcasm.
- Species Corporation’s Kosaka Kakona, a robotic mannequin, is designed to demonstrate smooth and graceful movements, adding a kinetic dimension to clothing displays.
- Data Analysis:
- Sense Robot Chess uses AI to analyze players’ moves, adjust difficulty, and give real-time tips.
- Multimodal AI:
- IntBot’s Nyo uses multimodal language learning based on artificial intelligence which allows Nyo to recognize and reproduce body language, facial expressions, eye contact and micro expressions.
Key Trends in AI at CES 2025:
- Integration with Robotics: AI is a key component in the functionality of many advanced robots, enabling them to interact with humans, adapt to their surroundings, and perform complex tasks.
- Natural Language Processing: There is a focus on AI that can understand and respond to human language, allowing for more intuitive and natural interactions.
- Emotional Intelligence: AI is being used to create robots that can understand and express emotions, enhancing their ability to interact with people on a personal level.
- Autonomous Operation: AI is being used to make systems such as robots and vehicles more autonomous.
- Personalized Experiences: AI is being utilized to create personalized experiences for users, whether it’s adapting to user preferences or creating customized interfaces.
The AI advancements at CES 2025 demonstrate how the technology is becoming more integrated into different aspects of our lives, enabling more sophisticated and personalized experiences, particularly in robotics, and vehicle technology, as well as in applications that require natural interaction with human language, as well as other aspects of human interaction such as emotion and personalization.

China’s New $2,000 Flying Shoes SHOCKED the World at CES 2025

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 16, 2025
Building a Chatbot with OpenAI
This tutorial teaches front-end web development using AI, specifically OpenAI’s API. The course covers building three applications: a movie pitch generator, a GPT-4 chatbot, and a fine-tuned customer support bot for a fictional drone delivery company. Key concepts explored include: prompt engineering, using different OpenAI models, handling API keys securely, and deploying to Netlify. The final project demonstrates fine-tuning a model with custom data to create a chatbot that answers company-specific questions accurately. The instructor emphasizes hands-on coding through numerous challenges.

AI Web Development Study Guide

Quiz

Instructions: Answer the following questions in 2-3 sentences each.
1. What is the primary purpose of the movie pitch app, and what technology does it use to generate movie ideas?
2. Explain the concept of “fine-tuning” in the context of chatbot development.
3. What is a token in the context of OpenAI, and how does the max_tokens property affect text generation?
4. Describe the difference between the zero-shot approach and the few-shot approach in prompt engineering.
5. Why is it important to separate the instruction, examples, and requests when using the few-shot approach in prompt engineering?
6. What is the purpose of the temperature property in the OpenAI API?
7. What is the purpose of using “presence penalty” and “frequency penalty” when working with chatbots, and how do they differ?
8. Why is a Google Firebase database useful for a chatbot application?
9. What does it mean to persist a chat conversation, and how does Firebase achieve this?
10. Explain the purpose of a serverless function, and why it’s important for deploying an application that uses an API with a secret key.
Quiz Answer Key
1. The movie pitch app turns a one-sentence movie idea into a full outline. It uses OpenAI to generate human-standard words and images, creating artwork, titles, synopses, and potential cast members from a single line of input.
2. Fine-tuning involves uploading a custom dataset to train a chatbot to answer specific questions from that data. This skill is essential for using chatbots in specific roles, such as customer service.
3. A token is a small chunk of text, roughly 75% of a word, used by OpenAI for processing. The max_tokens property limits the length of the text output, preventing the model from generating overly long responses.
4. The zero-shot approach uses a simple instruction without any examples to ask for what is needed, while the few-shot approach uses one or more examples to guide the AI in providing more accurate and specific responses.
5. Separating instructions, examples, and requests helps the AI understand that it’s dealing with different parts of the prompt. It allows the AI to recognize the context of the instruction, the expected output format based on examples, and what task it is being asked to complete, thereby improving accuracy.
6. The temperature property controls the randomness of the text output. A lower temperature results in more predictable, factual responses, while a higher temperature results in more creative and varied outputs.
7. Presence penalty encourages the model to talk about new topics by increasing the likelihood of talking about new ideas and concepts rather than staying on one subject, whereas frequency penalty discourages the model from using the same words or phrases repeatedly in a given text generation.
8. A Google Firebase database is useful for a chatbot application because it can store the user’s chat history, which enables the user to start and continue conversations even after the browser is refreshed or closed. This is done by storing the user interactions.
9. Persisting a chat conversation means saving the conversation so that it can be resumed later. Firebase achieves this by storing the conversation data in its database, allowing the application to retrieve and display the conversation when the user returns to the site.
10. A serverless function allows you to run code in a cloud environment without managing servers. It’s important for deploying applications using APIs with secret keys because it hides the API key on the backend, thus preventing it from being exposed in the front-end code.
Essay Questions

Instructions: Answer the following questions in essay format, referencing information from the provided source.
1. Discuss the evolution of prompt engineering techniques presented in the course, from basic instructions to incorporating examples, and explain how these techniques can improve the output of AI models.
2. Explain the significance of controlling token usage and temperature in AI text generation, and how these properties affect the quality and consistency of AI-generated content.
3. Compare and contrast the use of the create completion endpoint and the create chat completion endpoint in the context of AI chatbot development, and discuss the advantages of each approach.
4. Analyze the process of fine-tuning an AI model with custom data, and discuss the steps involved in preparing the data, uploading it to the API, and testing the resulting model.
5. Evaluate the importance of security measures, such as using serverless functions and environment variables, when deploying web applications that use AI APIs with sensitive information.
Glossary of Key Terms

API (Application Programming Interface): A set of protocols and tools for building software applications. It specifies how software components should interact.

Chatbot: A computer program that simulates conversation with human users, either through text or voice interactions.

Completion: The text generated by an AI model as a response to a given prompt.

Environment Variable: A variable with a name and value defined outside the source code of an application, often used to store sensitive information such as API keys.

Epoch: A complete pass through a dataset during training of a machine learning model. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.

Fetch Request: A method in JavaScript used to make HTTP requests to a server, such as retrieving data from an API.

Fine-Tuning: The process of training a pre-trained AI model on a specific dataset to tailor it to a particular task or domain.

Frequency Penalty: An OpenAI setting that reduces the likelihood of the model repeating the same words or phrases.

Few-Shot Approach: A prompt engineering technique that uses one or more examples in the prompt to guide the AI in generating the desired output.

Hallucination: When an AI model generates an incorrect or nonsensical output that may sound plausible.

JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.

JSON-L (JSON Lines): A format where each line is a valid JSON object, often used for storing datasets for machine learning.

Model: An algorithm that has been trained on data to perform a specific task, such as text generation.

Netlify: A web development platform that provides serverless hosting, continuous deployment, and other features.

OpenAI: An artificial intelligence research and deployment company, responsible for creating many large language models, including GPT-4.

Presence Penalty: An OpenAI setting that encourages the model to talk about new topics by reducing the chance of repeating similar subject matter.

Prompt: An input provided to an AI model to generate a response, often in text form.

Serverless Function: A function that executes in a cloud environment, allowing developers to run backend code without managing servers.

Stop Sequence: A sequence of characters in an AI prompt that signals to the model to stop generating text.

Temperature: An OpenAI setting that controls the randomness and creativity of the model’s output.

Token: A small chunk of text used by OpenAI, generally about 75% of a word, for processing and generating text.

Zero-Shot Approach: A prompt engineering technique that uses a simple instruction without any examples.

AI-Powered Web Development Projects

Okay, here is a detailed briefing document summarizing the main themes, ideas, and facts from the provided text.

Briefing Document: AI-Powered Web Development Projects

Overview:

This document summarizes a series of web development projects focused on integrating AI, specifically OpenAI’s models, into different applications. The projects progress from a movie pitch generator to a sophisticated chatbot with persistent storage and a fine-tuned customer service model. The primary focus is on practical application and prompt engineering, with a strong emphasis on understanding how different parameters influence AI responses.

Main Themes & Concepts:
- Leveraging OpenAI API: The core theme is using the OpenAI API to generate text and images for various purposes, including creative writing, question-answering, and image creation.
- Prompt Engineering: The course emphasizes crafting effective prompts to guide AI models towards desired outputs, experimenting with wording, and understanding the impact of examples on the quality and format of responses. Key techniques include:
- Zero-Shot Prompts: Simple instructions without examples.
- Few-Shot Prompts: Providing examples within the prompt to guide the model.
- Using separators: Triple hash marks to separate different parts of a prompt (instructions, examples, input)
- AI Models: The course explores several OpenAI models, highlighting their strengths:
- GPT-3.5 models (text-davinci-003): Good for long text generation and following instructions.
- GPT-4: The latest model, used for advanced chatbots and better contextual understanding.
- Codex models: Designed for generating computer code.
- Tokens and Max Tokens: Tokens are fundamental units of text processed by OpenAI, and max_tokens property controls the length of the generated text. “Roughly speaking, a token is about 75% of a word. So 100 tokens is about 75 words.”
- Temperature: Controls the randomness and creativity of the AI’s output; lower values are for more predictable, factual responses, higher values for more creative and varied outputs. “What temperature does is it controls how often the model outputs a less likely token… giving us some control over whether our completions are safe and predictable on the one hand or more creative and varied on the other hand.”
- Fine-Tuning: Training a model with a custom dataset to achieve specific and focused responses. This section demonstrates using a customer service dataset.
- Chatbot Specifics:
- Conversation Context: Maintaining a conversation history to provide context for subsequent questions.
- Avoiding Repetition: Using frequency_penalty and presence_penalty settings to control how much the chatbot repeats or stays on topic.
- presence_penalty is used to “increase the model’s likelihood of talking about new topics” while frequency_penalty is used to reduce the likelihood of the model “repeating the exact same phrases.”
- API Key Security: Implementing strategies for securely using API keys in front-end projects, such as storing them as environment variables and utilizing Netlify serverless functions to mask API keys during deployment.
- Database Persistence: Utilizing Google Firebase to store chatbot conversation data, allowing users to resume conversations after refreshing or reloading the page.
- Error Handling and User Experience: The projects include loading states, and messages to improve user experience, as well as debugging and error tracking through the console.
Project Highlights and Key Ideas:
- Movie Pitch Generator:Takes a one-sentence movie idea and expands it into a full outline, including title, synopsis, and potential cast.
- Demonstrates basic API interactions with OpenAI.
- Explores techniques to make the responses more detailed and relevant to user input.
- “Know It All” Chatbot:Utilizes the GPT-4 model for natural language conversation.
- Implements conversation persistence using Google Firebase.
- Emphasizes the need for chatbots to maintain context.
- Uses frequency_penalty and presence_penalty to control the chatbot’s output.
- Focuses on having a configurable personality using a system instruction.
- Fine-Tuned Chatbot:Uploads custom data (customer service interactions) to fine-tune a model for specific answers.
- Demonstrates the importance of data formatting, including the use of separators, spacing and stop sequences to format the prompts and completions correctly.
- Explores the concept of epochs, which determine how many times the model iterates through the training data. The text highlights the use of 16 epochs.
- Highlights the use of the OpenAI CLI to prepare the data and run the fine-tuning process in the terminal.
- Secure API Calls:Demonstrates masking the API keys by creating an endpoint via Netlify Functions and calling this endpoint via a fetch request instead of directly calling the OpenAI API from the front end.
- Explores the error that is triggered by a cross-origin request, showcasing that the Netlify serverless function endpoint is secured.
Key Quotes:
- “Studying is more fun and more productive when it’s done together. So, why not interact with fellow students on the Discord community, encourage each other and help each other along.” (Emphasizes collaborative learning).
- “What used to be science fiction is now science fact.” (Highlights the advanced nature of AI)
- “You only get back as much as you put in, so it’s giving us this very boring, generic reply.” (Highlights the importance of effective prompts)
- “An AI model is an algorithm that uses training data to recognize patterns and make predictions or decisions.” (Defines the nature of an AI model)
- “Roughly speaking, a token is about 75% of a word. So 100 tokens is about 75 words.” (Defines tokens)
- “What temperature does is it controls how often the model outputs a less likely token… giving us some control over whether our completions are safe and predictable on the one hand or more creative and varied on the other hand.” (Defines the function of the temperature property)
- “The AI makes up a linguistically plausible answer when it doesn’t know the right answer. And we’ll talk more about hallucinations later in this course.” (Introduces the idea of hallucination in AI)
- presence_penalty is used to “increase the model’s likelihood of talking about new topics” while frequency_penalty is used to reduce the likelihood of the model “repeating the exact same phrases.” (Defines presence and frequency penalties)
- “Each completion should end with a stop sequence to inform the model when the completion ends.” (Highlights the importance of the stop sequence).
- “when you’re working with APIs with secret keys… this solves the really big problem that we have when we’re using APIs with secret keys in front-end projects.” (Highlights the importance of keeping API keys secure).
Next Steps & Future Applications:
- The course encourages building upon these projects, experimenting with different prompts, models, and settings.
- Specific recommendations include:
- Creating more detailed character sketches with image generation.
- Tailoring apps to specific genres.
- Building more robust error handling.
- Fine-tuning models with much larger datasets for production use.
- Building apps with a very specific use case in mind.
- Adding error handling.
Conclusion:

These projects offer a comprehensive introduction to using AI for web development. By emphasizing hands-on experience with prompt engineering, API interactions, and model fine-tuning, this series lays a solid foundation for further exploration and innovation in AI-driven applications. The course also highlights the importance of security, persistence, and creating a good user experience.

Building AI Web Applications with OpenAI

Frequently Asked Questions: AI Development and OpenAI
- What is the main focus of the projects being developed in this course?
- The course focuses on building AI-powered web applications using OpenAI’s large language models (LLMs). These projects include a movie pitch app that generates movie outlines from a single sentence idea, an “Ask Me Anything” chatbot named Know It All, and a customer service chatbot fine-tuned with specific data. These projects emphasize creative use of language models, user interaction, and data persistence. The course also addresses real-world scenarios, like hiding API keys and deploying projects.
- What are the prerequisites for this course?
- The primary prerequisite is a reasonable knowledge of vanilla JavaScript. A basic understanding of fetch requests is also beneficial, but the course will review and explain these concepts step-by-step. The focus will be on the AI aspects of the projects, rather than complicated JavaScript programming.
- How does the movie pitch app work, and what technologies are used?
- The movie pitch app takes a one-sentence movie idea as input and leverages OpenAI’s models to generate a full movie outline, including a title, artwork, a list of stars, and a synopsis. It uses the OpenAI API, and concepts like crafting prompts, tokens, and model training through examples are all covered in the course to build this application. It also demonstrates how to handle asynchronous requests and updates to the user interface using JavaScript.
- What are the different types of AI models mentioned in the course, and which are used?
- The course discusses different types of OpenAI models including:
- GPT-3, GPT-3.5, and GPT-4 models: These are designed for understanding and generating natural language, as well as computer languages. GPT-4 is the latest model and is used for the Know It All chatbot, while text DaVinci 003 (a GPT-3.5 model) is used for other projects.
- Codex models: These models are specifically designed to generate computer code. The course uses the text-davinci-003 model initially, and later upgrades to GPT-4. They emphasize that GPT-3.5 Turbo model can also be used as a substitute for GPT-4.
- What is a token in the context of OpenAI, and how does max_tokens affect a completion?
- In OpenAI, text is broken down into chunks called tokens, with one token being roughly 75% of a word. The max_tokens property controls the maximum length of the text generated by the AI model. It is particularly important to set this value to have control of how much the AI completes, and failure to set this property can cut off responses or cause inconsistent behaviors. The default limit is 16 tokens with the older text-davinci-003 model, and the course recommends setting a higher number.
- What is the few-shot approach to prompt engineering, and why is it useful?
- The few-shot approach involves providing one or more examples of the desired output directly within the prompt to guide the AI model’s generation. By including examples, you can significantly improve the relevance, format, and quality of the AI’s responses. This is compared to the zero-shot approach, where only instructions are given, which often leads to poor quality output for complex requests. The examples are often separated with triple hashtags or triple inverted commas.
- How is data persistence achieved in the Know It All chatbot, and how can the chat be reset?
- The Know It All chatbot uses Google Firebase to store the conversation history, allowing users to continue their chat even after refreshing or reloading the browser. A reset button is implemented, which clears the database and restarts the conversation from the beginning. The course reviews methods for importing the Firebase dependencies, establishing references to the Firebase database, and writing and deleting data to persist and reset chat sessions.
- What is fine-tuning, and what steps are involved in creating a fine-tuned model?
- Fine-tuning involves training a pre-existing large language model with a specific dataset, to get more targeted responses. The course uses a CSV formatted dataset that contains prompt-completion pairs to fine tune a customer service bot. The steps involved in fine-tuning a model include setting up a command-line interface (CLI) with Python, preparing the data using OpenAI’s data preparation tool (which will convert it into JSONL format), and using the CLI to upload and train the model on the prepared data. Also, the course addresses the concept of epochs and using the CLI to increase the epochs when creating a fine-tuned model, as well as setting the presence and frequency penalty to reduce repetition in output. Finally, the course addresses hiding the API key in the deployed project using Netlify environment variables and using serverless functions for making calls to the API to hide these keys.
Movie Pitch App: OpenAI API Integration

The Movie Pitch app is designed to generate creative movie ideas using the OpenAI API. Here’s a breakdown of its key features and development process:
- Core Functionality: The app takes a one-sentence movie idea from the user and, using the power of OpenAI, generates a full movie outline, including:
- A title
- A synopsis
- Artwork for the cover
- A list of stars
- Technology Used: The app utilizes the OpenAI API and various models including the text DaVinci 003. It also incorporates HTML, CSS, and JavaScript.
- Development Process:Initial Setup: The app starts with a basic HTML structure, including a text area for user input and designated areas for displaying the AI-generated content.
- API Integration: The app uses fetch requests to communicate with the OpenAI API, sending prompts and receiving responses.
- Prompt Engineering: The course emphasizes the importance of crafting effective prompts to guide the AI’s responses. This involves:
- Understanding how to use tokens
- Tweaking prompts to get desired results
- Using examples to train the model
- Using a zero-shot approach, where a simple instruction is given
- Moving to a few-shot approach by adding one or more examples to the prompt
- Using separators to distinguish instructions and examples
- Using techniques to control the length of the output such as specifying the number of words or using max tokens
- Personalized Responses: The app is designed to provide personalized responses based on the user’s input.
- Text Extraction: The app extracts the names of actors from the generated synopsis.
- Image Generation: The app also utilizes the OpenAI API to generate images based on the movie concept. This involves converting the synopsis and title into a suitable image prompt.
- Key Concepts:AI Models: The course introduces different OpenAI models, including GPT-3, GPT-3.5, and GPT-4, as well as Codex models. It explains that these models are algorithms that use training data to recognize patterns and make decisions or predictions.
- Temperature: The course also covers the concept of temperature, a property used to control the creativity and predictability of AI completions.
- Tokens: The course explains how the OpenAI API uses tokens and how they affect the length and cost of API requests.
- Deployment Considerations:The course discusses the importance of securing API keys when deploying front-end projects. It uses Netlify to safely store the API key on a server.
- Potential Improvements:The course suggests that the code could be refactored to improve reusability, and to focus more on AI and less on Javascript.
- The course also suggests exploring the idea of having the AI generate a script for the movie
- The course also suggests tailoring the app to a specific genre
- Warnings:
- The course emphasizes that while developing locally the API key is visible on the front end and anyone could steal the API key.
- The course suggests not sharing the project with the API key or publishing it to GitHub without ignoring the API key because that will compromise the API key.
In summary, the Movie Pitch app is an interactive project that demonstrates how to use the OpenAI API to generate creative movie concepts. It introduces core concepts in AI and prompt engineering and highlights best practices in building and deploying AI-powered applications.

OpenAI API Guide

The OpenAI API is a central component in building AI-powered applications, as demonstrated in the Movie Pitch app. Here’s a breakdown of key aspects of the OpenAI API as discussed in the sources:
- API Key: To use the OpenAI API, you need an API key, which can be obtained by signing up on the OpenAI website. The API key needs to be kept secret, and the sources caution against sharing it or publishing it without taking precautions to protect it.
- Endpoints: The OpenAI API has different endpoints for different tasks.
- Completions Endpoint: This endpoint is used to generate text based on a prompt. It is central to the API. The API takes a prompt and sends back a “completion” that fulfills the request.
- Chat Completions Endpoint: This endpoint is designed for chatbot applications and is used with models like GPT-4 and GPT 3.5 Turbo.
- Create Image Endpoint: This endpoint is used to generate images based on text prompts.
- Models:
- OpenAI has various models geared toward different tasks.
- GPT Models: GPT-3, GPT-3.5, and GPT-4 are used for understanding and generating natural language and can also generate computer languages. GPT-4 is the newest and most advanced model.
- Codex Models: These models are specifically designed to generate computer code.
- The models vary in terms of complexity, speed, cost, and the length of the output they provide.
- The sources suggest starting with the best model available and then downgrading to save on time and cost where possible.
- Fine-tuned models can be created using a custom dataset.
- Prompts:
- A prompt is a request for the OpenAI API. Prompts can be simple or complex.
- Prompt engineering is a key skill when working with the OpenAI API. It involves crafting effective prompts to guide the AI’s responses.
- The sources describe three approaches to prompt design:
- Zero-shot approach: This involves giving a simple instruction or asking a question.
- Few-shot approach: This involves adding one or more examples to the prompt to help the AI understand what is required.
- Using separators like triple hashes (###) or triple inverted commas to separate instructions and examples within a prompt.
- Good prompt design is key to controlling the length of the output and ensuring the text from OpenAI is of the desired length.
- Tokens:OpenAI breaks down chunks of text into tokens for processing.
- A token is roughly 75% of a word.
- The number of tokens used impacts the cost and processing time of API requests.
- The max tokens property can be used to limit the length of the completion. If not set, the model defaults to a low number, which may cause the text to be cut short.
- Temperature:The temperature setting controls how often the model outputs a less likely token.
- It can be used to control how creative and varied a completion is.
- Usage and Cost:
- OpenAI provides some free credit when you sign up, but after that, it uses a pay-as-you-go model.
- The cost of using the API depends on the model, the number of tokens, and the number of images generated.
- Authentication: The API requires authentication via the API key in the header of the request.
- Security: The API key should be kept secret. It is important not to expose it on the front end when deploying applications. The sources suggest using a serverless function to hide the API key from the front end code.
In summary, the OpenAI API is a versatile tool for building a wide range of AI-powered applications. It offers different models, endpoints, and configuration options to perform tasks like text generation, image creation, and creating chatbots. Understanding how to use tokens, craft effective prompts, and secure API keys are crucial for working with the OpenAI API.

Building Chatbots with the OpenAI API

Creating a chatbot using the OpenAI API involves several key steps, from setting up the API to fine-tuning the model. Here’s a breakdown of the process, based on the sources:
- API Setup: The process begins with setting up the OpenAI API, which involves obtaining an API key and understanding the different endpoints.
- For chatbots, the Chat Completions endpoint is used. This endpoint is designed to handle conversational exchanges.
- The API key should be kept secure and not exposed on the front end.
- Model Selection: The choice of model is crucial for a chatbot’s performance.
- GPT-4 is the most advanced model at the time of recording and is well-suited for chatbot applications.
- GPT-3.5 Turbo is also a very capable model that can be used as an alternative when access to GPT-4 is limited.
- The models vary in terms of their ability to generate human-like text, their cost, and their speed.
- Conversation Handling:
- Chatbots require a memory of past interactions to maintain context and provide coherent responses.
- Unlike the text DaVinci 003 model, the models used with the Chat Completions endpoint do not have a memory of past completions.
- To maintain context, the entire conversation history must be sent with each API request.
- The conversation is stored in an array of objects, where each object represents a message in the conversation.
- The first object in the array is an instruction that tells the chatbot how to behave. This object has a role key with a value of system and a content key with a string containing the instruction.
- Subsequent objects store the user’s input and the API’s responses. These objects have a role key with either a value of user or assistant and a content key with a string containing the message.
- API Requests:
- API requests are sent to the Chat Completions endpoint with the createChatCompletion method, along with a messages property holding the conversation array.
- The API response is then added to the conversation array to maintain context for the next request.
- The API request also needs to specify a model property.
- Chatbot Personality:
- A chatbot’s personality can be customized through the instruction object at the beginning of the conversation array.
- This object can be used to tell the chatbot to be sarcastic, funny, practical or any other personality.
- It can also be used to control the length of the responses or simplify the language.
- Response Handling:
- The chatbot’s response from the API needs to be rendered to the DOM and added to the conversation array.
- The response from the API will include the role and the content.
- Presence and Frequency Penalties:
- Presence penalty can be used to control how likely a chatbot is to talk about new topics.
- Frequency penalty can be used to control how repetitive the chatbot is in its choice of words and phrases.
- The sources suggest not going over one and not going under zero for either setting.
- Data Persistence:To make the conversation persistent, a database can be used to store the conversation array.
- The sources use Google Firebase for this purpose.
- The conversation is stored in the database and is loaded into the app when the page loads.
- The user can reset the conversation using a button that removes the data from the database and clears the display.
- Fine-TuningChatbots can be fine-tuned with a custom dataset to answer specific questions about a company.
- A fine-tuned model is trained on a dataset that is prepared in JSONL format.
- The data set includes prompts and completions and is prepared using the OpenAI CLI tool.
- When using a fine-tuned model, the Completions endpoint and createCompletion method is used. The API request should also have a prompt property rather than the messages property used by models such as GPT-4.
- When working with a fine-tuned model it is important to use a stop sequence and to end the prompt with a separator. The sources used a space and an arrow (->) as a separator and a new line character (\n) as a stop sequence.
- The temperature setting can be used to control how creative and varied the completions are. If factual answers are desired it should be set to 0.
In summary, creating a chatbot involves using the OpenAI API, selecting the appropriate model, managing conversation context, and handling responses. Additional steps such as fine-tuning and data persistence can be added to enhance the bot’s capabilities.

Fine-Tuning AI Models

Fine-tuning AI models is a way to customize them for specific tasks and datasets, as discussed in the sources. Here’s a breakdown of key concepts related to fine-tuning:
- Purpose of Fine-tuning:
- General-purpose AI models, like those trained by OpenAI, are trained on publicly available data. While this works well for general tasks such as Q&A or translation, it isn’t ideal for tasks that require specific information.
- Fine-tuning is used to address the limitations of general models by providing them with a custom dataset. This allows them to answer questions specific to a company or domain.
- Fine-tuning enables models to provide accurate responses and avoid generating incorrect answers, also called “hallucinations”.
- Data Preparation:
- High-quality, vetted data is essential for effective fine-tuning. The data should be relevant to the specific task for which the model is being fine-tuned.
- The sources recommend at least a few hundred examples, and possibly thousands, for optimal results.
- Data is formatted as pairs of prompts and completions.
- The data should be formatted as JSON-L, where each line is a valid JSON object.
- OpenAI’s data preparation tool can be used to convert data from CSV to JSON-L format.
- The tool adds a separator to the end of each prompt, a whitespace to the beginning of each completion, and a stop sequence to the end of each completion.
- Fine-tuning Process:
- The fine-tuning process is initiated using the OpenAI command-line interface (CLI) tool.
- The CLI tool takes the training data file and a base model as inputs.
- The base model is the starting point, and the model is customized using the training data.
- The sources used the DaVinci model as a base model for fine-tuning.
- The fine-tuning process takes time, ranging from minutes to hours.
- The CLI tool uses a command like openai fine_tunes.create -t <TRAINING_FILE> -m <BASE_MODEL>.
- Epochs:
- Epochs refers to the number of times the model cycles through the training data.
- The default number of epochs is four, which might be sufficient for larger datasets but not for smaller ones.
- The number of epochs can be specified in the fine-tuning command using the flag –n_epochs <NUMBER_OF_EPOCHS>. For smaller datasets, the sources recommend using 16 epochs for improved results.
- Using a Fine-Tuned Model:
- After fine-tuning, a unique model ID is provided.
- The fine-tuned model can then be used in an application. The sources show how a chatbot was customized by using a fine-tuned model.
- Fine-tuned models use the Completions endpoint and the createCompletion method.
- The API request should have a prompt property rather than a messages property.
- It is also important to use a stop sequence to prevent the bot from continuing the conversation on its own. The sources used a new line character (\n) as a stop sequence and a space and an arrow (->) as a separator.
- Benefits of Fine-Tuning:
- Fine-tuning allows the model to provide accurate and specific responses tailored to the training dataset.
- It can improve a model’s ability to understand context and nuance.
- Fine-tuning is useful when it is important for an AI model to be able to say “I don’t know” rather than make up an answer.
- Fine-tuning can enable the model to avoid generating incorrect answers or “hallucinations”.
In summary, fine-tuning involves preparing a custom dataset, training a model on this data, and using the new model in an application. Fine-tuning enables the AI model to give more specific and accurate responses than it could have given without fine-tuning.

Securing OpenAI API Keys

API key security is a crucial aspect of working with services like OpenAI, as highlighted in the sources. Here’s a breakdown of the key points related to API key security:
- Risk of Exposure: API keys should be kept secret because they provide access to the associated service. If an API key is exposed, unauthorized individuals could potentially use the service, leading to unexpected charges or other misuse.
- API keys can be exposed if they are included directly in front-end code.
- When developing locally, the API key may be visible in the code, but this is acceptable for local development.
- Sharing a project with an API key or publishing to GitHub without hiding the API key will compromise the API key.
- Hiding API Keys: To prevent API key exposure, it’s important to keep them out of the client-side code. The sources recommend the following strategies for hiding API keys:
- Server-Side Storage: API keys should be stored on a server, rather than on the front end. This ensures that they are not visible to users.
- Environment Variables: API keys can be stored in environment variables on a server. This prevents them from being directly included in the code.
- When using Netlify, environment variables can be set in the site settings.
- Serverless Functions: Serverless functions can be used as an intermediary between the front end and the API. The serverless function can have access to the API key, while the front end does not.
- The serverless function makes the API call and returns the data to the front end, without exposing the API key.
- Best Practices:
- API keys should be treated like passwords and kept confidential.
- It is important to avoid sharing API keys or publishing them to public repositories.
- When working with API keys, it’s important to be mindful of what you’re doing and to ensure that the keys are not being shared inadvertently.
- API keys should only be stored in secure locations.
- When using an API key on a front-end project, it’s vital to take steps to hide it before sharing the project.
- Consequences of Exposure:
- If an API key is exposed, unauthorized users could potentially use it, which could result in unexpected charges.
- Compromised API keys can be used for malicious purposes.
- If an API key is lost, it is best to delete it and create a new one.
- Netlify Specific Security:
- When using Netlify, a serverless function will only accept requests from its own domain, so other domains cannot make fetch requests to that serverless function.
In summary, API key security is paramount when working with APIs. Storing API keys on a server, using environment variables, and utilizing serverless functions are effective strategies for hiding API keys and preventing unauthorized access.

Build AI Apps with ChatGPT, DALL-E, and GPT-4 – Full Course for Beginners

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 15, 2025
Bridging the Sim-to-Real Gap in AI Robotics
This podcast features an interview with Peggy Wong, a robotics engineer and CTO of a Y Combinator-funded startup. Wong discusses the convergence of AI and robotics, highlighting advancements in affordable hardware and AI models that enable more generalized robotic tasks. She shares her experiences at Stanford and internships at companies like Lyft and Oculus, emphasizing the increasing capabilities of AI agents in various fields. The conversation also explores the potential of AI-powered characters in video games, creating more immersive and personalized gaming experiences, and the future of self-driving cars and VR technology. Finally, Wong offers advice for aspiring robotics engineers and suggests valuable resources for staying updated in the field.

Robotics and AI: A Deep Dive with Peggy Wong

Quiz

Instructions: Answer the following questions in 2-3 sentences each, based on the source material.
1. According to Peggy Wong, what is narrowing the gap between simulations and reality in the context of robotics and AI?
2. What does Peggy identify as a significant benefit of being able to run AI models locally, rather than relying on cloud providers?
3. Why does Peggy think younger generations are becoming “AI natives,” and how will this impact future AI adoption?
4. What are two significant factors that make personal robots more accessible and affordable to consumers, according to Peggy?
5. How much time does Peggy estimate that a generalized household robot could save a person per week, and what specific tasks does she mention as examples?
6. Why does Peggy think humanoid form factor is important for household robots?
7. What are two major breakthroughs that have accelerated advancements in robotics?
8. How did Peggy’s background in Milwaukee influence her interest in robotics and coding?
9. What did Peggy learn from her internships at Lyft Level 5 and Oculus?
10. What is Ego’s vision for infinite games, and how does AI play a crucial role?
Quiz Answer Key
1. Peggy Wong states that the gap between simulations and reality is closing due to advancements in graphics, fidelity, and the ability to control robots in 3D simulations. When robots can be trained in simulations that closely mimic reality, they can then generalize that training to real-world tasks more effectively.
2. Running AI models locally, using personal hardware like Nvidia GPUs, can significantly reduce costs by eliminating per-token charges associated with cloud providers. This enables more accessible AI applications for the average person, who can pay a one-time fee to run AI models on their computers.
3. Younger generations are growing up using AI tools like ChatGPT for schoolwork, making them “AI natives” who are more comfortable with and open to AI adoption. This early exposure means they’ll likely continue to use AI in college and the workforce, leading to a more widespread acceptance of AI in the future.
4. The increasing affordability of robots is due to decreased hardware costs from advances in GPU technology and more cost-effective manufacturing. This enables personal robots with local AI processing capabilities to become more affordable for consumers.
5. Peggy believes a household robot could save anywhere from 2 to 10 hours per week, particularly by handling tasks like laundry, washing dishes, and cooking. She expresses how the robot would eliminate the time spent physically attending to these tasks.
6. Peggy thinks that the humanoid form factor is crucial because humans can do a wide variety of tasks in diverse scenarios due to their design. She suggests humanoid robots are therefore more adaptable and general-purpose for performing many common household chores and tasks.
7. Peggy identifies software/AI advancements as a major breakthrough, specifically generalized AI models that can do a variety of tasks. Another key factor she identifies is decreased hardware costs, which increases access and innovation in the space.
8. Growing up in Milwaukee, a manufacturing town, surrounded Peggy with stories of robotics in surgery, healthcare, and car manufacturing. This exposure, coupled with her school’s robotics program, sparked her interest in robotics and coding.
9. At Lyft Level 5, Peggy worked on behavior planning for self-driving cars, developing algorithms that controlled how cars move in various traffic scenarios. At Oculus, she worked on depth-sensing algorithms to create ground-truth depth maps that enhance the user experience.
10. Ego envisions infinite games as immersive, personalized virtual worlds that offer endless gameplay. AI is essential to this vision, enabling the generation of unique content, behaviors, scenarios, and human-like agents or NPCs that make those worlds dynamic and engaging based on player preferences.
Essay Questions

Instructions: Respond to the following questions with a well-developed essay that addresses the topic and uses information from the source material.
1. Discuss the potential impact of AI-powered humanoid robots on daily life, using examples from Peggy’s comments on household chores, time-saving, and cost reduction. To what degree is this vision aligned with, or divergent from, the vision of robotics in popular media?
2. Analyze Peggy Wong’s career trajectory and educational background. Discuss how her experiences, including both her early childhood, academic work, and internship experiences, shaped her goals for her startup, Ego.
3. Critically evaluate the role of AI and machine learning in the advancement of robotics and VR/AR technology. Refer to specific examples from Peggy’s discussion of self-driving cars, depth sensing, and AI agents in video games.
4. Considering Peggy’s perspective on accessibility of AI and robotics, discuss how affordability of hardware and software affect innovation in the field. How could increased accessibility broaden participation and advance the state of technology?
5. Examine Peggy Wong’s remarks on the convergence of the physical and digital worlds in her work and what are the ethical implications. Focus on the rise of human-like AI agents in video games and the potential of simulations that blur the line between reality and virtual experiences.
Glossary of Key Terms
- AI (Artificial Intelligence): The development of computer systems that can perform tasks typically requiring human intelligence, such as learning, problem-solving, and decision-making.
- AR (Augmented Reality): A technology that overlays digital content onto the real world, often through devices like smartphones or specialized headsets.
- VR (Virtual Reality): A technology that creates immersive, simulated experiences, often through headsets, which can transport users into entirely digital environments.
- Humanoid Robot: A robot with a body shape resembling a human, designed to perform tasks similar to those of a human.
- GPU (Graphics Processing Unit): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device.
- LLM (Large Language Model): A type of artificial intelligence model that has been trained on a large amount of text data to understand, generate, and interact with human-like language.
- NPC (Non-Player Character): A character in a game or virtual environment that is controlled by the computer, not by a human player.
- Y Combinator: A startup accelerator that provides seed funding, mentorship, and networking opportunities to early-stage technology companies.
- AGI (Artificial General Intelligence): A hypothetical type of AI that can perform any intellectual task that a human being can.
- Sim-to-Real Gap: The discrepancy between simulations and real-world situations, often in the context of robotics and AI, where models trained in simulations may not perform as well in real life due to unexpected variables.
- Embodied AI Agent: An AI agent that has a physical body or virtual representation and can interact with the world in a physical or virtual space.
- Infinite Game: A game that, through dynamic world and character generation, offers endless gameplay without a defined end, adapting to the player’s actions and interests.
- Digital Native: A person who has grown up with digital technology and is comfortable using it.
- AI Native: A person who has grown up with AI technology and is comfortable using and interacting with it.
- Indie Game Development: The creation of video games by small teams, individuals, or companies that operate independently of large publishers.
- Procedural Generation: The automated creation of content, such as environments or characters, using algorithms rather than manual design.
- Uncanny Valley: A sense of unease or discomfort experienced when a human-like figure (a robot, animation, or AI character) looks or behaves almost, but not exactly, like a real human.
- Sensor Fusion: Combining data from multiple sensors (like cameras, lidar, radar, or sonar) to provide a more accurate and robust understanding of an environment.
- Behavior Planning: The process of determining a sequence of actions for a robot or AI agent to achieve a goal, such as navigating a complex environment.
AI, Robotics, and the Future: A Conversation with Peggy Wong

Okay, here is a detailed briefing document summarizing the key themes and ideas from the provided podcast transcript featuring Peggy Wong, CTO and Robotics Engineer:

Briefing Document: Free Code Camp Podcast Interview with Peggy Wong

Introduction:

This document summarizes the key themes and ideas discussed in the Free Code Camp podcast interview with Peggy Wong, CTO and robotics engineer. The interview covers her personal journey into coding and robotics, her experiences at Stanford, her work in self-driving cars and augmented reality, and her current focus on AI agents in gaming. The discussion also delves into the potential future of AI, robotics, and their impact on daily life.

Key Themes and Ideas:
1. The Closing Gap Between Simulation and Reality:
- Peggy emphasizes that advancements in graphics and AI are rapidly closing the gap between simulations and real-world applications.
- Quote: “I think the gap between simulations and reality is getting closer and closer… like Graphics level like Fidelity like all of that um I actually think that the the Sim toore Gap is is closing…”
- This is especially relevant for robotics. Training agents in robust simulations can lead to real-world robots with generalized task abilities within a few years.
- Quote: “if you are able to like build and rig up um basically all the controls that a robot is in like a 3D video game or 3D simulation… you can actually like that Gap this simulation to reality Gap that Sim the real Gap is actually like pretty close and you should be able to generalize that to the robot in like you know a couple years.”
1. AI as a Key Enabler for Robotics:
- Peggy believes that the biggest advancements in robotics are coming from advancements in AI.
- She highlights the shift from specific AI models (e.g., object recognition) to general models (LLMs) as a parallel to the shift from specialized robots to generalized humanoid robots.
- Quote: “the advancements in AI is like actually like one of the biggest unlocks for robotics…”
1. The Impact of Lower Hardware Costs:
- The decreasing cost of GPUs (e.g., Nvidia digits) and robotic hardware is making AI and robotics more accessible for average consumers.
- Quote: ” the cost of Hardware has like decreased… Jensen hang is is at the Forefront of this with Nvidia… he like is making these gpus like better are faster cheaper and that is allowing a lot of like new ability to train these like large AI models…”
- This allows for the possibility of personal robots for household tasks, and local AI processing.
1. The Potential of Generalized Humanoid Robots:
- Peggy envisions a future where a single generalized robot can perform a variety of household tasks like laundry, cooking, and cleaning.
- She notes the advantage of a humanoid form factor as our living spaces are designed for humans, and these robots could potentially replace multiple specialized appliances.
- Quote: “if you have a humanoid robot they’re able to kind of uh almost combine combine them and like be able to do a little bit of everything right.”
1. AI Native Generation:
- Younger generations are adopting AI tools (like ChatGPT) early in their education, making them “AI natives” who will drive further adoption and innovation in the future.
- Quote: “one of the neat things about AI adoption is that the people who like start using it and are I guess like instead of like digital natives they’re now like AI natives they’re all like younger kids they’re all students they all like use like AI to help them finish their homework assignments…”
1. Personalized Gaming with AI Agents:
- Peggy’s startup, Ego, is focused on creating “infinite games” where AI agents generate worlds and scenarios based on players’ preferences, potentially offering highly tailored experiences.
- These human-like AI agents can build relationships with players, act as companions, coaches, or even adversaries, blurring the lines between human and AI interactions in games.
- Quote: “We’re building s like you know infinite games s are online where people like you can like essentially play like any game that you’re interested in and because the world and the agent will just like generate that for you while you’re playing it…”
1. The Importance of Personal Motivation and Story:
- Peggy emphasizes the importance of personal motivation and passion in addition to academic achievements when applying to selective institutions.
- Quote: ” I talked a lot about my passion for for robotics… it’s more about kind of like that story you tell and like what motivates you in addition to like all those like high test scores that are almost like a a baseline necessity.”
- She also suggests that parents should nurture their children’s interests.
1. The State of Self-Driving Cars:
- Peggy believes that the technology for autonomous driving is already good and that the main challenges lie in engineering, productionization and public acceptance.
- She estimates that fully autonomous driving could be a reality in 3-5 years.
- She believes that Large Language Models will help address edge cases.
1. The Future of VR and Immersive Experiences:
- Peggy speculates that the future of VR hinges on developing ultra high resolution and fast refresh rate to create truly immersive environments.
- She suggests that some of these technologies may already exist but not in a form that is affordable.
1. Ethical Considerations:
- Peggy acknowledged the need for transparency when AI agents are used to simulate humans. There should be some form of disclosure to ensure consent.
- She suggests that the human actors behind AI and their political motivations may be as important of a concern as the technology itself.
Quotes highlighting Peggy’s personal journey:
- “I’ve been working in robotics since high school and then I’ve been working on AI since you know um freshman year of college and so this is like really my my life’s passion…”
- “I was born in China… I moved out here uh to Milwaukee Wisconsin when I was about 2 years old…”
- “I learned how to code um basically on my high school robotics team and… I came across your free free coding Camp…”
Conclusion:

Peggy Wong’s interview highlights the transformative potential of AI and robotics, driven by both technological advancements and decreasing costs. Her work at Ego demonstrates a vision of a personalized and immersive future enabled by these tools. Her story underscores the value of passion, hard work, and a willingness to explore emerging technologies. The interview also touches upon the ethical considerations that will need to be carefully considered as these technologies become increasingly prevalent.

The Future of AI and Robotics

What is driving the current advancements in robotics and AI?

The most significant advancements in robotics are being driven by progress in AI, particularly the development of generalized AI models that can perform a variety of tasks, rather than just one specific one. This parallels the shift in AI from specific, task-oriented machine learning models to large language models. Additionally, the decrease in the cost of hardware, especially GPUs and manufacturing robotics, is making AI and robotics more accessible.

How close are we to having generalized robots for everyday tasks?

The gap between simulation and reality is closing rapidly. It’s becoming increasingly feasible to train AI agents within 3D simulations (like video games) and then transfer that knowledge to physical robots in a short period of time. This combined with more affordable hardware, could lead to generalized robots that can perform a variety of household tasks within the next few years, perhaps even within this decade.

How has AI been utilized in recent technologies?

AI is being applied in many ways, most prominently in large language models like ChatGPT, which are being used by students for tasks like homework assignments. AI is also being integrated into robots, which are becoming more affordable and capable. Moreover, AI is used for depth sensing in augmented reality/virtual reality headsets and for avatar face tracking for AR.

What are the benefits of having personal robots in the home?

Personal robots could potentially save anywhere from 2 to 10 hours a week by doing chores like laundry, dishes, and cooking. They could also make healthier meals possible, and they could replace multiple appliances, saving space and maintenance costs. The ultimate goal is to create a robot that is a generalist and can do pretty much any task, versus very specific specialized machines.

Why is the humanoid form factor potentially important for these robots?

A humanoid robot is important because our homes and environments are designed with the human form factor in mind. It is also important because humans themselves are generalists capable of doing many tasks, which would require a generalist robot. This would help create a single robot that could replace multiple appliances and specialized machines. This idea is similar to the iPhone, which combined many single-purpose devices into one versatile tool.

How are games and AI becoming intertwined?

Games are becoming more realistic, often serving as effective simulation environments to train AI agents, especially for robotics and self-driving cars. AI is also being used in games to create human-like agents that interact realistically with players, potentially leading to “infinite games” that are procedurally generated by AI.

What is the vision of ego, the company that you founded?

Ego aims to build humanlike AI agents in games, ultimately creating “infinite games” where the game world and characters are generated on the fly based on the player’s preferences and interests. These agents would behave as humans would and can build relationships with the human players. This could eventually lead to personalized virtual worlds for each user, and the company has also explored the possibility of offering the software they develop to other game developers.

What advice would you give to students interested in getting into AI or Robotics?

Students should focus on computer science and AI training, in addition to robotics specifically. The AI and robotics fields will likely grow tremendously within this decade, and are both accessible for individuals interested in pursuing them. It is important to develop a passion for a specific topic, and also important to tell a compelling story about your interests, and why you want to pursue this area.

AI Applications Across Industries

AI is being applied in a variety of ways, including in robotics, gaming, and self-driving cars.

Here are some of the ways that AI is being applied, according to the sources:
- Robotics: AI is being used to create robots that can perform generalized tasks, such as helping with household chores like laundry and dishes. AI is also being used to develop robots with more human-like capabilities. The development of AI is seen as one of the biggest unlocks for advancements in robotics.
- Gaming: AI is being used to create human-like agents in games that can behave like real humans and can provide a variety of new applications, including serving as non-player characters (NPCs), coaching players, playing with players, and play testing games. AI can also be used to generate games based on a user’s personal interests.
- Self-driving cars: AI is used in self-driving cars to enable the car to make decisions about how to drive in different scenarios. AI is used in the perception systems of self-driving cars to help them understand the environment around them. The technology for self-driving cars is improving rapidly and could be available in the next 3-5 years.
- Augmented and Virtual Reality (AR/VR): AI is used in AR/VR to perform tasks like depth sensing, facial tracking for avatars, and 3D reconstruction. AI is also used to create more immersive experiences in VR environments.
- Personal Use: AI is being used by students to help them complete homework assignments. AI models can be run locally, decreasing costs, with the help of personal hardware like Nvidia GPUs.
The sources also note that the cost of hardware for AI and robotics is decreasing, which is making these technologies more accessible to average consumers. The increasing accessibility of AI could lead to a variety of new applications in the future.

The Rise of General-Purpose Robots

Robotics is experiencing significant advancements, largely due to progress in AI and decreasing hardware costs. Here’s a breakdown of key developments:
- AI as a Catalyst: The development of AI is considered one of the biggest unlocks for advancements in robotics. This is because AI provides the “brains” for robots, enabling them to perform complex and generalized tasks, rather than just specific, pre-programmed actions. This shift mirrors the progress in AI from specific models to general models like Large Language Models.
- Generalized Robots: There’s a growing focus on creating generalized robots that can perform a variety of tasks, rather than specialized robots designed for just one function. For example, the goal is to have one robot that can do laundry, dishes, and other household chores. This is a departure from traditional industrial robots that perform the same task repeatedly.
- Humanoid Robots: There is an increasing interest in humanoid robots due to their versatility and compatibility with human-designed environments. The human form factor is ideal because it can handle a wide range of tasks that humans do in their daily lives, like walking, playing sports, and using tools.
- Cost Reduction: The cost of hardware for robots is decreasing. This is due to advancements in manufacturing and the development of cheaper components. The decrease in costs makes personal robots more affordable for the average consumer. For example, the development of Nvidia GPUs has made it cheaper to train large AI models.
- Simulation and Training: The gap between simulations and reality is closing, allowing robots to be trained in 3D simulations and then generalized to the real world. This means that robots can be trained in virtual environments, like video games, before being deployed in the real world.
- Applications:
- Household chores: Robots are being developed to assist with tasks like laundry, dishwashing, and cooking.
- Manufacturing: Robots are becoming more efficient at manufacturing.
- Gaming: AI-powered robots with bodies are being developed for games, where they act as human-like agents.
The advancements in AI are not only making robots smarter, but also making them more adaptable and capable of performing more human-like tasks. The decreasing costs and advancements in hardware are making robots more accessible, paving the way for their integration into daily life.

Bridging the Sim-to-Real Gap in Robotics

The “sim-to-real gap” refers to the difference between simulations and real-world environments, and how well an agent trained in a simulation can perform in reality. According to the sources, this gap is closing, especially in the field of robotics.

Here’s a breakdown of the sim-to-real gap:
- Closing the Gap: The sources indicate that the sim-to-real gap is getting smaller, as simulations become more realistic. For example, video games like GTA and other AAA games are becoming more like reality, with higher graphics fidelity.
- Training in Simulation: Robots can now be trained in 3D video game or simulation environments. By creating simulations that are close to real-world conditions, AI agents can be trained to handle various scenarios.
- Generalization: Once an AI agent is trained in a simulation, it can then be generalized to a real robot. This means that the robot should be able to perform the tasks it learned in the simulation in the real world.
- Applications in Robotics: The ability to train robots in simulations is particularly useful for robotics. AI can be trained in simulations and then applied to robots to do things like household tasks. This is important because it allows for a safe and cost-effective way to develop and test AI agents before they are deployed in the real world.
- Implications for the Future: The closing of the sim-to-real gap implies that robots may soon be able to operate in real-world environments with minimal additional training, allowing for the development of versatile and adaptable robots.
In summary, the decreasing differences between simulations and reality is enabling robots to be trained in virtual environments and then applied to the real world. This could make the development of robots faster and more efficient, and bring the vision of generalized robots performing a variety of tasks closer to reality.

Humanoid Robots: Capabilities and Future Impact

Humanoid robots are a significant area of focus in the field of robotics, with the goal of creating robots that resemble humans in form and function. Here’s a breakdown of their key aspects, according to the sources:
- Form Factor: The emphasis on humanoid form is important because humans are able to do a wide variety of tasks in different scenarios. The design of our environments, such as houses and apartments, are also designed with the human form factor in mind. This makes humanoid robots more versatile than specialized robots designed for one task.
- General Purpose: The goal of humanoid robotics is to create generalist robots that can perform many different tasks, as opposed to specialized robots that can only do one thing. This mirrors the way that humans are able to perform a variety of tasks in their personal lives.
- Capabilities: Humanoid robots are designed to be able to manipulate objects in the physical space with their arms. The sources suggest that in the future, humanoid robots could potentially do a range of household tasks, such as laundry, dishes, and cooking.
- AI Integration: The advancements in AI are a major driver for the progress in humanoid robotics. AI provides the “brains” that allow humanoid robots to perform a variety of tasks. The development of general AI models is also allowing for the development of generalist robots.
- Cost Reduction: The cost of humanoid robots is decreasing, making them more accessible to consumers. In the past, humanoid robots cost tens or hundreds of thousands of dollars, but now some cost only a few thousand dollars.
- Relationship to Other Tech: The concept of a general-purpose humanoid robot is similar to the way smartphones became a general-purpose tool, combining the functions of many different devices. The idea is that a humanoid robot could combine the functions of many different appliances, such as washing machines, dishwashers, and ovens.
- Simulations: Humanoid robots can be trained in 3D simulations and video games before being deployed in the real world. This closes the simulation to reality gap and makes training the robots more efficient.
- Applications: In addition to household chores, humanoid robots could also be used in games, where they could serve as non-player characters (NPCs). The sources suggest that these robots could play games with users, or act as coaches.
- Future Impact: The development of humanoid robots could lead to a significant increase in leisure time for humans. By automating household tasks, humanoid robots could save people several hours per week, which could be used for other activities.
In summary, humanoid robots are a rapidly developing field that has the potential to significantly change our lives. Advancements in AI and decreasing hardware costs are making these robots more capable and accessible. The ultimate goal is to create a robot that can perform a wide range of tasks and interact with the world in a human-like way.

AI Agents: Applications and Ethics

AI agents are a central focus of current advancements in AI, particularly in the fields of robotics and gaming. Here’s a breakdown of their key characteristics and applications, based on the sources:
- Definition: AI agents are essentially AI-powered entities that can perceive their environment, make decisions, and take actions to achieve a specific goal. In essence, they are AI given a form factor.
- Embodiment: AI agents can be embodied in a physical robot, enabling them to interact with the real world. They can also be given a virtual “body” in a 3D space like a game, enabling interaction in that environment.
- Human-like Behavior: A key focus of AI agent development is creating agents that can behave like humans. This involves not only performing tasks but also exhibiting emotions and building relationships. The source suggests that AI agents can be so human-like that they can be mistaken for real people, which can raise ethical concerns.
- Applications in Gaming: AI agents are being used in gaming to create more dynamic and engaging experiences.
- Non-Player Characters (NPCs): AI agents can act as NPCs in games. These characters can interact with players and adapt to their actions, making the game feel more alive.
- Coaches and Companions: AI agents can serve as coaches or companions, helping players improve their skills or providing company when friends are not online.
- Playtesting: AI agents can be used to playtest games, finding bugs and providing feedback to developers.
- Game Generation: AI agents can also generate games based on a user’s personal interests.
- Applications in Robotics: AI agents are critical for developing generalized robots that can perform many different tasks. They can help robots perceive their environment, make decisions, and take actions.
- Household Tasks: AI agents can be incorporated into robots that can assist with household chores.
- General Purpose: AI agents are also being developed for general purpose humanoid robots.
- Training: AI agents can be trained using simulations and virtual environments, which can then be applied to robots in the real world. This helps close the sim-to-real gap.
- Role in Infinite Games: AI agents are central to the concept of “infinite games,” which are games that can be played forever, where the world and characters adapt based on the player’s interests.
- Cost and Accessibility: The cost of AI agents is decreasing, due to advances in hardware and manufacturing. This makes them more accessible for a variety of applications, from gaming to robotics.
In summary, AI agents are transforming both virtual and physical environments. Their ability to learn, adapt, and interact in human-like ways makes them a key technology for future advancements in gaming, robotics, and other fields. The ethical concerns surrounding AI agents, particularly those that are so human-like they could be mistaken for real people, are important to consider.

From freeCodeCamp to CTO with Robotics Engineer Peggy Wang [Podcast #159]

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 14, 2025
AI Engineering: From Math to Generative AI
This text outlines a comprehensive roadmap for becoming a world-class AI engineer in 2025. Key areas of study include essential mathematics (linear algebra, calculus, statistics), data science principles, traditional machine learning, and deep learning. The roadmap emphasizes bridging the gap between theoretical knowledge and practical application, particularly focusing on generative AI and large language models. It also highlights the importance of Python programming and the ethical considerations within AI development. Finally, the text promotes a boot camp offered by Lunar Tech as a means to acquire these skills.

AI Engineering Study Guide

Quiz
1. What is the core function of AI engineering, and how does it relate to data science and machine learning?
2. AI engineering focuses on the design, building, and deployment of AI systems to solve real-world problems. It bridges the gap between data science, which develops models, and practical application by making models work reliably in real-world settings.
3. Name three industries where AI engineering is having a significant impact and give a specific example in each.
4. Healthcare, where AI is used to analyze medical images; finance, for fraud detection and algorithmic trading; and retail/e-commerce, for personalized recommendations and inventory management.
5. What is the role of mathematics in becoming a world-class AI engineer?
6. Mathematics provides the fundamental understanding needed to work with both traditional machine learning and cutting-edge AI. This includes topics from high school math, linear algebra, and calculus, which are critical for understanding model optimization and algorithms.
7. Why is a solid understanding of statistics essential for AI engineers?
8. Statistics are important for data analysis and understanding data, especially for data modeling. It helps with understanding probabilities, distribution, inferential statistics and performing hypothesis testing.
9. What is the importance of having data science skills for an AI engineer?
10. Data science skills are essential for AI Engineers because they enable them to clean, source, pre-process, and analyze data. This also includes identifying missing data, recognizing anomalies, performing normalization, and conducting exploratory data analysis, all of which improve model performance.
11. Briefly define traditional machine learning and provide 2-3 examples of algorithms that fall under this category.
12. Traditional machine learning involves using algorithms to learn from data and make predictions. Algorithms include linear regression, logistic regression, decision trees, or clustering algorithms like K-means.
13. How does deep learning differ from traditional machine learning, and what is the basic architecture of neural networks?
14. Deep learning uses more complex neural networks that can learn from larger amounts of data, unlike traditional machine learning models. A neural network consists of layers of interconnected neurons, including input, hidden, and output layers, along with activation functions and backpropagation.
15. Why is Python considered an important tool for AI Engineers and what is its role?
16. Python is important because it offers libraries, such as PyTorch and TensorFlow, that are used for AI and data science tasks. These libraries allow AI Engineers to create and implement machine learning and deep learning models.
17. What are the major elements in generative AI models such as GANs, Variational Autoencoders and Transformer Models?
18. Generative AI models include GANs (Generative Adversarial Networks), which use generators and discriminators; Variational Autoencoders, which learn probability distributions; and Transformer models, which use attention mechanisms and form the backbone of large language models.
19. What role do Large Language Models play in current AI technology?
20. Large Language Models such as the GPT family, Llama, and others are driving major advancements in current AI technologies. They use Transformer architecture and they are used in chat interfaces and various other applications through pre-training, fine-tuning and prompt engineering.
Quiz Answer Key
1. AI engineering focuses on the design, building, and deployment of AI systems to solve real-world problems. It bridges the gap between data science, which develops models, and practical application by making models work reliably in real-world settings.
2. Healthcare, where AI is used to analyze medical images; finance, for fraud detection and algorithmic trading; and retail/e-commerce, for personalized recommendations and inventory management.
3. Mathematics provides the fundamental understanding needed to work with both traditional machine learning and cutting-edge AI. This includes topics from high school math, linear algebra, and calculus, which are critical for understanding model optimization and algorithms.
4. Statistics are important for data analysis and understanding data, especially for data modeling. It helps with understanding probabilities, distribution, inferential statistics and performing hypothesis testing.
5. Data science skills are essential for AI Engineers because they enable them to clean, source, pre-process, and analyze data. This also includes identifying missing data, recognizing anomalies, performing normalization, and conducting exploratory data analysis, all of which improve model performance.
6. Traditional machine learning involves using algorithms to learn from data and make predictions. Algorithms include linear regression, logistic regression, decision trees, or clustering algorithms like K-means.
7. Deep learning uses more complex neural networks that can learn from larger amounts of data, unlike traditional machine learning models. A neural network consists of layers of interconnected neurons, including input, hidden, and output layers, along with activation functions and backpropagation.
8. Python is important because it offers libraries, such as PyTorch and TensorFlow, that are used for AI and data science tasks. These libraries allow AI Engineers to create and implement machine learning and deep learning models.
9. Generative AI models include GANs (Generative Adversarial Networks), which use generators and discriminators; Variational Autoencoders, which learn probability distributions; and Transformer models, which use attention mechanisms and form the backbone of large language models.
10. Large Language Models such as the GPT family, Llama, and others are driving major advancements in current AI technologies. They use Transformer architecture and they are used in chat interfaces and various other applications through pre-training, fine-tuning and prompt engineering.
Essay Questions
1. Discuss the ethical considerations that AI engineers must be aware of, including specific examples of how these issues can manifest in real-world applications.
2. Explain the significance of both traditional machine learning and deep learning techniques for AI engineers, and provide scenarios where each would be most appropriate.
3. Describe the end-to-end process an AI engineer might follow in a typical project, from the initial problem definition to the deployment and maintenance of a solution.
4. Analyze the role of mathematics and statistics in AI engineering, explaining how specific concepts underpin the development and improvement of AI models.
5. Assess the current trends and future directions of generative AI, emphasizing its potential impact on different industries and the skills needed for success in this field.
Glossary of Key Terms
- AI Engineering: The practice of designing, building, and deploying AI systems to solve real-world problems. It integrates software engineering, machine learning, and data science.
- Data Science: A field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data.
- Machine Learning (ML): A type of artificial intelligence that allows computer systems to learn from data without explicit programming.
- Deep Learning (DL): A subset of machine learning that utilizes neural networks with multiple layers (deep neural networks) to analyze data.
- Neural Networks: A computational model inspired by the structure and function of the human brain. It consists of interconnected nodes (neurons) organized in layers.
- Linear Algebra: A branch of mathematics concerned with vector spaces and linear mappings between these spaces. It’s crucial for understanding AI concepts like matrices, vectors, and transformations.
- Calculus: A branch of mathematics focused on continuous change, dealing with concepts like derivatives, integrals, and gradients. It’s essential for optimizing AI models.
- Statistics: A branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data, involving concepts like probabilities, distribution, inferential statistics and hypothesis testing.
- Data Pre-processing: The process of preparing raw data for use in machine learning models. This includes cleaning, normalization, and feature engineering.
- Feature Engineering: The process of creating new variables from existing data to improve the performance of machine learning models.
- Supervised Learning: A machine learning approach where models learn from labeled training data, where input data is paired with corresponding outputs.
- Unsupervised Learning: A machine learning approach where models learn from unlabeled data to identify patterns or clusters.
- Classification: A machine learning task where models assign data points to predefined categories.
- Regression: A machine learning task where models predict continuous numerical values.
- Generative AI: AI models that can generate new data similar to their training data, including images, text, and other forms of content.
- Generative Adversarial Networks (GANs): A type of generative model consisting of two neural networks, a generator and a discriminator, that compete with each other to produce new data.
- Variational Autoencoders (VAEs): A type of generative model that learns a probabilistic latent space representation of input data.
- Transformer Models: A neural network architecture that uses attention mechanisms to process input data, especially sequential data like text. They form the basis for many Large Language Models.
- Large Language Models (LLMs): AI models trained on vast amounts of text data that can understand, generate, and interact with human language.
- Pre-training: Training a model on a large, general dataset to learn foundational representations.
- Fine-tuning: Training a pre-trained model on a specific dataset and task to adapt it for a particular application.
- Prompt Engineering: The process of designing input prompts for Large Language Models to elicit desired responses.
- Reinforcement Learning with Human Feedback (RLHF): A technique used to improve the performance of AI models by training them based on human preferences.
- Tokenization: The process of breaking down text into individual tokens to feed into a model.
- Embedding: A vector representation of input elements such as words, sentences or paragraphs.
- Attention Mechanism: The part of the Transformer that allows the model to prioritize different input parts during processing.
- Bias: A tendency in a model or algorithm towards an opinion or result, due to issues with the input or the design of the system.
- Overfitting: A situation where a model learns the training data too well, leading to poor generalization on new, unseen data.
AI Engineering Roadmap 2025

Okay, here is a detailed briefing document based on the provided text, outlining the main themes and important ideas, and including relevant quotes:

Briefing Document: AI Engineering Roadmap 2025

Overview

This document summarizes key information from a presentation outlining a roadmap for becoming a successful AI engineer in 2025. The presentation, delivered by D. Vasan of LunarTech, emphasizes a comprehensive approach, covering foundational mathematics through advanced AI implementations including Large Language Models (LLMs). The core message is that AI engineering is a critical and in-demand field requiring both theoretical knowledge and practical skills, enabling professionals to bridge the gap between research and real-world application.

Main Themes
1. Definition and Scope of AI Engineering:
- AI engineering is the practice of designing, building, and deploying AI systems to solve real-world problems. It is not just about creating models but also about making those models functional, reliable, and valuable.
- It’s at the intersection of software engineering, machine learning, and data science. The presentation highlights that, “AI engineering is this practice of designing building and deploying AI systems that solve real world problems. It sits in this intersection of software engineering machine learning and data science…”
- AI engineers take models developed by data scientists and ensure they are integrated into systems, run reliably, and deliver actionable insights. “the data scientists often focus on analyzing data or predicting something or developing models AI Engineers take these models and make them work in the real world settings and with much more advanced models they create systems that process data make decisions and deliver actionable insights…”
- AI engineers work with advanced models such as deep learning and neural networks, and the emphasis is on practical problem-solving not just academic knowledge. “it’s not just about building models it’s about making sure that those models actually solve problems and deliver value for the business or this public Enterprise and that’s why AI engineering is such a critical role in today’s Tech ecosystem it’s where this Cutting Edge research meets the Practical industry impactful implementation…”
1. Impact of AI Engineering Across Industries:
- AI engineering is transforming numerous industries. Examples include:
- Healthcare: Analyzing medical images, predicting patient outcomes, and assisting in drug discovery.
- Finance: Fraud detection, algorithmic trading, real-time data processing.
- Retail/E-commerce: Personalized recommendations, price optimization, inventory management.
- Entertainment: Personalized content recommendations on streaming platforms, new content creation tools.
- Autonomous Vehicles: Navigation, object detection, and decision-making systems.
- The growing demand and high salaries in the field highlight the career potential, “those are highly competitive just 40 ENT roll they start around 80 up to 120k at least for the midlevel engineers this is uh 120k to 180k in us and where senior roles this can take all the way from 200 up to 750k in the US dollar”
1. Essential Skill Sets for AI Engineers:
- The presentation breaks down necessary skills into several categories:
- Mathematics:High school math, linear algebra, calculus, and elements of game theory are needed.
- Specifically, understanding vectors, matrices, derivatives, integrals, and the concept of Nash equilibrium. Linear Algebra is critical: “you must understand linear algebra so when it comes to linear algebra let me tell you specifically what I mean not the entire linear algebra but really to understand the norm of a vector this understanding of vector and matrices…”.
- Emphasis is on selected topics from different fields and levels, not necessarily super-advanced concepts. “…not the entire universe of mathematics or the super advanced stuff but really the fundamentals and um these are selected topics from different uh levels…”
- Statistics:Understanding probabilities, distributions (PDFs, CDFs), samples, random variables, and statistical measures.
- Also, concepts like hypothesis testing, confidence intervals, and linear regression. “first up of course understanding this concept of probabilities to know what the probabilities are what is its concept uh why it is used for this concept of probability distribution functions the PDFs the cumulative distribution functions or the cdfs…”
- Data Science Skills:Collecting, cleaning, preprocessing, visualizing, and feature engineering data.
- Ensuring data is relevant, unbiased, and of good quality. “as an AI engineer you will need to understand how to clean data how to Source data how to collect it if you don’t have an AI engineer next to you and also how to pre-process data…”
- Traditional Machine Learning:Understanding classification, regression, supervised/unsupervised learning algorithms such as linear regression, logistic regression, decision trees, and various ensemble methods.
- Model evaluation metrics, training/testing/validation cycles and resampling methods. “what I mean by traditional machine learning I mean to um understand this concept of classification regression supervised learning unsupervised learning these different algorithms that fall under these categories like uh linear regression logistic regression decision trees…”
- Deep Learning:Understanding neural network architecture, activation functions, forward and backward passes, optimization algorithms, different types of layers, and their applications.
- Knowing concepts like the vanishing gradient problem, batch normalization, and various deep learning model architectures (CNNs, RNNs, LSTMs, GANs). “you need to understand how the Deep learning differs from the traditional machine learning you need to understand the architecture of neural networks…”
- Programming (Python):Proficiency in Python, especially with libraries relevant for data science, machine learning, and deep learning (e.g., PyTorch, TensorFlow).
- Understanding data structures, algorithms, and the practical implementation of ML/DL models. “and my suggestion would be to learn next the python to understand how you can um uh create uh lists variables how you can load data different sorts of data… training a machine learning model training um deep learning model how to make use of uh pytorch which is a deep learning framework in python as well as tanor flow…”
- Generative AI & LLMsUnderstanding different models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers.
- Deep knowledge of the transformer architecture and large language model pre-training, fine-tuning, and reinforcement learning.
- Practical skills in prompt engineering, evaluating, and optimizing large language models. “First up you need to understand the AI foundations and you need to understand um where you can apply generative AI before you get into the theoretical part so understanding also the moral development cycle when it comes to generative Ai and training techniques will be really important…”
- AI Ethics:Understanding ethical principles, bias in AI, privacy, data security, and relevant regulations.
1. Step-by-Step Learning Process:
- The presentation advocates for a structured approach to learning, starting with fundamental mathematics and statistics.
- Moving on to data science, then traditional machine learning, followed by deep learning, python and lastly large language models and generative AI.
1. Emphasis on Practical Application:
- The roadmap emphasizes bridging the gap between theory and real-world application.
- The focus is on solving problems and creating valuable solutions rather than just academic knowledge. “AI engineering is all about solving real problems not just the theoretical knowledge being able to understand all the theory the foundational knowledge along with the implementation of each of these different topics ICS in the reality will be really important for you to become a job ready professional”
- The importance of not just understanding the models but being able to create new models, algorithms, and work at companies leading the AI innovation is highlighted.
Key Takeaways
- AI engineering is a dynamic field requiring diverse skill sets.
- A structured learning approach is essential to master the complexities of AI.
- Practical experience and project-based learning are crucial for becoming job-ready.
- AI engineers must be aware of ethical implications and ensure responsible AI practices.
- The career prospects for well-trained AI engineers are excellent, with high demand and salaries.
This briefing document provides a comprehensive summary of the AI Engineering roadmap, highlighting the critical areas of focus and practical steps to become a successful AI engineer in the evolving landscape of technology.

AI Engineering in 2025: A Comprehensive Guide

Frequently Asked Questions about Becoming an AI Engineer in 2025
1. What exactly is AI Engineering, and how does it differ from Data Science? AI engineering is the practice of designing, building, and deploying AI systems to solve real-world problems. It’s an intersection of software engineering, machine learning, and data science. While data scientists primarily focus on analyzing data, developing models, and making predictions, AI engineers take these models and make them work reliably and efficiently in real-world settings. They ensure that models are scalable, can handle different conditions, and deliver actionable insights. AI engineers also often work with more advanced models like deep learning models and neural networks. Essentially, AI engineering is the bridge between AI research and practical, impactful implementation.
2. In what industries are AI engineers making a significant impact? AI engineering is having a transformative impact across numerous industries, including:
- Healthcare: Developing systems for analyzing medical images, predicting patient outcomes, and assisting in drug discovery and patient care.
- Finance: Creating real-time systems for fraud detection and algorithmic trading that can handle sensitive financial data securely.
- Retail and E-commerce: Designing algorithms for personalized recommendations, dynamic pricing, and inventory management.
- Entertainment: Building systems for personalized content recommendations and developing generative AI tools for content creation.
- Autonomous Vehicles: Developing algorithms and hardware integrations for safe and reliable navigation, object detection, and decision-making.
1. This is not an exhaustive list but highlights the wide applications of AI engineering across different sectors.
2. What are the essential “must-have” skills for aspiring AI engineers? To become a proficient AI engineer, you need a diverse skill set that includes:
- Mathematics: A solid understanding of topics such as high school math, linear algebra (vectors, matrices, linear transformations), calculus (derivatives, integrals, optimization), and game theory (Nash equilibrium).
- Statistics: Key statistical concepts including probability, probability distribution functions, sampling, random variables, measures of central tendency, variance, correlation, hypothesis testing, bias theorem, confidence intervals, and statistical significance.
- Data Science: Skills to clean, source, collect, and pre-process data, including handling missing data, anomalies and outliers, normalization, filtering, and grouping. Also crucial is exploratory data analysis and feature engineering.
- Traditional Machine Learning: A thorough understanding of algorithms for classification, regression, supervised, and unsupervised learning. This includes the mathematics and statistics behind them and when to use which model. Also important is to know how to evaluate a model and be familiar with training, testing and validation cycles, as well as evaluation metrics.
- Deep Learning: Knowledge of neural network architecture, forward and backward pass, backpropagation, loss functions, optimization algorithms, and the ability to evaluate model performance. Familiarity with different neural network architectures such as CNNs, RNNs, GNNs, GRUs and LSTMs is also important.
- Programming (Python): Fluency in Python and its libraries for data science (Seaborn, Matplotlib) and machine learning (PyTorch, TensorFlow). Knowledge of data structures, algorithms and the ability to implement machine learning and deep learning models in Python is essential.
- Generative AI: A strong understanding of foundational models including generative adversarial networks (GANs), variational autoencoders, and transformers. Also important is understanding the cycle of pre-training, fine-tuning, prompt engineering, and reinforcement learning in generative AI models. Finally, being familiar with and using tools like huggingface to be able to make better use of open source models.
1. Why is mathematics so foundational to AI engineering? What specific areas should I focus on? Mathematics is crucial because it underlies the core mechanisms of AI, from traditional machine learning algorithms to cutting-edge deep learning models. The key areas include:
- High School Mathematics: Basic algebra, equations, geometry, and trigonometry are a foundation.
- Linear Algebra: Understanding vectors, matrices, Cartesian coordinates, dot products, linear systems, and matrix factorization.
- Calculus: Knowing derivatives, integrals (including double integrals), gradients, and their use in optimization.
- Game Theory: Basic understanding of Nash equilibrium.
1. Why is statistics important for AI engineers, and what specific statistical topics are key?
2. Statistics is essential for AI engineers to understand data and develop effective models. Key topics include:
- Probability: Basic concept of probability, probability distribution functions, and cumulative distribution functions.
- Basic Statistics: The mean, median, variance, standard deviation, mode, covariance and correlation and how to calculate them.
- Sampling: Understanding the difference between a sample and a population and what it means to have a representative sample.
- Probability distributions: Understanding probability distribution functions including normal, binomial and Bernoulli.
- Hypothesis testing: The need for hypothesis testing, the concept of null and alternative hypotheses, type one and type two error and the use of statistical tests.
- Inferential statistics: Concepts like the central limit theorem and the law of large numbers.
1. Can you elaborate on what “traditional machine learning” means and why it is crucial to master? Traditional machine learning refers to the more established algorithms and methods used for tasks like classification, regression, and clustering, using models like linear regression, decision trees, support vector machines, and K-means. Mastering traditional machine learning is crucial for several reasons:
- Understanding fundamentals: It provides the essential understanding of the underlying principles that are used in more advanced deep learning models.
- Problem-solving: Not every problem requires a complex deep learning model. AI engineers should be able to select the appropriate solution by understanding the business problem and selecting the appropriate model which can often be a simple traditional machine learning model instead of a large and expensive deep learning one.
- Efficiently evaluate models: understanding the evaluation cycles as well as the proper evaluation metrics.
- Practical application: It allows you to approach real-world problems from a practical and efficient perspective without unnecessarily using computationally expensive approaches.
1. How do deep learning and generative AI fit into the AI engineering landscape? Deep learning is the bedrock of modern AI, enabling the development of generative AI.
- Deep Learning: Deep learning involves neural networks that can learn complex patterns from data. It’s essential for building models that can power various applications from computer vision to natural language processing and large language models. A deep understanding of neural networks, activation functions, optimization algorithms, and evaluation techniques is crucial.
- Generative AI: Generative AI builds on deep learning to create new content, such as text, images, and audio. This field includes models like GANs, variational autoencoders, and transformers, which are essential for creating tools like ChatGPT, DALL-E, and other cutting-edge AI applications. Knowing the concepts of pre-training, fine tuning, reinforcement learning and prompt engineering is also necessary.
1. What is the process for training large language models, and what do I need to master to call myself an expert in LLMs? Mastering large language models involves several key steps:
- Understanding language models: The basics of predicting the next word and the evolution of language models.
- Understanding key LLMs: Knowing the unique traits of LLMs such as gpts, llamas, falcon, and cloud sonnet.
- Knowing transformer architectures: Understanding the basic concepts of positional encoding, embeddings and multi headed attention mechanisms.
- Data Preparation: Understanding how to clean, process, and prepare data, as well as how to ingest the data into an AI model.
- Pre-training: The basic concept of mask language modeling and auto regressive language modeling.
- Fine-tuning: Understanding how to fine-tune on single and multi-task scenarios and the various methods such as parameter efficient fine tuning.
- Reinforcement Learning with Human Feedback (RLHF): Understanding why it’s used to make models smarter.
- Prompt Engineering: The best practices for creating effective and optimized prompts.
- Retrieval Augmented Generation (RAG): Knowledge of RAG systems and how to combine vector databases, fine-tuning, and agentic RAGs.
- Evaluation and Optimization: Being able to evaluate large language models by knowing various benchmarks, quantization, knowledge distillation, and using Alm Ops to productionize an LLM.
- Ethics: Understanding the ethical implications of AI, bias in AI, privacy, data security and regulations.
1. Mastering these areas will enable you to build and utilize powerful LLM-based applications effectively.
AI Engineering: Skills, Applications, and Career Prospects

AI engineering is the practice of designing, building, and deploying AI systems to solve real-world problems. It combines software engineering, machine learning, and data science.

Here’s a breakdown of key aspects of AI engineering:
- Role in the AI Ecosystem: Data scientists focus on analyzing data, making predictions, and developing models, while AI engineers take these models and implement them in real-world settings. They ensure models work reliably under different conditions and deliver actionable insights.
- Scope: AI engineering is not limited to one field, and it is changing industries worldwide.
- Impact across Industries: AI engineering is impacting numerous industries, including healthcare, finance, retail and e-commerce, entertainment, and autonomous vehicles.
- In healthcare, AI engineers build systems for analyzing medical images, predicting patient outcomes, and assisting with drug discovery.
- In finance, they create secure, real-time systems for fraud detection and algorithmic trading.
- In retail and e-commerce, they design systems for personalized recommendations and optimized pricing.
- In entertainment, AI is used for content recommendations and generative tools.
- In autonomous vehicles, AI engineers design the algorithms and hardware integration for safe and reliable navigation.
Skills Needed to Become an AI Engineer:

The sources outline a roadmap for becoming an AI engineer, highlighting essential skills, which can be summarized as follows:
- Mathematics: A solid foundation in mathematics is essential. This includes:
- High school mathematics (basic algebra, geometry, and trigonometry)
- Linear algebra (vectors, matrices, linear transformations)
- Calculus (derivatives, integrals, optimization)
- Game theory (concepts like Nash equilibrium)
- Statistics: Understanding statistical concepts is crucial for data analysis and model building. Key topics include:
- Probability and probability distributions
- Descriptive statistics (mean, median, variance)
- Hypothesis testing and statistical significance
- Dimension reduction techniques
- Data Science Skills: AI engineers need strong data science skills to handle data effectively. This involves:
- Data cleaning and preprocessing
- Identifying and handling missing data and outliers
- Data visualization
- Feature engineering
- Traditional Machine Learning: A strong grasp of traditional machine learning algorithms is necessary. This includes:
- Understanding classification and regression problems
- Supervised and unsupervised learning algorithms
- Model evaluation techniques
- Deep Learning: Understanding deep learning is essential for modern AI. This involves:
- Neural network architectures and training
- Activation functions and optimization algorithms
- Understanding different types of neural networks such as CNNs and RNNs
- Programming: Proficiency in programming languages, particularly Python, is essential. This includes:
- Understanding basic data structures and algorithms in Python
- Using AI frameworks like PyTorch and TensorFlow
- Generative AI: Generative AI is a highly in-demand skill for AI engineers. This involves:
- Understanding foundational generative models like GANs and variational autoencoders
- Understanding Transformer models and their architecture
- Knowledge of large language models, including pre-training, fine-tuning, and prompt engineering
- AI Ethics: AI engineers need to be aware of the ethical implications of AI and ensure their models are created and used responsibly
- This involves understanding bias in AI, privacy, and data security
Additional Skills: * Understanding how to evaluate and optimize large language models using tools and techniques for benchmarking, quantization, and pruning * Knowledge of using tools like Langchain and Flask to productionalize AI models * Understanding the cycle of pre-training, fine-tuning, prompt engineering, reinforcement learning, evaluation, and optimization with large language models

Career Prospects:

AI engineering is a high-demand field with competitive salaries. Entry-level roles may start around $80,000 to $120,000 per year, while senior roles can reach up to $750,000 in the US.

In summary, AI engineering is a critical field that requires a blend of theoretical knowledge and practical implementation skills. It is a career that is both challenging and rewarding, with ample opportunities for innovation and impact across diverse industries.

Essential Skills for AI Engineers

AI engineering requires a diverse set of skills, combining theoretical knowledge with practical implementation. These skills span mathematics, statistics, data science, machine learning, deep learning, programming, and generative AI. Here’s a breakdown of the essential skills for an AI engineer:
- Mathematics: A strong foundation in math is crucial for understanding AI algorithms. This includes:
- High school mathematics, including algebra, geometry, and trigonometry.
- Linear algebra, which is essential for understanding both traditional machine learning and deep learning, involving vectors, matrices, and linear transformations. Key concepts include vector norms, matrix operations, and solving linear systems using matrices.
- Calculus, which is needed to understand gradients, derivatives, and optimization techniques. This involves understanding single and double integrals, and using derivatives and integrals in model optimization.
- Game theory, especially concepts like Nash equilibrium, which is important for understanding generative adversarial networks.
- Statistics: Understanding statistical concepts is essential for data analysis and model building. Key areas include:
- Probability and probability distributions.
- Descriptive statistics such as mean, median, variance, and standard deviation.
- Hypothesis testing and statistical significance.
- Understanding of sample versus population and the use of representative samples.
- Knowledge of probability distribution functions (PDFs) and cumulative distribution functions (CDFs), as well as common distributions like normal and binomial distributions.
- Understanding linear regression and ordinary least squares, and concepts like bias, consistency, and efficiency of parameters.
- Familiarity with confidence intervals and statistical tests such as the student T-test, F-test, and ANOVA test, along with the concept of the p-value.
- Knowledge of inferential statistics such as the central limit theorem.
- Dimension reduction techniques like Principal Component Analysis (PCA).
- Data Science Skills: AI engineers must be proficient in data handling. This involves:
- Data cleaning and preprocessing, including handling missing data and outliers, as well as data normalization.
- Data visualization, which helps in understanding data trends and identifying outliers.
- Feature engineering, which involves creating new variables from existing data to improve model performance.
- Traditional Machine Learning: A thorough understanding of machine learning is necessary. This includes:
- Understanding classification and regression problems, and supervised and unsupervised learning algorithms.
- Knowing how to use various algorithms like linear regression, logistic regression, decision trees, and ensemble methods like bagging, boosting, XGBoost, and LightGBM.
- Understanding model evaluation techniques, including training, testing, and validation cycles, and various evaluation metrics depending on the problem.
- Deep Learning: Knowledge of deep learning is essential for working with modern AI systems. This includes:
- Understanding the architecture of neural networks, neurons, perceptrons, activation functions, and hidden layers.
- Knowledge of the forward pass, backward pass, and backpropagation algorithm, as well as loss functions and optimization algorithms like gradient descent and its variants.
- Understanding concepts like vanishing and exploding gradient problems, and batch normalization.
- Understanding different types of neural networks like CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), GNNs (Graph Neural Networks), GRUs (Gated Recurrent Units), and LSTMs (Long Short-Term Memory networks).
- Understanding generative models, including autoencoders.
- Programming: Proficiency in a programming language, especially Python, is crucial. This includes:
- Understanding data structures and algorithms in Python.
- Using AI frameworks like PyTorch and TensorFlow.
- Ability to work with different types of data including images, text, and audio and also data visualization.
- Generative AI: This is a highly in-demand skill for AI engineers. This involves:
- Understanding foundational generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).
- Deep knowledge of Transformer models, including attention mechanisms, embeddings, and positional encodings.
- Understanding large language models (LLMs), including their pre-training, fine-tuning, and prompt engineering. Key areas include language models, engrams, encoder- and decoder-based architectures, tokenization, and embeddings.
- Knowledge of reinforcement learning with human feedback and how to apply this.
- Understanding how to prepare data for LLMs and use prompt templates and other structures effectively.
- Understanding retrieval augmented generation (RAG) systems and vector databases.
- AI Ethics: AI engineers need to understand ethical principles and regulations. This includes:
- Understanding the ethical considerations when using AI, including bias, privacy, and data security.
- Knowledge of AI regulations and governance, like the AI act from Europe and GDPR.
- Additional Skills:
- Understanding how to evaluate and optimize large language models using benchmarks, quantization, and pruning techniques.
- Knowledge of using tools like Langchain and Flask to deploy and productionize AI models.
- Understanding the full lifecycle of large language models, including pre-training, fine-tuning, prompt engineering, reinforcement learning, evaluation, and optimization.
Mastering these skills will enable an AI engineer to bridge the gap between research and practical application, solve real-world problems, and innovate within the field. The sources emphasize that it’s not just about theoretical knowledge but also about practical implementation and the ability to adapt to the rapidly changing landscape of AI.

Data Science for AI Engineers

Data science skills are a critical component of AI engineering, enabling AI engineers to effectively handle and prepare data for use in AI models. Without a solid understanding of data science principles, AI engineers cannot ensure that the data used to train models is of high quality, relevant, and unbiased.

Here’s a detailed breakdown of the essential data science skills for AI engineers, based on the sources:
- Data Cleaning and Preprocessing: This is the foundational step in any data science workflow. It involves:
- Identifying and handling missing data. This includes understanding the mechanisms behind missing data (e.g., missing at random) to decide whether to impute the missing data, drop it, or use other techniques to fill in the missing values.
- Identifying and handling outliers or anomalies in the data, using statistical and other techniques to either remove or adjust these values.
- Data normalization, which involves transforming data to a standard scale to improve model performance.
- Data Visualization: This involves using tools and techniques to visualize data, which is crucial for identifying patterns, trends, and outliers. This skill is essential to tell a story about the data and is a necessary step before model development. Tools like Python with libraries such as Seaborn and Matplotlib are often used for this purpose.
- Feature Engineering: This is the process of creating new variables or features from the existing data. It involves combining multiple variables to engineer a single more informative feature. This skill is important because the quality of features can significantly impact the performance of AI models.
- Data Preparation Cycle: AI engineers must be able to follow the full cycle of data preparation, data evaluation, and use the data as an input for machine learning, deep learning, or generative AI models. This requires being able to:
- Source and collect data, as AI engineers may need to collect data when not working with data scientists.
- Filter and group data to prepare it for modeling.
- Split data into training, testing, and validation sets.
- Ethical Considerations:
- AI engineers must also ensure that their data is unbiased, addressing ethical considerations when using data in models.
The sources emphasize that without data science skills, even the most advanced AI models are likely to perform poorly because of the “garbage in, garbage out” principle. Therefore, a solid grasp of data science is essential for any aspiring AI engineer.

Essential Machine Learning for AI Engineers

Machine learning is a crucial skill set for AI engineers, and it is essential to master traditional machine learning before moving on to more advanced topics like deep learning. A strong understanding of machine learning is needed to effectively solve real-world problems and to make informed decisions about the most suitable models for a given task.

Here’s a breakdown of essential aspects of machine learning for AI engineers, based on the sources:
- Fundamental Concepts: AI engineers need to understand the core concepts of machine learning, including:
- Classification and regression problems.
- Supervised learning, where models are trained on labeled data.
- Unsupervised learning, where models are trained on unlabeled data.
- Algorithms: AI engineers must be familiar with various machine learning algorithms, such as:
- Linear regression and logistic regression.
- Decision trees.
- Ensemble methods like bagging, boosting, XGBoost, and LightGBM.
- Unsupervised models like K-means clustering, hierarchical clustering, and DBSCAN.
- Model Selection: AI engineers should be able to quickly determine the type of problem they are addressing (classification, regression, or unsupervised learning) and select appropriate algorithms. This involves understanding the strengths and weaknesses of different models and their suitability for specific types of data. For example, some models are more stable when dealing with missing data, while others work better with data that follows a normal distribution.
- Model Evaluation: It is critical for AI engineers to understand how to evaluate machine learning models, including:
- Understanding the training, testing, and validation cycle.
- Knowing different sampling and resampling techniques like bootstrapping and cross-validation (k-fold and leave-one-out cross-validation).
- Selecting appropriate evaluation metrics based on the specific problem. For example, using mean absolute error or mean squared error for regression problems, and metrics like F1 score or F-beta score for classification problems. It is important to understand when to prioritize precision or recall when evaluating a model.
- Practical Considerations: AI engineers must also know when to apply machine learning versus rule-based approaches. This involves understanding the context of the problem and the trade-offs between different approaches.
The sources emphasize that understanding the mathematics and statistics behind these algorithms is as important as knowing how to use them. In addition, a deep understanding of traditional machine learning is necessary before moving on to deep learning and advanced AI topics. This foundational knowledge allows AI engineers to solve problems efficiently and to understand the implications of their modeling choices from a business and enterprise perspective.

Deep Learning for AI Engineers

Deep learning is a critical area of study for AI engineers, forming the basis of many modern artificial intelligence applications, especially generative AI. Deep learning can be considered a more advanced form of machine learning, where models learn better with larger amounts of data.

Here’s a breakdown of key aspects of deep learning for AI engineers:
- Core Concepts: AI engineers must understand the fundamental concepts of deep learning, including:
- How deep learning differs from traditional machine learning.
- The architecture of neural networks and how they function, including the concept of neurons and perceptrons.
- The role and types of activation functions, and how they affect neural network performance.
- The importance of hidden layers, input layers, and output layers in neural networks.
- Training Process: A thorough understanding of how neural networks are trained is crucial, including:
- The concept of forward pass and backward pass.
- The backpropagation algorithm and how it optimizes the network.
- The role of loss functions in evaluating the network’s performance.
- Different optimization algorithms, such as gradient descent, stochastic gradient descent, RMSprop, Momentum SGD, and Adam/AdamW.
- Challenges in Training: AI engineers must also understand and address common challenges in training neural networks:
- The vanishing and exploding gradient problems and techniques to mitigate them.
- Techniques to combat overfitting, such as dropout, L1 regularization, and L2 regularization.
- Advanced Techniques: AI engineers need to be familiar with advanced deep learning techniques:
- Batch normalization and layer normalization, and the differences between them.
- Residual connections.
- Gradient clipping and Xavier initialization.
- Mini-batch gradient descent and its advantages.
- Types of Neural Networks: A key part of deep learning is understanding different types of neural network architectures, including:
- ANNs (Artificial Neural Networks), as well as the difference between discriminative and generative models.
- CNNs (Convolutional Neural Networks), and their applications, such as computer vision.
- RNNs (Recurrent Neural Networks), GRUs (Gated Recurrent Units), and LSTMs (Long Short-Term Memory networks), understanding their differences, applications, and limitations.
- GNNs (Graph Neural Networks).
- Generative Adversarial Networks (GANs).
- Autoencoders, and their use as non-linear counterparts to PCA.
- Practical Implementation: Besides theoretical knowledge, AI engineers must know how to implement deep learning models in practice. This involves:
- Using programming languages like Python.
- Using AI frameworks like PyTorch and TensorFlow.
- Understanding basic data structures and algorithms in Python.
- Knowing how to train and deploy machine learning and deep learning models using Python.
The sources emphasize that a strong foundation in deep learning is essential for working with modern AI applications, especially in generative AI. This includes not only understanding the theory behind neural networks but also knowing how to apply them in real-world scenarios using practical tools and techniques.

AI Engineer Roadmap – How to Learn AI in 2025

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
February 13, 2025