Category: Deep Learning

    AI-Powered $1,000/Day: Beginner’s Free Income System

    The provided video transcript outlines a method for individuals, even beginners, to potentially earn over $1,000 daily by leveraging free AI tools, specifically highlighting DeepSeek. The speaker emphasizes a simple, cost-free approach centered around using AI to generate content that promotes affiliate links for various tools, including over 400 listed in a free checklist offered to viewers who engage with the video. This strategy focuses on identifying trending content, using AI to recreate it, and distributing it across platforms to drive traffic to affiliate offers, often involving free trials and giveaways. The creator advocates for a shift towards passive, recurring income through this model, contrasting it with the burdens of traditional selling and emphasizing the simplicity and accessibility of using AI for income generation without needing technical expertise or significant upfront investment.

    Study Guide: Earning with AI and Affiliate Marketing

    Key Concepts and Topics:

    • DeepSeek: A free AI tool, presented as an alternative to ChatGPT, used for content creation.
    • Affiliate Marketing: Earning commissions by promoting other companies’ products or services through unique affiliate links.
    • Passive Income: Generating income that requires minimal ongoing effort after the initial setup.
    • Recurring Commissions: Earning continuous payments from customers who maintain subscriptions to promoted products or services.
    • Trending Content: Identifying popular topics and content formats to maximize reach and engagement.
    • Prompting AI: Effectively instructing AI tools to generate desired content.
    • Value Provision: Offering free resources (like trials and giveaways) to attract potential customers without direct selling.
    • Lead Generation: Attracting potential customers who show interest in the promoted products or services.
    • Sales Funnel (Mentioned): A system designed to guide potential customers through the process of learning about and purchasing a product (though the focus here is on a simpler approach).
    • Free Tools and Trials: Leveraging no-cost resources and limited-time access to paid tools to offer value and encourage sign-ups.
    • Simplicity (KISS Principle): Emphasizing a straightforward approach to online earning, avoiding overly complex strategies.
    • Content Regeneration: Using AI to create new content inspired by existing popular content.
    • Outliers (YouTube Analytics): Videos that perform far better than a channel’s typical uploads, signaling content with broad appeal worth emulating.

    Quiz:

    1. What is DeepSeek, and how is it suggested to be used for earning money online in the provided text?
    2. Explain the concept of affiliate marketing as described in the source material, and what is the key benefit highlighted for the promoter?
    3. According to the text, what is the primary strategy for getting traffic to affiliate links without spending money on advertising or engaging in complex tactics?
    4. What role does identifying “trending content” play in the proposed method of earning with AI? Provide an example of a tool mentioned for finding trending topics.
    5. Describe the “secret” to making money online with AI, according to the speaker, and why is it considered important?
    6. Explain the concept of recurring commissions and why they are presented as a desirable form of income.
    7. How does offering free trials and participating in giveaways benefit both the potential customer and the affiliate marketer in this model?
    8. What is the speaker’s perspective on the complexity often associated with making money online, and what principle does he advocate for instead?
    9. Summarize the eight-day AI challenge mentioned in the text and how individuals can access it.
    10. How has the speaker’s personal approach to online income evolved, and what are the key differences between his past and present methods?

    Quiz Answer Key:

    1. DeepSeek is presented as a free AI tool similar to ChatGPT. It is suggested to be used for creating various forms of content (videos, blogs, social media posts) to promote affiliate links.
    2. Affiliate marketing, as described, involves earning commissions by sharing unique links to other companies’ products or services. The key benefit for the promoter is that they don’t have to handle product creation, fulfillment, or customer service.
    3. The primary strategy for getting traffic without paid ads is to find trending content and use AI (like DeepSeek) to regenerate similar content, which can then include affiliate links and attract organic views on various platforms.
    4. Identifying trending content helps affiliate marketers tap into topics that are already popular and being searched for by a large audience. Google Trends is mentioned as one free tool for discovering trending topics.
    5. The “secret” to making money online with AI is being good at prompting the AI. This involves effectively instructing the AI to regenerate content similar to what is already performing well online, leading to more views and leads.
    6. Recurring commissions are continuous payments earned every time a referred customer pays for a subscription-based product or service. They are desirable because they can create a stable and passive income stream over time.
    7. Offering free trials and participating in giveaways provides value to potential customers by giving them access to tools or a chance to win prizes without immediate cost. For the affiliate marketer, this can attract a larger audience and increase the likelihood of long-term paid subscriptions, leading to commissions.
    8. The speaker believes that many online earning strategies are unnecessarily complex and advocates for simplicity, following the KISS (Keep It Simple, Stupid) principle. He argues that focusing on simple, easy-to-understand methods is more effective.
    9. The eight-day AI challenge is a step-by-step live series (with no editing) that demonstrates how to get started with using AI for online earning. Individuals can access the checklist and the challenge by going to shinify.com.
    10. The speaker initially made a large amount of money selling his own products but found it stressful due to overhead and customer management. He now focuses on promoting other companies’ tools with free trials, generating recurring, passive income with significantly less personal involvement.

    Essay Format Questions:

    1. Discuss the advantages and potential disadvantages of using a strategy focused on promoting free trials and giveaways for building a sustainable online income through affiliate marketing, as described in the source material.
    2. Analyze the role of AI, specifically tools like DeepSeek, in the content creation and distribution process outlined in the text for earning affiliate commissions.
    3. Evaluate the claim that identifying and regenerating trending content is the “real secret” to making money online with AI. What other factors might contribute to success in this model?
    4. Compare and contrast the traditional approach of selling products or courses online with the model presented in the text, which emphasizes offering free value and earning recurring commissions through affiliate partnerships.
    5. Based on the information provided, outline a hypothetical step-by-step plan for a beginner to start earning money online using the methods and tools discussed in the “01.pdf” excerpts.

    Glossary of Key Terms:

    • AI (Artificial Intelligence): The theory and development of computer systems able to perform tasks that normally require human intelligence, such as learning, problem-solving, and decision-making.
    • Affiliate Link: A unique URL assigned to an affiliate marketer that tracks the customers they refer to a business’s product or service.
    • Algorithm: A set of rules or instructions that a computer follows to solve a problem or perform a task, often used by social media platforms to determine which content to show users.
    • Commission: A fee or percentage of a sale paid to an affiliate marketer for successfully referring a customer.
    • Content: Information or creative material, such as videos, blog posts, images, or audio, shared online.
    • CRM (Customer Relationship Management): A system used to manage interactions with current and potential customers.
    • DeepSeek: A specific free AI tool mentioned in the text, used for generating text and other forms of content.
    • Fulfillment: The process of preparing and delivering a product or service to a customer after a sale.
    • Lead: A potential customer who has shown interest in a product or service.
    • Outlier (in Analytics): A data point that significantly deviates from other data points, often indicating exceptional performance.
    • Passive Income: Earnings derived from an endeavor in which the earner is not actively involved.
    • Prompting: The act of providing text instructions or questions to an AI model to generate a desired output.
    • Recurring Income: Income that is earned repeatedly over time, often from subscriptions or ongoing services.
    • Trending: Currently popular or widely discussed topics or content.
    • Trial (Free Trial): A period during which a customer can use a product or service without payment, often with the expectation that they will subscribe or purchase afterward.

    DeepSeek AI Affiliate Income: A Beginner’s Guide

    Briefing Document: Making Money Online with AI Using DeepSeek

    Date: October 26, 2023 (Based on content references)
    Source: Excerpts from “01.pdf” – A video transcript by Chase with Shinify

    Overview:

    This document summarizes the main themes and actionable strategies presented in a video by Chase from Shinify, focusing on how beginners can earn over $1,000 a day using the free AI tool DeepSeek. The core idea revolves around leveraging AI to create content that promotes affiliate links for various tools and services, primarily within the AI niche, without relying on paid advertising, complex funnels, or extensive technical expertise. The emphasis is on simplicity, utilizing free resources, and tapping into trending content to drive traffic and generate recurring income through affiliate commissions.

    Main Themes and Important Ideas:

    1. Simple and Free Method for Earning Online with AI:
    • The presenter emphasizes that the method described is straightforward and doesn’t require any upfront investment or advanced technical skills.
    • It avoids common complexities like paid ads, TikTok dances, or intricate marketing funnels.
    • The core tools involved are primarily free, such as DeepSeek (a free alternative to ChatGPT).
    • The goal is to create an “automated system where it works for you day and night” by promoting valuable tools that offer recurring commissions.
    2. Leveraging DeepSeek for Content Creation:
    • DeepSeek is presented as a central tool for generating content.
    • Users can prompt DeepSeek to help create various forms of content, including videos, images, blogs, and podcasts.
    • The strategy involves finding trending content and using DeepSeek to “regenerate trending content” for promotional purposes.
    • The focus is on creating “simple very very simple content” that can attract views and leads.
    3. Affiliate Marketing as the Monetization Strategy:
    • The method relies on promoting affiliate links for various tools and services.
    • The presenter provides access to a “free step-by-step checklist” containing over 400 tools with affiliate programs.
    • He showcases personal earnings as proof, including one instance of being owed “$5,799” and another where he made “$93,000 in paid commissions” in the last three months from a single tool.
    • The focus is on promoting tools with “reoccurring products reoccurring tools” to build a sustainable passive income stream.
    • The presenter highlights that almost any company, including major brands like Nike, has affiliate programs, offering diverse promotional opportunities.
    4. Capitalizing on Trending Content:
    • The key to driving traffic without paid ads is to identify and create content around trending topics.
    • Tools like Google Trends are suggested for finding popular search terms and emerging trends related to AI tools or any other niche.
    • The strategy involves finding content that is already performing well (e.g., videos with thousands of views and comments) and using AI to “regenerate that content” in a similar vein.
    • The presenter argues that platforms naturally distribute content related to trending topics, increasing visibility even for new or low-profile users.
    5. Simplicity and Avoiding Complexity:
    • The presenter repeatedly stresses the importance of simplicity (“KISS – keep it simple stupid”).
    • He criticizes the selling of complexity and encourages beginners to focus on understanding and teaching a few core technologies.
    • The 8-day live series within the checklist is designed to be a simple, step-by-step guide without editing or hidden steps.
    6. Providing Value Through Free Resources:
    • The strategy emphasizes offering free value to potential customers, such as access to the checklist, free trials of tools, and information about giveaways.
    • The presenter himself operates on a model of providing free education and resources, stating, “everything I do is 100% for free and I don’t charge you money to teach you how to do what I do why because I already make enough money and I don’t need to sell you anything.”
    • Promoting free trials and giveaways offered by companies is presented as a “win-win” situation where users get access to valuable resources and the promoter earns commissions if those users become paying customers later.
    7. Long-Term Recurring Income vs. Short-Term Gains:
    • The presenter advocates for building a system that generates recurring monthly income rather than focusing solely on immediate, one-time profits.
    • He contrasts his current lifestyle of consistent passive income with a past period of higher earnings but also higher stress and overhead from actively selling products.
    • The goal is to create a “reoccurring profit system” with multiple tools and promotions running concurrently.
    8. Democratizing Access to Technology:
    • The presenter aims to help individuals, even those who are not tech-savvy or who might be “scared of technology,” to adopt and benefit from AI tools.
    • He emphasizes that AI is making things easier and can help overcome past roadblocks and limitations.
    • By learning basic AI prompting and understanding a few tools, individuals can become qualified to help others adopt this technology and earn income in the process.

    Key Quotes:

    • “today we’re going to be talking about the most simple way to earn over $1,000 a day with AI and the best part about all of this is you don’t have to spend any money and you don’t have to be an expert.”
    • “we’re going to be using a free tool called DeepSeek if you’ve never seen it before it’s basically like a free version of Chat GPT…”
    • “…inside of this free checklist that you’re going to get when you drop a comment leave a like and subscribe you’re going to get over 400 tools that all have links that you can start promoting and you can get paid every single month by these different companies.”
    • “…we want to set up for you is an automated system where it works for you day and night you don’t have to worry about the customers you don’t want to have to worry about the fulfillment you don’t want to have to worry about any of that…”
    • “the real secret to making money online with AI is just being good at prompting you just have to be good at finding content that works well and then prompt the AI to give you a good output that helps you create content that’s similar to the thing that you saw that was already working well.”
    • “AI is going to go and create that viral content for you as long as you know how to prompt it correctly.”
    • “you don’t have to be this person with you know thousands of followers or be this big Instagram or YouTube influencer you don’t need that you could literally have just a basic average Facebook profile with a few friends on it start posting content on that that’s trending that’s about something that’s blowing up right now and the algorithms will naturally want to distribute your content to people just because you are talking about something that people want to see.”
    • “the beauty of this system is that you can go out and give away free stuff right i’m not selling anything i’m just giving away free trials to things and if people choose to keep those things and they want to pay for them 30 days later then I make a commission okay and so you don’t have to go and sell your friends on anything you don’t have to go and say ‘Oh you need to buy my course or buy my program or any of that.’ All you’re doing is you’re helping people get free things…”
    • “…you want the real deal you want real reoccurring passive income that comes in every single month whether you’re out on the beach whether you’re out hanging out with your family whether you’re playing video games whatever you’re doing you want to be able to have money coming in every single month…”
    • “AI is not making things more difficult it’s making things easier…”
    • “the real winners the true rich and wealthy people they focus on simplicity it’s the term KISS keep it simple stupid…”

    Actionable Steps for Beginners (Implied):

    1. Access the Free Checklist: Drop a comment, like, and subscribe to the video or visit shinify.com to get access to the list of over 400 tools with affiliate programs.
    2. Explore DeepSeek: Sign up for a free account on DeepSeek and familiarize yourself with its basic functionalities (chat, deep think, search).
    3. Watch Day One of the Checklist Series: Learn how to find and sign up for affiliate programs and obtain affiliate links.
    4. Identify Trending Content: Use tools like Google Trends to discover popular topics, particularly within the AI niche or areas of interest.
    5. Research Successful Content: Look for posts, videos, etc., that have high engagement (views, comments) related to trending topics.
    6. Use DeepSeek to Regenerate Content: Prompt DeepSeek to create similar content based on the successful examples you found. This could be adapting the topic, angle, or format.
    7. Share Content with Affiliate Links: Distribute the AI-generated content on relevant platforms (Facebook, YouTube, etc.), incorporating your affiliate links.
    8. Focus on Providing Value: Emphasize the free resources, trials, and giveaways associated with the tools you promote.
    9. Build a Long-Term System: Continuously identify new trending topics and tools to promote, aiming for a diversified portfolio of recurring income streams.
    10. Embrace Simplicity: Avoid getting overwhelmed by the vast number of tools and focus on mastering a few key strategies and technologies.
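    The content-regeneration step above (step 6) amounts to assembling a structured prompt from a trending topic and a reference piece of content, then handing it to the AI. The sketch below is purely illustrative: the function name, prompt wording, and example inputs are hypothetical and not taken from the transcript.

```python
# Hypothetical sketch of step 6 ("Use DeepSeek to Regenerate Content"):
# given a trending topic and a summary of a post that already performs
# well, assemble the kind of regeneration prompt the method describes.
# Function name and prompt wording are illustrative, not from the source.

def build_regeneration_prompt(topic: str, reference_summary: str,
                              fmt: str = "short video script") -> str:
    """Compose a content-regeneration prompt for an AI chat tool."""
    return (
        f"The topic '{topic}' is currently trending. "
        f"Here is a summary of a piece of content that is already performing well: "
        f"{reference_summary} "
        f"Write a new {fmt} on the same topic with a similar angle, "
        f"but in original wording, and end with a call to action to try a free trial."
    )

prompt = build_regeneration_prompt(
    "AI tools",
    "A 60-second clip listing five free AI tools that save beginners time.",
)
print(prompt)
```

    The same template can be reused for any topic surfaced by Google Trends, swapping in a different output format ("blog post", "podcast outline") for each platform.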

    Conclusion:

    The video presents a compelling and seemingly accessible method for beginners to generate online income using the free AI tool DeepSeek and affiliate marketing. The core strategy revolves around leveraging AI to efficiently create content based on trending topics and promoting valuable, often free-to-try, tools that offer recurring affiliate commissions. The emphasis on simplicity, free resources, and providing value to others positions this approach as a potentially sustainable and less stressful alternative to traditional online business models. However, as with any income-generating opportunity, individual results may vary, and consistent effort in identifying trends and creating engaging content is likely necessary for success.

    AI Affiliate Earnings: $1000/Day with Free Tools

    Frequently Asked Questions about Earning with AI and Affiliate Marketing

    1. What is the core method for earning $1,000 a day as described? The core method involves using free AI tools, specifically DeepSeek (a free alternative to ChatGPT), in conjunction with other free resources to promote affiliate links for various tools and services. The strategy focuses on identifying trending content, using AI to regenerate similar content, and distributing it to attract clicks on affiliate links that lead to recurring subscription sales. The emphasis is on automation and not requiring paid advertising, complex funnels initially, or extensive technical skills.

    2. How does DeepSeek AI fit into this process? DeepSeek AI is used as a free content creation tool. It can help regenerate trending content ideas for various platforms (videos, blogs, social media posts, etc.) based on user prompts. This allows individuals to quickly create content relevant to popular topics without significant effort or cost. While the basic chat function is used, the “deep think” mode is mentioned as potentially providing better outputs.
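    Beyond the web chat, DeepSeek also exposes an OpenAI-style HTTP API, which makes the chat/"deep think" distinction concrete. The sketch below only builds the request with the standard library; the endpoint URL and model names (`deepseek-chat` for basic chat, `deepseek-reasoner` for the "deep think" mode) are assumptions drawn from DeepSeek's public API docs, not from the transcript, so verify them before relying on this.

```python
# Sketch: constructing (not sending) a DeepSeek chat-completions request.
# Endpoint and model names are assumptions based on DeepSeek's public,
# OpenAI-compatible API documentation; they are not in the source transcript.
import json
import urllib.request

API_URL = "https://api.deepseek.com/chat/completions"

def build_request(prompt: str, deep_think: bool = False,
                  api_key: str = "YOUR_KEY") -> urllib.request.Request:
    """Build, but do not send, a chat-completions request."""
    payload = {
        # "deepseek-reasoner" corresponds to the "deep think" mode
        # mentioned in the transcript; "deepseek-chat" is the basic chat.
        "model": "deepseek-reasoner" if deep_think else "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request("Rewrite this trending post about free AI tools in my own words.")
# To actually send it: urllib.request.urlopen(req) -- requires a real API key.
print(req.full_url, json.loads(req.data)["model"])
```

    Passing `deep_think=True` would select the reasoning model, trading response speed for the "better outputs" the answer above alludes to.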

    3. What is the role of affiliate marketing in this system? Affiliate marketing is the monetization strategy. Individuals sign up for affiliate programs of various tools and companies (over 400 are mentioned in a free checklist). They receive unique affiliate links for these products. By creating content around these tools and encouraging people to click on their links, they earn commissions when someone subscribes or purchases the promoted product or service. The focus is on promoting subscription-based services to generate recurring monthly income.

    4. Is prior experience or a large following required to get started? No, prior experience or a large existing online following is not required. The method is presented as beginner-friendly, with individuals of all ages (including those over 50, 60, and 70) reportedly earning money. The emphasis is on finding trending topics and using AI to create content, which can gain traction even without a significant existing audience. Starting with a basic social media profile is suggested as sufficient.

    5. How is trending content identified and utilized? Trending content can be identified using free tools like Google Trends, which allows users to see popular search terms and topics. Once a trending topic relevant to AI or other promotable tools is found, AI (like DeepSeek) is used to help regenerate content similar to what is already performing well. The idea is to tap into existing interest and search volumes to gain visibility and clicks on affiliate links. Tools that analyze YouTube for outlier videos (videos with unexpectedly high early performance) are also mentioned as resources for finding successful content ideas.

    6. What kind of products or services are typically promoted using this method? The focus is on promoting tools and services that offer affiliate programs, particularly those with recurring commissions. Examples mentioned include AI video creation tools, image editing software, writing assistants, and even broader affiliate programs like Nike and Amazon. The free checklist reportedly contains over 400 such tools across various categories. The strategy also includes promoting free trials and even giveaways offered by these companies.

    7. What is the significance of the free checklist and how can it be accessed? The free checklist contains over 400 tools with affiliate programs. It also includes an 8-day live series (available as recordings) that provides a step-by-step guide on how to implement this earning strategy. Access to the checklist is typically offered by leaving a comment, liking, and subscribing to the creator’s content. It is also mentioned that it can be accessed by visiting a specific website (shinify.com) and providing a name and email address.

    8. What is the long-term vision and mindset behind this approach to earning online? The long-term vision is to build a system that generates recurring passive income, allowing for greater financial freedom and flexibility. The mindset emphasizes simplicity (KISS principle), continuous learning and adoption of new AI technologies, and helping others by connecting them with valuable (often free) tools and resources. The goal is to move away from the stress of actively selling and towards a model where providing value leads to sustainable income through affiliate commissions on recurring subscriptions.

    AI & Affiliate Marketing: Generating Passive Income

    Making money online, according to the information in the source “01.pdf”, can be achieved through a simple method that leverages free AI tools like DeepSeek and affiliate marketing. This approach doesn’t require significant technical skills or financial investment.

    The core of this method involves the following steps:

    • Identifying Affiliate Products: The source mentions a checklist with over 400 tools that offer affiliate programs, allowing you to get paid to promote them. Importantly, it highlights that almost every company, including major brands like Nike and Amazon, has affiliate programs. These programs provide you with a unique link, and you earn a commission when people sign up or purchase through your link. The commissions can be recurring, meaning you get paid monthly as long as the customer remains a subscriber.
    • Finding Trending Content: To get visibility, the strategy focuses on finding content that is already popular or “trending” on platforms like Facebook and YouTube. Tools like Google Trends can be used to identify trending topics related to your chosen affiliate products, such as “AI tools”. Additionally, tools that analyze YouTube data can help identify “outlier” videos that have grown rapidly, indicating popular content.
    • Regenerating Content with AI: Once a trending topic or successful piece of content is identified, DeepSeek, a free AI tool similar to ChatGPT, is used to help regenerate similar content. This AI-generated content can be in various formats, such as videos, images, blog posts, or podcasts. The key to success here is effective prompting of the AI to get a relevant and engaging output.
    • Distributing Content with Affiliate Links: The generated content, containing your affiliate links, is then distributed on relevant online platforms. The source suggests starting with platforms like Facebook and YouTube, especially for reaching an older audience interested in AI tools for automation. The platforms are more likely to distribute content that aligns with trending topics.
    • Providing Value and Free Resources: A crucial aspect of this strategy is to offer value to the audience, often in the form of free trials or giveaways associated with the affiliate products. Many companies offer free trials of their tools and may even provide additional incentives like giveaways to encourage sign-ups. By promoting these free resources, you help people discover valuable tools without requiring them to make an immediate purchase. If these users later decide to subscribe to the paid version, you earn a recurring commission.
    • Building a Sustainable, Passive Income: The focus of this method is on building a system that generates recurring and passive income. By promoting subscription-based tools and focusing on providing free value, you can create a revenue stream that continues to generate income even when you are not actively working. This is presented as a contrast to business models that require constant selling and active management.

    The creator of this method emphasizes the simplicity and accessibility of this approach. They highlight that you don’t need to be a tech expert or have a large online following to get started. The key is to learn basic AI prompting and understand how to connect people with valuable, often free, resources through affiliate links. The success stories shared in the source, including individuals of various age groups earning money, aim to demonstrate the potential of this method.

    In essence, the strategy revolves around leveraging AI to create content around trending topics, which then directs people to free trials and giveaways of useful tools through affiliate links, ultimately generating recurring commission income. This model prioritizes providing value to the audience and building a long-term, passive income stream over immediate sales.

    AI-Powered Affiliate Marketing: Simple Online Income

    Based on the information in the source “01.pdf”, using AI tools is presented as a simple and free method to earn money online, particularly through affiliate marketing. The source heavily emphasizes the role of DeepSeek, described as a free alternative to ChatGPT, in this process.

    Here’s a breakdown of how AI tools are used according to the source:

    • Content Creation and Regeneration: The primary application of AI tools like DeepSeek is to regenerate content. This content can take various forms, including videos, images, blog posts, and podcasts. The strategy involves finding content that is already trending or popular on platforms like Facebook and YouTube and then using DeepSeek to create similar content. The effectiveness of this approach hinges on good AI prompting to obtain relevant and engaging outputs.
    • Identifying Trending Topics (Indirectly): While tools like Google Trends are used to find trending topics directly, AI plays an indirect role by enabling the user to quickly create content around these trends once identified. Additionally, AI can be used to analyze successful content (e.g., YouTube videos with high outlier scores) and help regenerate similar formats and themes.
    • Content Distribution (General Mention): The source mentions that AI can help in distributing content (“it’ll find the people it’ll distribute your content”), although it doesn’t provide specific details on how this occurs within the described strategy. The focus seems to be on leveraging the algorithms of platforms like Facebook and YouTube by creating content around trending topics, which these platforms are more likely to distribute.
    • Learning and Teaching: AI is also portrayed as a tool that simplifies the learning process for online money-making and can even assist in teaching others. According to the source, AI can provide instructions, suggest what to promote, and help create content for emails and videos. This makes it easier for beginners, even those who are not tech-savvy, to understand and implement the described affiliate marketing method. The emphasis is on AI making things easier rather than more complex.
    • Image Generation (Specific Example): The source provides a specific example of using AI for generating thumbnails for YouTube videos. The creator used DeepSeek to help regenerate a prompt for a thumbnail similar to a successful video they had seen, and then used another AI tool to create an image based on that prompt.

    In essence, the strategy outlined in the source leverages free AI tools like DeepSeek to efficiently create content based on proven trends, making it easier to attract an audience and promote affiliate products. The focus is on simplicity and accessibility, with AI handling much of the content creation process. The source suggests that by mastering basic AI prompting, individuals can tap into the potential of trending topics and provide value (often free resources) to others, ultimately leading to passive income through affiliate commissions.

    AI-Powered Affiliate Marketing: Earn with Trending Content

    Affiliate marketing is a key strategy discussed in the source “01.pdf” as a simple way to earn money online by getting paid to promote products. The source emphasizes that almost every company, including major brands like Nike and Amazon, has affiliate programs.

    Here’s a breakdown of affiliate marketing as described in the source:

    • How it works: Companies provide you with a unique affiliate link. When people click on your link and either sign up for a service or purchase a product, you earn a commission.
    • Types of affiliate programs: The source mentions a checklist with over 400 tools that offer affiliate programs. These tools cover various categories like video, image, and writing tools. Importantly, it also highlights that you can promote physical products from companies like Nike and Amazon. For instance, Nike’s affiliate program allows you to earn up to 15% on all valid US sales of Nike products.
    • Recurring commissions: A significant advantage of affiliate marketing, as highlighted in the source, is the potential for recurring commissions. By promoting subscription-based tools, you can get paid every single month as long as the customer you referred remains a subscriber. The creator of the method shares examples of earning recurring income from various tools.
    • The role of AI in affiliate marketing: The core of the method described in the source involves using free AI tools like DeepSeek to create content around trending topics and embedding affiliate links within that content. The AI helps in regenerating content such as videos, images, blog posts, or podcasts. The idea is to leverage trending content to attract an audience and then direct them to affiliate offers.
    • Finding affiliate products: The provided checklist of over 400 tools is presented as a resource for finding affiliate programs. The source also advises exploring affiliate programs offered by well-known brands in various niches.
    • Generating sales without being a tech expert or spending money on ads: The source stresses that this approach doesn’t require significant technical skills or financial investment in paid advertising. The focus is on finding trending content and using AI to create similar content, which platforms like Facebook and YouTube are more likely to distribute organically.
    • Providing value and free resources: A key element of the strategy is to offer value to the audience by promoting free trials and giveaways associated with affiliate products. Many companies offer free trials as an incentive for users to try their tools. By promoting these free offers, you can encourage sign-ups, and if those users later convert to paying customers, you earn a commission. The creator shares an example of a tool offering a 14-day free trial and a chance to win a trip to LA, both of which can be promoted through an affiliate link.
    • Building a passive income stream: The ultimate goal of this affiliate marketing strategy is to build an automated system that generates recurring and passive income. Once the system is set up and people are subscribing to the tools you promote, you can earn money consistently without needing to actively manage the customers or the fulfillment process. The creator contrasts this with business models that require constant active selling.
    • Simplicity and accessibility: The source emphasizes the simplicity of this affiliate marketing method, stating that it’s accessible even to beginners and those who are not tech-savvy. The key is to learn basic AI prompting and connect people with valuable resources through affiliate links.
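The recurring-commission model described above can be made concrete with a small projection. The numbers below are illustrative assumptions (the source mentions only a 40% recurring commission on one tool, not referral volumes, prices, or churn rates):

```python
def project_monthly_income(new_referrals_per_month, subscription_price,
                           commission_rate, monthly_churn, months):
    """Project monthly affiliate income when each referred subscriber
    pays a recurring commission until they churn. All inputs are
    illustrative assumptions, not figures from the source."""
    subscribers = 0.0
    income_by_month = []
    for _ in range(months):
        # Some existing subscribers churn, new referrals are added.
        subscribers = subscribers * (1 - monthly_churn) + new_referrals_per_month
        income_by_month.append(subscribers * subscription_price * commission_rate)
    return income_by_month

# Hypothetical: 20 new referrals/month to a $97/mo tool at 40% commission, 10% churn.
income = project_monthly_income(20, 97, 0.40, 0.10, 12)
print(f"month 1: ${income[0]:.2f}, month 12: ${income[-1]:.2f}")
```

This makes the compounding effect the source describes visible: month-one income is modest, but because subscribers accumulate faster than they churn, later months earn several times more from the same monthly effort.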

    The creator of this method shares personal experiences of earning significant income through affiliate marketing and highlights success stories from others in their community, including individuals of various age groups. The focus is on a win-win model where you help people discover valuable (often free) tools, and in return, you earn commissions if they become paying subscribers.

    Affiliate Marketing: Leveraging Free Value Giveaways

    Based on the information in the source “01.pdf”, free value giveaways play a significant role in the affiliate marketing strategy described as a simple way to make money online.

    Here’s a breakdown of the discussion around free value giveaways:

    • Companies offer them to attract users: The source explicitly states that many companies provide free trials of their tools and may even offer additional incentives like giveaways to encourage people to try their platforms. The reasoning behind this is that they believe if people experience the value of their tool or platform, they are more likely to become paying subscribers eventually.
    • Promotion as a core strategy: A crucial aspect of the described affiliate marketing method is to promote these free trials and giveaways associated with affiliate products. The creator emphasizes that instead of directly selling products, the focus is on helping people discover and access these free resources.
    • Examples of free value: The source provides concrete examples of the types of free value being offered:
        • Free trials of tools: Companies offer free access to their software or services for a limited period, such as a 14-day free trial of a video filtering tool called Spotter.
        • Giveaways: Some companies run contests where users who sign up for a free trial or take a similar action are entered to win prizes. An example mentioned is a giveaway of a trip to Los Angeles with a paid-for flight and hotel, offered by Spotter in conjunction with their 14-day free trial.
    • Win-win situation: The strategy is framed as a win-win for everyone involved:
        • The audience wins because they get access to valuable tools and the chance to win prizes for free, without any immediate obligation to purchase. They are essentially receiving a favor by being connected to these free resources.
        • The affiliate marketer (you) wins because offering free value encourages more people to click on their affiliate links and sign up for trials. If these users later decide to pay for the tool, the affiliate marketer earns a recurring commission.
        • The company wins because it gains new users and potential long-term customers without having to spend heavily on direct advertising. Giveaways, even significant ones, can be a cost-effective way to acquire customers compared to traditional advertising methods.
    • Shifting the sales mindset: The source suggests that this approach allows individuals to make money online without constantly feeling like they are “selling” something. Instead, they are helping people by connecting them to valuable, often free, resources. This can be a more comfortable and sustainable approach for many people.
    • Generating recurring income: The ultimate goal is to build a system where the promotion of these free resources leads to people becoming long-term paying subscribers of the affiliated tools, thus generating recurring and passive income for the affiliate marketer.

    In summary, the strategy described in the source heavily leverages the power of free value giveaways, offered by companies, as a way to attract users and drive affiliate sign-ups. By focusing on providing free value rather than direct sales, individuals can build a sustainable online income stream based on recurring commissions.

    AI-Driven Recurring Affiliate Income System

    The source “01.pdf” extensively discusses a system aimed at generating recurring income through affiliate marketing, heavily leveraging AI tools and free value giveaways. This system focuses on building a sustainable income stream over time, rather than quick, one-time profits.

    Here are the key aspects of this recurring income system as described in the source:

    • Affiliate Marketing of Recurring Subscription Tools: The foundation of this system is promoting tools that offer recurring commissions. The source provides access to a checklist of over 400 tools that have affiliate programs, allowing you to earn monthly payments as long as the referred customer remains a subscriber. This contrasts with promoting one-time purchase products where you only earn a commission once. The emphasis is on building a portfolio of different recurring income streams from various tools.
    • Leveraging Free AI Tools for Content Creation: A core component of the system is using free AI tools like DeepSeek (a free alternative to ChatGPT) to create content. This AI-generated content, such as videos, images, blog posts, and podcasts, is used to attract an audience to the affiliate links. The source stresses that this eliminates the need to spend money on content creation or be a tech expert. The key is to prompt the AI effectively to regenerate content that is likely to resonate with potential users.
    • Focusing on Trending Content: The strategy involves identifying trending topics using tools like Google Trends and then using AI to create content around these trends. By tapping into what people are already searching for, the system aims to gain organic reach on platforms like Facebook and YouTube. These platforms are more likely to distribute content related to trending topics, increasing visibility without paid advertising.
    • Promoting Free Value Giveaways: A crucial tactic within this system is to promote free trials and giveaways associated with affiliate products. Many companies offer free trials to encourage adoption of their tools. Additionally, some companies may offer special giveaways like trips or money to incentivize sign-ups through affiliate links. The strategy is to lead with value by offering something for free, making it easier to attract clicks on affiliate links. The source emphasizes that you are essentially helping people discover valuable resources for free.
    • Organic Content Distribution: The system relies on the algorithms of platforms like Facebook and YouTube to distribute the AI-generated content organically. By creating content around trending topics, the likelihood of the platform showing it to interested users increases, reducing the need for paid advertising. The source suggests that even a basic social media profile can be used to start distributing this content.
    • Automated System for Passive Income: The goal is to create an automated system where you are consistently generating leads and sign-ups for recurring subscription tools, leading to passive income. Once the system is set up and people are subscribing through your affiliate links, you earn money continuously without needing to actively manage customers or fulfillment. This provides a lifestyle with reduced overhead and the flexibility to take time off while still earning.
    • Simplicity and Accessibility: The source repeatedly emphasizes the simplicity of this system, making it accessible to beginners and those who are not tech-savvy. The focus is on learning basic AI prompting and connecting people with valuable free resources through affiliate links.
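Since the system hinges on "basic AI prompting," one concrete way to structure such a prompt is a small template function. The wording below is purely illustrative; the source quotes only the opening phrase "I need help regenerating trending content" and does not give a full prompt:

```python
def build_regeneration_prompt(trending_topic, content_format, audience):
    """Assemble a prompt asking an AI chat tool (e.g. DeepSeek) to
    regenerate content around a trending topic. Illustrative template,
    not the creator's actual prompt."""
    return (
        "I need help regenerating trending content.\n"
        f"Topic: {trending_topic}\n"
        f"Format: {content_format}\n"
        f"Audience: {audience}\n"
        "Write an original piece in this format that covers the topic, "
        "matching the style of content that already performs well."
    )

prompt = build_regeneration_prompt("free AI tools", "short Facebook post", "beginners")
print(prompt)
```

The template simply makes the three decisions the source emphasizes (topic, format, platform audience) explicit, so a beginner can reuse the same structure for videos, images, blog posts, or podcasts.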

    In essence, the recurring income system described in the source is a multi-faceted approach that uses free AI tools to efficiently create content around trending topics, which is then distributed organically to attract people to free trials and giveaways of recurring subscription tools offered by companies with affiliate programs. This focus on providing free value aims to build a sustainable stream of passive, recurring income. The creator of this method contrasts this approach with models that require constant active selling or significant financial investment.

    How I Make $1,000 a Day Using DeepSeek (Even if You’re a Beginner!)

    The Original Text

    all right what’s going on everyone welcome back chase with Shinify here and today we’re going to be talking about the most simple way to earn over $1,000 a day with AI and the best part about all of this is you don’t have to spend any money and you don’t have to be an expert you don’t have to dance on TikTok you don’t have to do anything complicated or you don’t have to learn anything that requires you to be super techsavvy in fact everything in today’s video is going to be very very simple and it comes with a free step-by-step checklist and if you want access to this all you have to do is drop a comment leave a like and subscribe and I will send you access to this right now without you having to spend any money or take out your wallet or do any of that because everything I do is 100% for free and I don’t charge you money to teach you how to do what I do why because I already make enough money and I don’t need to sell you anything so don’t worry i’m not just another guy out there going and trying to pitch you on something in this video now inside of today’s video I’m going to give you a very very simple process to get started and what we’re going to be doing is we’re going to be using a free tool called DeepSeek if you’ve never seen it before it’s basically like a free version of Chat GPT and we’re going to be pairing DeepSeek with a few other free tools to help us go and get sales on tools that we can get paid to promote without having to do any of the fulfillment so inside of this free checklist that you’re going to get when you drop a comment leave a like and subscribe you’re going to get over 400 tools that all have links that you can start promoting and you can get paid every single month by these different companies and if you don’t believe me let me show you a few of these companies that are paying me every single month this is one of them you can see they actually owe me $5,799 here and if I go to my payouts you can see they pay me every single month 
because I use AI to send these links to people and I get paid out every single month so I can go do whatever I want and that’s what we want to set up for you is an automated system where it works for you day and night you don’t have to worry about the customers you don’t want to have to worry about the fulfillment you don’t want to have to worry about any of that because once people are paying for these products you don’t have to do anything and I’m doing this with a bunch of different companies and by the way it’s not only me we have people in our group which by the way we have people of all ages in our group people over 50 people over 60 people over 70 earning money with what we’re talking about and you can click on our daily wins inside of our group because the link to our group our free group is in the description of this video and you can go see all the different people in here look at this we have Little Rock here who said “Okay is it thousands?” No but it’s money I made putting my affiliate links in without a funnel i’m showing for two reasons one this business makes money for anyone having doubts and two don’t just put your affiliate link here and there without a funnel landing page and so we’re going to show you by the way what all of this means if you’re brand new you don’t know what a funnel is but check this out little Rock here just got started with one of these links and they’re already earning $132.81 with this system okay and so you might not earn a ton of money right out of the gate that’s one thing I want to tell you as a disclaimer this isn’t like one of those get rich overnight things this is something where you start setting up your system you start getting people to click on your links and over time it starts to compound and you’ll even see with my own payments here that when I first started with this specific tool I actually wasn’t making that much i was making $77 on my first month okay so eventually though as you build up the reoccurring 
income and you start to diversify passive income between different tools that’s when you really start to see the power of this because you have money coming in every single month reoccurring because what we’re doing is we’re helping people get subscriptions to reoccurring products reoccurring tools because there’s a wide open market right now for AI tools and we can get paid out every single month look this is another tool that paid me in the last 3 months I was able to make $93,000 in paid commissions just off of this tool and look at all these other tools that I’m sending traffic to so I could go on and show you all this proof again I’m just showing you this not to brag or anything i just want you to see that it’s possible and I want you to see that this is real okay this is a live stream by the way i don’t hide anything i always tell people listen go and watch all of my live streams because what what I do is I actually put together AI challenges and you can go and follow along with these AI challenges in a live stream environment because inside of our checklist here this is an 8-day live series no editing no screenshots nothing’s hidden okay so you can go see every single day for eight days straight what I do for step one what I do for step two what I do for step three and you can follow along step by step and that’s what this checklist is for is to make it very simple for you okay all right so what we’re going to be talking about again first of all is how to just get started with a basic AI product so what we’re going to do is we’re going to head over to DeepSeek and we’re going to grab a free account okay so we’re going to click on this link to get a free account you can sign in with your Gmail you can sign in with your uh any sort of email here i’ll just sign in with a random Gmail just so you can see what I’m doing and once I’m signed in it’s going to ask me for my birthday just to confirm that I’m uh allowed to use this tool and then I’ll be in okay so
    I’ve already done this before so there you go now I’m in now inside of DeepSeek very simple it has a a few different buttons here we have the chat so if we want to start a new thread new chat we can always go back to our other chats once we have one and then we have the deep think and then the search and then the attachment deepthink just gives you a better output for when you ask a question okay so uh you don’t have to use it but you can and uh we can start using this for free right now so what I can do is I can start having DeepSeek help me out with whatever I want to you know start doing and so we’re going to specifically be using DeepSeek to help us in creating content and here’s what you can do with this you can create content for whatever you want okay so whether you want to create a video whether you want to create an image whether you want to create a blog whether you want to create a a podcast we can create any type of content because by the way what we’re doing here is we’re we’re creating content and we’re not paying for anything right so you know there’s a lot of people out there that will say “Well you know you’re going to go get these links here and then you’re going to start running paid ads to these links.” No we’re not going to run any paid ads we’re not going to do any weird dances on TikTok we’re not going to do any of that all we’re going to do is we’re going to find trending content whether it’s on Facebook whether it’s on YouTube wherever we want to find the trending content and we’re going to start feeding it to our AI to this free tool here okay so check this out i’m going to click on deep think we’ll turn it off here just to get a basic response and I’m going to say I need help regenerating trending content now by the way when you go and start creating content you want to make sure you have a few things in place and so if you haven’t already go into the checklist and watch day one inside of day one I show you how to set up your affiliate links
so if you don’t have your affiliate links yet to any of these products go watch day one because in there I break down step by step how to go and find a good affiliate product how to go and sign up for a free account how to grab your own link all of that so make sure you do watch day one but pretty much all the tools have links that they give you okay and they’re assigned to you so you can see this tool actually has a link right here that is assigned to me anybody who clicks on this link if I send this as a message or if I create content or whatever I’m doing to get people to click on this link it gets registered in my account here and if they turn into a customer I get a payment for that customer okay it’s a 40% uh reoccurring commission here and so getting the affiliate link it’s very simple but if you’ve never gotten an affiliate link before make sure you go watch day one because in day one I do show you how to go and pick your links and figure out which one’s right for you and you know there’s 400 tools on this list so you got to kind of figure out which ones you want to promote there’s there’s so many different things you can promote out there you can promote video tools image tools writing tools and uh I’ve said this before by the way for those of you who are new here but I’m going to say it again you don’t have to even promote online tools you could go and promote the brand Nike you could go promote Amazon products almost every company out there has an affiliate program that means they’ll give you a link and you can get paid to promote their products you could literally send your favorite shoes to somebody that are Nike shoes and you can earn a 15% commission on that and if you don’t believe me just go Google Nike affiliate program and you’ll see right here that they do a 15% affiliate look at this earn up to 15% on all valid US sales of Nike products okay so I know a lot of people like go “Well this is weird why would I go and sell other people’s products?” 
Because companies want you to sell their products they want people to go out and sell stuff for them and so if you can be the person that just finds good products and connects people to those products you make money on that and the best part about it is AI does all the heavy lifting ai will go and create all the content for you it’ll find the people it’ll distribute your content and it’s just an amazing thing and if you learn this you can go out and create a ton of content that goes and recommends products and you can use AI for this and you don’t have to be a tech expert you don’t have to have a supercomputer you don’t have to do any of that right you don’t have to drive a Ferrari literally you just go into a simple free tool like the one you’re looking at right here and you start creating simple very very simple content and once you create that simple content you’ll be amazed how many views and how many leads you can start to get for these different products because AI is going to go and create that viral content for you as long as you know how to prompt it correctly okay and so the secret and and this is really what I want to show you today and and drive home the real secret to making money online with AI is just being good at prompting you just have to be good at at finding content that works well and then prompt the AI to give you a good output that helps you create content that’s similar to the thing that you saw that was already working well okay so if you see a post that has thousands of comments or you see a post that has thousands of views and and you know how to prompt the AI to go and regenerate that content for you well guess what now you’re getting thousands of views now you’re getting thousands of leads and if you don’t believe me again check out what I do check out what uh the people in the group do i I’m not the only example of this there’s so many people out there that are doing what I’m talking about right now and they’re getting tons and tons of
    views they’re getting tons and tons of leads and sales and they’re not experts by any means they just know how to go and prompt the AI so what we’re going to do is we’re going to start figuring out what topics we want to target now obviously if you’re going to be targeting AI you’re going to be or AI tools you’re going to be finding topics around AI tools now there are different tools there’s different free tools that allow you to go and find trending content and trending terms around AI and around anything that you’re looking for so Google Trends is one of them if I go to Google Trends here I can type in something like AI tools and I can click on explore and this will give me again for free here a bunch of different terms around AI tools that are popular right now so I can see Grok is a really popular term success database and so I can start diving into these topics more or I’ll show you how to do that in a second but this is how I can kind of see what people are actually looking for and what’s trending now if we can just find what’s trending what a ton of people are looking for we can use AI to help us create content around that trend and we can start grabbing people from that trend because people are looking for stuff every single day and when you find a trend that’s blowing up it’s usually easy to get views even if you don’t have a ton of followers even if you don’t have you know a bunch of money for ads or any of that these platforms want to distribute your content if you create content around things that are trending even if you aren’t somebody that’s ever really even posted online before so don’t think you have to be this person with you know thousands of followers or be this big Instagram or YouTube influencer you don’t need that you could literally have just a basic average Facebook profile with a few friends on it start posting content on that that’s trending that’s about something that’s blowing up right now and the algorithms will naturally want to
    distribute your content to people just because you are talking about something that people want to see and so the platforms know when something’s trending and they want to distribute content around those things because they want to keep people on the platform and so our job is to find that trending content and distribute it so what we’re going to do is we’re going to choose a topic let’s say it’s Grok here okay so we’re going to click on Grok now inside of Google Trends I can filter by the past 4 hours 7 days 30 days 90 days whatever I want and you can see Grok over time has been trending upward so it’s actually doing pretty well right now it’s not at its peak its peak was back in March 10 but it looks like it might go back up to to to um a 100 score up here again okay now I can see all the different things related to this DeepSeek’s one you can see I’m making a video about DeepSeek but my point here is that you can go in here find these trending topics and now that you have a topic let’s just say it’s Grok or DeepSeek let’s say you were going to make a video about DeepSeek what you can do is then you can start doing research about what content is actually trending around that thing okay and I’m going to show you how to do that in a second here I have a comment right now in the live and they said “How do I get started with the checklist?” Okay so obviously like I said drop a comment leave a like and subscribe but you can go to shinify.com and you can go and grab the checklist just by entering your first name and email that’s all I ask of you and uh I do have a recommended tool after this it’s completely optional you don’t have to people always say “Oh well Chase you’re just pitching me on this or that.” You don’t have to go and get this okay you can go and do whatever you want right this is the CRM this is the automated tool I use for follow-up but it’s completely optional and again it’s just because if people want it it’s a 30-day free trial and just like I
show you how to recommend tools and free trials and all the things that I’m showing you to make money I go out and I still recommend free tools and that’s the best part about this system and the beauty of this system is that you can go out and give away free stuff right i’m not selling anything i’m just giving away free trials to things and if people choose to keep those things and they want to pay for them 30 days later then I make a commission okay and so you don’t have to go and sell your friends on anything you don’t have to go and say “Oh you need to buy my course or buy my program or any of that.” All you’re doing is you’re helping people get free things and uh by the way a lot of these companies have free stuff on top of their free stuff what does that mean well when you go and grab a free trial to for example the tool I just showed you a second ago you get entered to win money uh there’s another tool let me actually show you this one if you go to the link in the description uh the tool that says spotter they’re doing another giveaway where they’re actually giving away a trip to California and let me show you what this affiliate link looks like because you can actually go and and promote this yourself so all these things that you can participate in you can also give away so you can literally enter a giveaway but then also participate in giving away the giveaway i don’t know i I don’t know if that makes sense but hopefully it does but check this out this is the link that they gave me and this link goes to a 14-day free trial to their tool and on top of the 14-day free trial they also are giving away the ability to win a trip to LA with a paid for flight paid for hotel and all you have to do is literally grab their trial you don’t even have to rebuild you don’t even have to pay for it right you could cancel the free trial before the 14 days are up you could use the tool for 14 days and then you could still win a trip okay so there’s companies that are doing 
this all the time because they want people to go to their tool or their company they want people to adopt their platform they they think that if if they give these incentives and people will end up signing up and paying for the tool eventually and it’s true it works right that’s how I make as much money as I do every month is literally just giving away free stuff and people can choose if they eventually want to pay for it or not okay so check this out if I go and log into this tool this tool is actually amazing what it does is it allows you to go and filter all of YouTube so what I can do here is I can go to the outliers i can click on outliers and this will show me all of the most popular videos in my space around what I what I talk about so AI and then I can actually use AI i can use DeepSeek to go and recreate this trending content and that’s what I do so I literally the other day saw this thumbnail here it says AI will retire you it has 422,000 views an outlier of 10.9x and what an outlier is is it’s basically the first seven days of growth to the video opposed to the first 6 months and then if you put those two together you can kind of see organically how well that that video does is it a video that just goes really and blows up for the first day or two and then dies off eventually or does it continually blow up over time and so if we can find something with a good outlier score we know that that’s probably going to be a video that does well for us okay and if I go to my YouTube channel check check this out i’ll go to my live and I did a similar video you can see the thumbnail is very similar and that video is doing very well right now okay I’ll give you another example the video you’re watching right now if you’re watching the live stream there was a video that I saw was doing really well around AI and I said “Okay you know what i’m going to go and I’m going to take that video and I’m going to use AI to help me regenerate the different parts of that video that I
want to create.” And so my thumbnail today guess what check this out if I go to replicate here which is my AI cloning tool it’s not mine it’s just one that I use but check this out the original thumbnail looked just like this let me see if I can go find the video but the original thumbnail looked just like this and then I had Deepseek help me regenerate a prompt for this thumbnail and I fed it to an AI tool which then cloned me and gave me my thumbnail and so inside of our AI challenge by the way we show you how to do all of this we show you how to go and create your own AI image clone we show you how to go and do the topic research we show you how to regenerate trending content not just in terms of video but any type of content whether you’re you know regenerating a Facebook post whether you’re regenerating uh you know a Tik Tok whether you’re regenerating a Instagram picture right like you can go and choose what platforms you want to target and ideally you target the platforms where your audience is okay so Tik Tok and Instagram are usually for younger people okay so if you’re looking to target a younger audience and you know let’s say you have a gaming channel those platforms are great if you’re going for an older audience YouTube and Facebook is generally a little bit better there’s more people that are older on those platforms but you usually want to choose one or two platforms okay and I recommend people starting out start on something like Facebook and YouTube just because I think it’s easier to make a sale uh you don’t need as many views there’s a lot of people on Facebook and YouTube that are looking for you know they’re looking for AI tools to help them automate what they do you know automate an online business and so this is just a massive open wide space right now for you to get into and by the way these 400 tools on this checklist are just a few this just a drop in the bucket i mean this there are so many other tools that you can go out and promote i 
actually built this probably about a year ago and I I would say that there’s probably four or 5 thousand of these now that you you could go out and promote and on top of this you can actually reach out to these companies and you can offer to promote these people these companies for for free right well you’re not really doing it for free because you’re still getting an affiliate link but you can even ask them and you can say “Hey listen i have people that are interested in your product would you be willing to do a free giveaway on top of this and and some of these companies will tell you yes they’ll say “Well yeah we’re actually giving away money or we’re giving away a trip to you know the Bahamas or we’re giving away a trip on a cruise.” And then you take those giveaways and you take the free tools and you create content around those things and you say “Hey listen audience or people that follow me or people that I’m friends with this company is giving away this free thing all you have to do is grab the free thing the free trial or free whatever that they’re they’re offering and you get entered to win and so everybody wins with this model i don’t think you understand that or or maybe you do but ideally we we want to create a system right where first of all we don’t have to do any heavy lifting we don’t have to you know go and do a bunch of fulfillment we don’t want to have to deal with a bunch of customers right that’s where the affiliate part comes in because the companies take care of all of it for us but on top of it we don’t want to have to uh worry about what we’re selling we want to just give away free stuff we don’t want to have to sell things to people and so what we can do is we can just give away stuff free value right so the people around us win because they’re they’re getting entered to win stuff and they’re getting all this free value but we win because if those people end up becoming long-term adopters of the things that we’re giving away we end up 
making money and so that’s a win-win for everybody and then the company wins as well because they get customers and they don’t really have to spend that much right if they give you something to give away and it only costs them you know a thousand bucks or 500 bucks that’s not that much for for big companies you know they’ll spend 10 or 20k on ads in a day so for you to go out and do a giveaway that lasts a month that only cost them a thousand bucks but now they have 500 new customers or 100 new customers everybody wins okay and so that’s why I want you to understand this model here because you don’t have to be the person that’s going out and selling all the time and that’s why I do this now i actually made more money by the way when I sold things but my life is better now why because I don’t have to worry about anything check this out i actually made in one month over $200,000 in fact if you put PayPal with how much I was making it was like 300 grand and I stopped selling around this point because I decided that my life was not fun i was making a lot of money but also I had all these employees i was worried about you know my customers i was worried about making sure the products were good i was worried about all this stuff all the time and now I only make let’s say maybe 50 to 100K a month which you might be saying well that’s only but I’m saying opposed to what I was making before but my overhead is virtually zero it’s reoccurring income it’s passive income i can choose to take next month off and still make just as much money and so it’s a completely different lifestyle and this is one of the things you want to be careful of when you start listening to people who say they make a lot of money they might tell you “Oh yeah I make all this money.” but in reality their life’s awful and they don’t have actual reoccurring passive income you want the real deal you want real reoccurring passive income that comes in every single month whether you’re out on the beach whether 
you’re out hanging out with your family whether you’re playing video games whatever you’re doing you want to be able to have money coming in every single month and the way you can do that is by what we’re talking about here it’s by learning AI it’s by learning a few different tools that by the way have free trials you don’t have to pay for them if you don’t want to you can try them out if you don’t make any money with them you can cancel the subscriptions on them but it’s to adopt technology and then it’s to take that technology and hand it to other people and say “Listen I use this tool to help me do this i use this tool to help me do that.” And then other people are going to need help with those things right think about how many people every single day manually respond to emails or they manually go and create posts on Instagram or they manually go and write things on Facebook there are so many people out there that have never even logged into an AI tool or into ChatGpt or DeepSeek or any of these tools and so all you have to do is learn a little bit about these things and then help other people adopt that technology because there’s so many people that have not adopted technology yet because they don’t understand it they’re scared of it they’re terrified and so you might be one of those people you might say “Well I’m scared of technology i’m scared of you know AI replacing me or I’m scared of not you know being the tech super genius that I need to be in order to learn these things.” And and and you don’t have to be that you know people sell complexity because they think it makes them look smart but it’s not it’s not something people who think things that are complex are smart are dumb the the real winners the true uh rich and wealthy people they focus on simplicity it’s the term Kisss KISS keep it simple stupid okay and so if you’re out there and you’re worried all day and you’re thinking I don’t know what to do i don’t know there’s so many things there’s so many 
tools the reason why you feel that way is because there’s so many people out there that are selling you complexity okay and so the whole idea is that you simplify what we’re what you’re doing and you just go out first of all you take the challenge right take the simple challenge it’s eight days it’s 1 hour a day you can do that you could do this in one day if you wanted to okay learn a few pieces of technology just a few you don’t have to learn all 400 tools learn a few pieces of technology once you understand it you now are qualified to go teach it other people whether it’s through a direct conversation whether it’s through you know posting in a group whether it’s through creating a video you are now qualified to go and help people with that thing because you know how it works and what form you do that in is up to you you could do it in a blog post you could do it in an email and by the way you can use AI to go and teach it for you you don’t even have to do it yourself but at the end of the day you want something that’s going to go and help people adopt this technology and you’re going to make money off of that and then on top of it you’re getting a lot of value for these people because not only are you teaching them something but you’re also connecting them to companies that have free giveaways free trials a bunch of free stuff and and and these companies are willing to frontload all this value because they’re willing to lose money on the front end to impress your people that you’re now helping right and you’re not going out and selling anything to them you’re just helping them with something for free you’re literally doing them a favor okay and that’s all you have to do is you go out every single day you help people out for free don’t have to sell anything you give them all this value and then you end up making money for it so it’s a win-win for everybody and it’s a it’s a business model that I think you’re going to start seeing more and more of in the future 
you don’t see a lot of it right now because people haven’t really learned it yet this is a new model that has been working really well for me i see it working well for a few other people but most people don’t really know about this yet because they haven’t adopted it yet and they don’t know that they can go out and not have to sell every day and and also you got to understand most people are so focused on making money right now this second that they would rather have $1,000 right now than $1,000 every month for the rest of their lives okay and so our job is to shift that mindset right a lot of people are probably watching this video they’re like “I need money right now.” I get that but would you rather have $1,000 right now or would you rather have $1,000 every month for the rest of your life and if it’s the latter if it’s the second thing then ideally what we need to do is set up the system for you okay okay we need to set up a reoccurring profit system for you that every single month you have different tools you have different promos you have different giveaways that you’re you’re putting on your calendar and you’re going out and you’re helping people with those things right this company’s now giving away this this company’s now giving away that you’re connecting people to those companies and you’re making money off it okay so again if you haven’t already and you want access to everything we’re talking about in today’s video make sure you drop a comment leave a like and subscribe mj Healthcare good to see you hello Chase i’m new here and love to learn awesome and I appreciate that super donation you did earlier thank you so much for that yeah so if you guys have any questions make sure you join our group um we are very active in there i’m very active in there uh you can tag me in there just doshinify in the public chat i’m almost in there every single day happy to help you out um but yeah that’s it go out and learn go out and start adopting this new technology 
and don’t worry about it don’t don’t don’t be scared of it okay i know a lot of people look at this stuff and they go “Oh this is so terrifying you know I I just I’ve never been good with computers.” This is the this is the opposite of what you’re thinking okay AI is not making things more difficult it’s making things easier and if you start learning it right you just learn basic prompting just going in and just typing and having me uh conversations with it you’ll learn that it’s actually making your life easier it’s telling you what to say it’s telling you what to sell it’s telling it literally gives you instructions if I say “I don’t know what to sell today i don’t know what to do.” AI is going to solve that problem for me it’s going to say “Well this is what you should do.” and I don’t know what email to send i don’t know how to sell through email i don’t know how to sell through video ai tells you how to do it okay so all these things that you thought were difficult are now becoming easier because of AI so don’t be overwhelmed by it adopt it and learn that it’s actually going to create an enhanced version of you it’s going to help eliminate all those things that you had as problems in the past okay if you can learn to adopt it that way and change your mind around it and not think about it as something that’s scary but something that’s going to actually enhance what you’re doing and who you are you’ll realize that the thing that you were struggling with before is no longer a struggle and now you have the ability to get the things done that you need to get done that you couldn’t do before because you had that roadblock okay so use AI don’t be scared of it uh hi Chase can I still begin that Discord challenge yeah we’re still doing the challenge it doesn’t end till the end of the month so there’s definitely time left to join the challenge you just go to shinify.com and you’ll get sent the checklist and then you just start going through the step-by-step videos check 
this out you just go 1 2 3 4 and uh there you go that’s the challenge there’s other stuff in here as well you can go through but all you have to do is just go to the link in the description shinify.com and that’s it we’ll see you inside hopefully and until next time happy moneymaking see you guys bye

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Modern SQL Data Warehouse Project: A Comprehensive Guide

    Modern SQL Data Warehouse Project: A Comprehensive Guide

    This source details the creation of a modern data warehouse project using SQL. It presents a practical guide to designing data architecture, writing code for data transformation and loading, and creating data models. The project emphasizes real-world implementation, focusing on organizing and preparing data for analysis. The resource covers the ETL process, data quality, and documentation while building bronze, silver, and gold layers. It provides a comprehensive approach to data warehousing, from understanding requirements to creating a professional portfolio project.

    Modern SQL Data Warehouse Project Study Guide

    Quiz:

1. What is the primary purpose of data warehousing projects?
    2. Briefly explain the ETL/ELT process in SQL data warehousing.
    3. According to Bill Inmon’s definition, what are the four key characteristics of a data warehouse?
    4. Why is creating a project plan crucial for data warehouse projects, according to the source?
    5. What is the “separation of concerns” principle in data architecture, and why is it important?
    6. Explain the purpose of the bronze, silver, and gold layers in a data warehouse architecture.
    7. What are metadata columns, and why are they useful in a data warehouse?
    8. What is a surrogate key, and why is it used in data modeling?
    9. Describe the star schema data model, including the roles of fact and dimension tables.
    10. Explain the importance of clear documentation for end users of a data warehouse, as highlighted in the source.

    Quiz Answer Key:

    1. Data warehousing projects focus on organizing, structuring, and preparing data for data analysis, forming the foundation for any data analytics initiatives.
    2. ETL/ELT in SQL involves extracting data from various sources, transforming it to fit the data warehouse schema (cleaning, standardizing), and loading it into the data warehouse for analysis and reporting.
    3. According to Bill Inmon’s definition, the four key characteristics of a data warehouse are subject-oriented, integrated, time-variant, and non-volatile.
    4. Creating a project plan is crucial for data warehouse projects because they are complex, and a clear plan improves the chances of success by providing organization and direction, reducing the risk of failure.
    5. The “separation of concerns” principle involves breaking down a complex system into smaller, independent parts, each responsible for a specific task, to avoid mixing everything and to maintain a clear and efficient architecture.
    6. The bronze layer stores raw, unprocessed data directly from the source systems, the silver layer contains cleaned and standardized data, and the gold layer holds business-ready data transformed and aggregated for reporting and analysis.
    7. Metadata columns are additional columns added to tables by data engineers to provide extra information about each record, such as create date or source system, aiding in data tracking and troubleshooting.
    8. A surrogate key is a system-generated unique identifier assigned to each record to make the record unique. It provides more control over the data model without dependence on source system keys.
    9. The star schema is a data modeling approach with a central fact table surrounded by dimension tables. Fact tables contain events or transactions, while dimension tables hold descriptive attributes, related via foreign keys.
    10. Clear documentation is essential for end users to understand the data model and use the data warehouse effectively.

    Essay Questions:

    1. Discuss the importance of data quality in a modern SQL data warehouse project. Explain the role of the bronze and silver layers in ensuring high data quality, and provide examples of data transformations that might be performed in the silver layer.
2. Describe the Medallion architecture and how it’s implemented using bronze, silver, and gold layers. Discuss the advantages of this architecture, including separation of concerns and data quality management, and explain how data flows through each layer.
    3. Explain the process of creating a detailed project plan for a data warehouse project using a tool like Notion. Describe the key phases and stages involved, the importance of defining epics and tasks, and how this plan contributes to project success.
    4. Explain the importance of source system analysis in a data warehouse project, and describe the key questions that should be asked when connecting to a new source system.
    5. Compare and contrast the star schema with other data modeling approaches, such as snowflake and data vault. Discuss the advantages and disadvantages of the star schema for reporting and analytics, and explain the roles of fact and dimension tables in this model.

    Glossary of Key Terms:

    • Data Warehouse: A subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support management’s decision-making process.
    • ETL (Extract, Transform, Load): A process in data warehousing where data is extracted from various sources, transformed into a suitable format, and loaded into the data warehouse.
    • ELT (Extract, Load, Transform): A process similar to ETL, but the transformation step occurs after the data has been loaded into the data warehouse.
    • Data Architecture: The overall structure and design of data systems, including databases, data warehouses, and data lakes.
    • Data Integration: The process of combining data from different sources into a unified view.
    • Data Modeling: The process of creating a visual representation of data structures and relationships.
    • Bronze Layer: The first layer in a data warehouse architecture, containing raw, unprocessed data from source systems.
    • Silver Layer: The second layer in a data warehouse architecture, containing cleaned and standardized data ready for transformation.
    • Gold Layer: The third layer in a data warehouse architecture, containing business-ready data transformed and aggregated for reporting and analysis.
    • Subject-Oriented: Focused on a specific business area, such as sales, customers, or finance.
    • Integrated: Combines data from multiple source systems into a unified view.
    • Time-Variant: Keeps historical data for analysis over time.
    • Non-Volatile: Data is not deleted or modified once it enters the data warehouse.
    • Project Epic: A large task or stage in a project that requires significant effort to complete.
    • Separation of Concerns: A design principle that breaks down complex systems into smaller, independent parts, each responsible for a specific task.
    • Data Cleansing: The process of correcting or removing inaccurate, incomplete, or irrelevant data.
    • Data Standardization: The process of converting data into a consistent format or standard.
    • Metadata Columns: Additional columns added to tables to provide extra information about each record, such as creation date or source system.
    • Surrogate Key: A system-generated unique identifier assigned to each record, used to connect data models and avoid dependence on source system keys.
    • Star Schema: A data modeling approach with a central fact table surrounded by dimension tables.
    • Fact Table: A table in a data warehouse that contains events or transactions, along with foreign keys to dimension tables.
    • Dimension Table: A table in a data warehouse that contains descriptive attributes or categories related to the data in fact tables.
    • Data Lineage: Tracking the origin and movement of data from its source to its final destination.
    • Stored Procedure: A precompiled collection of SQL statements stored under a name and executed as a single unit.
    • Data Normalization: The process of organizing data to reduce redundancy and improve data integrity.
    • Data Lookup: Joining tables to retrieve specific data, such as surrogate keys, from related dimensions.
    • Data Flow Diagram: A visual representation of how data moves through a system.
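Several of the terms above (surrogate key, fact table, dimension table, data lookup) come together in a typical gold-layer load. The following is a hedged sketch; the table and column names are illustrative, not taken from the source:

```sql
-- Illustrative star-schema data lookup: enrich raw sales events with the
-- surrogate keys of their related dimensions before inserting into the fact table.
INSERT INTO gold.fact_sales (order_number, product_key, customer_key, order_date, sales_amount)
SELECT
    s.order_number,
    p.product_key,    -- surrogate key looked up from the product dimension
    c.customer_key,   -- surrogate key looked up from the customer dimension
    s.order_date,
    s.sales_amount
FROM silver.crm_sales s
LEFT JOIN gold.dim_products  p ON s.product_id  = p.product_id   -- data lookup
LEFT JOIN gold.dim_customers c ON s.customer_id = c.customer_id; -- data lookup
```

The LEFT JOINs keep every sales event even when a dimension record is missing, which makes such gaps visible during quality checks.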

    Modern SQL Data Warehouse Project Guide

This briefing document summarizes the main themes and ideas from the provided text excerpts.

    Briefing Document: Modern SQL Data Warehouse Project

    Overview:

    This document summarizes the key concepts and practical steps outlined in a guide for building a modern SQL data warehouse. The guide, presented by Bar Zini, aims to equip data architects, data engineers, and data modelers with real-world skills by walking them through the creation of a data warehouse project using SQL Server (though adaptable to other SQL databases). The project emphasizes best practices and provides a professional portfolio piece upon completion.

    Main Themes and Key Ideas:

    1. Data Warehousing Fundamentals:
    • Definition: The project begins by defining a data warehouse using Bill Inmon’s classic definition: “A data warehouse is subject oriented, integrated, time variant, and nonvolatile collection of data designed to support the Management’s decision-making process.”
    • Subject Oriented: Focused on business areas (e.g., sales, customers, finance).
    • Integrated: Combines data from multiple source systems.
    • Time Variant: Stores historical data.
    • Nonvolatile: Data is not deleted or modified once entered.
    • Purpose: To address the inefficiencies of data analysts extracting and transforming data directly from operational systems, replacing it with an organized and structured data system as a foundation for data analytics projects.
• SQL Data Warehousing in Relation to Other Types of Data Analytics Projects: The guide notes that SQL data warehousing is the foundation of any data analytics project, and the first step before exploratory data analysis (EDA) and advanced analytics projects become possible.
2. Project Structure and Skills Developed:
    • Roles: The project is designed to provide experience in three key roles: data architect, data engineer, and data modeler.
    • Skills: Participants will learn:
    • ETL/ELT processing using SQL.
    • Data architecture design.
    • Data integration (merging multiple sources).
    • Data loading and data modeling.
    • Portfolio Building: The guide emphasizes the project’s value as a portfolio piece for demonstrating skills on platforms like LinkedIn.
3. Project Setup and Planning (Using Notion):
    • Importance of Planning: The guide stresses that “creating a project plan is the key to success.” This is particularly important for data warehouse projects, where a high failure rate (over 50%, according to Gartner reports) is attributed to complexity.
    • Iterative Planning: The planning process is described as iterative. An initial “rough project plan” is created, which is then refined as understanding of the data architecture evolves.
    • Project Epics (Main Phases): The initial project phases identified are:
    • Requirements analysis.
    • Designing the data architecture.
    • Project initialization.
    • Task Breakdown: The project uses Notion (a free tool) to organize the project into epics and subtasks, enabling a structured approach.
• The guide also mentions the value of icons for adding personal style to the project plan and keeping it organized.
    • Project success: Closing small chunks of work and tasks gives a sense of motivation and accomplishment and helps keep the whole picture of the project in view.
4. Data Architecture Design (Using Draw.io):
    • Medallion Architecture: The guide advocates for a “Medallion architecture” (Bronze, Silver, Gold layers) within the data warehouse.
• Separation of Concerns: A core architectural principle is “separation of concerns.” This means breaking down the complex system into independent parts, each responsible for a specific task, with no duplication of components. “A good data architect follows this principle.”
• Layer Responsibilities:
    • Bronze Layer (Raw Data): Contains raw data, with no transformations. “In the bronze layer it’s going to be the raw data.”
    • Silver Layer (Cleaned and Standardized Data): Focuses on data cleansing and standardization. “In the silver you have clean, standard data.”
    • Gold Layer (Business-Ready Data): Contains business-transformed data ready for analysis. “For the gold we can say business-ready data.”
    • Data Flow Diagram: The project utilizes Draw.io (a free diagramming tool) to visualize the data architecture and data lineage.
• Naming Conventions: A naming convention is created to ensure clarity and consistency, with specific naming rules for tables and columns. Examples include fact_sales for a fact table and dim_customers for a dimension table. The guide recommends documenting each rule with examples so there is a general consensus on how to proceed.
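Applied in T-SQL, naming rules like these might look as follows. This is a sketch; beyond fact_sales and dim_customers, the column names and types are illustrative assumptions:

```sql
-- Dimensions are named dim_<entity>, facts fact_<business process>,
-- and schemas follow the Medallion layers (bronze/silver/gold).
CREATE TABLE gold.dim_customers (
    customer_key INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, warehouse-generated
    customer_id  INT,                            -- original source-system key
    first_name   NVARCHAR(50),
    country      NVARCHAR(50)
);

CREATE TABLE gold.fact_sales (
    order_number NVARCHAR(20),
    customer_key INT,            -- foreign key referencing gold.dim_customers
    order_date   DATE,
    sales_amount DECIMAL(10,2)
);
```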
5. Project Initialization and Tools:
    • Software: The project uses SQL Server Express (database server) and SQL Server Management Studio (client for interacting with the database). Other tools include GitHub and Draw.io. Notion is used for project management.
    • Initial Database Setup: The guide outlines the creation of a new database and schemas (Bronze, Silver, Gold) within SQL Server.
    • Git Repository: The project emphasizes the importance of using Git for version control and collaboration. A repository structure is established with folders for data sets, documents, scripts, and tests.
• README: It is important to create a README file at the root of the repo specifying the main characteristics and goal of the project, so that other developers can better understand it when collaborating.
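In SQL Server Management Studio, the initial database setup described above amounts to a few batches. A minimal sketch, with the database name assumed:

```sql
-- Create the warehouse database and one schema per Medallion layer.
CREATE DATABASE DataWarehouse;
GO
USE DataWarehouse;
GO
CREATE SCHEMA bronze;  -- raw data, as-is from the sources
GO
CREATE SCHEMA silver;  -- cleaned and standardized data
GO
CREATE SCHEMA gold;    -- business-ready data
GO
```

CREATE SCHEMA must be the first statement in its batch, hence the GO separators.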
6. Building the Bronze Layer
• Building the bronze layer starts with data analysis of what is to be built: interviewing source-system experts to identify the source of the data, the size of the data to be processed, the load the extraction will place on the source system (so it is not affected), and the authentication/authorization required, such as access tokens, keys, and passwords.
    • The guide then takes a step-by-step approach, from creating all the required queries and stored procedures to loading the data efficiently. This includes testing that the tables contain no unexpected NULLs and that the separator used matches the data.
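A bronze-layer load following this truncate-then-insert pattern might be sketched as below. The procedure name, file path, and separator are assumptions for illustration:

```sql
-- Full load of one bronze table from a source CSV file.
-- TRUNCATE first so reruns do not duplicate data; the bronze layer
-- stores the rows exactly as they arrive, with no transformations.
CREATE OR ALTER PROCEDURE bronze.load_crm_cust_info AS
BEGIN
    TRUNCATE TABLE bronze.crm_cust_info;

    BULK INSERT bronze.crm_cust_info
    FROM 'C:\datasets\source_crm\cust_info.csv'
    WITH (
        FIRSTROW = 2,           -- skip the header row
        FIELDTERMINATOR = ',',  -- must match the file's separator, as the testing step checks
        TABLOCK                 -- lock the table for a faster bulk load
    );
END;
```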
7. Building the Silver Layer
• The silver layer holds clean, standardized data in its own tables. Data is loaded from the bronze layer using a full load (truncate, then insert), after which many data transformations are applied.
    • The silver layer also implements metadata columns, which store information that does not come directly from the source system, such as create and update dates, the source system, and the file location the data came from. These help track down corrupted data and find gaps in the imported data.
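A silver table carrying such metadata columns might be defined as follows. The metadata columns are the examples the source gives; the dwh_ prefix and the other column names are assumptions:

```sql
-- Silver table: cleaned source columns plus warehouse-generated metadata columns.
CREATE TABLE silver.crm_cust_info (
    customer_id     INT,
    first_name      NVARCHAR(50),
    country         NVARCHAR(50),
    -- Metadata columns: added by the data engineer, not present in the source system.
    dwh_create_date DATETIME2 DEFAULT GETDATE()  -- when the record entered the warehouse
);
```

Because the column has a DEFAULT, the load procedure's INSERT does not need to mention it; every inserted row is stamped automatically.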
8. Building the Gold Layer

The gold layer is focused on business goals and should be easy to consume for business reports, which is why a data model is created for the business area. A data model contains two types of tables: fact tables and dimension tables. Dimension tables are descriptive and give context to the data; for example, a product dimension might hold the product name, category, and subcategory. Fact tables record events, such as transactions, and contain the IDs of related dimensions. A rule of thumb for deciding between the two:
    • “How much” and “how many” questions: fact table.
    • “Who,” “what,” and “where” questions: dimension table.
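A gold-layer dimension built on top of silver data could be exposed as a view, as in the sketch below. The view approach and all names are illustrative assumptions:

```sql
-- Gold dimension view: business-ready, descriptive attributes with a surrogate key.
CREATE VIEW gold.dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY p.product_id) AS product_key,  -- surrogate key
    p.product_id,     -- source-system key, kept for traceability
    p.product_name,
    p.category,
    p.subcategory
FROM silver.crm_prd_info p;
```

Because the dimension is a view, every refresh of the silver layer is immediately reflected in the gold layer without a separate load step.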

9. General Data Cleaning
• The project builds data transformations and cleansing by writing INSERT statements whose SELECTs apply functions to transform and clean the data. This includes checks on primary keys, handling unwanted spaces, resolving inconsistencies in cardinality (the number of elements in a table) by replacing NULL values, and fixing the dates and values of the sales orders.
    • One tool for checking data quality is the quality check: select the data that is incorrect, then apply a quick fix. Any numerical column is best validated against negative numbers and NULL values, and checked against its data type so it can be converted into the right format.
    • In the silver layer, records with implausibly old dates should be removed or flagged, and birthdates lying in the future should be filtered out.
    • To find errors in SQL, TRY...CATCH blocks can wrap the code and print error messages, numbers, and states so that errors are easier to handle.
    • The code also includes techniques for filling missing values and for data normalization.
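Two of the techniques just described, sketched in T-SQL with illustrative table and procedure names:

```sql
-- Quality check: a numerical column validated against negatives and NULLs.
-- Expectation: this query returns no rows once the silver layer is clean.
SELECT sales_amount
FROM silver.crm_sales
WHERE sales_amount < 0 OR sales_amount IS NULL;

-- Error handling: wrap a load in TRY...CATCH and surface message, number, and state.
BEGIN TRY
    EXEC silver.load_crm_sales;
END TRY
BEGIN CATCH
    PRINT 'Error message: ' + ERROR_MESSAGE();
    PRINT 'Error number:  ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
    PRINT 'Error state:   ' + CAST(ERROR_STATE()  AS NVARCHAR(10));
END CATCH;
```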

    In summary, this guide provides a comprehensive, practical approach to building a modern SQL data warehouse, emphasizing structured planning, sound architectural principles, and hands-on coding experience. The emphasis on building a portfolio project makes it particularly valuable for those seeking to demonstrate their data warehousing skills.

    SQL Data Warehouse Fundamentals

    # What is a modern SQL data warehouse?

    A modern SQL data warehouse, following Bill Inmon's definition quoted in the source, is a subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support management's decision-making process. It consolidates data from multiple source systems, organizes it around business subjects (like sales, customers, or finance), retains historical data, and ensures that the data is not deleted or modified once loaded.

    # What are the key roles involved in building a data warehouse project?

    According to the source, building a data warehouse involves several distinct roles:

    * **Data Architect:** Designs the overall data architecture following best practices.

    * **Data Engineer:** Writes code to clean, transform, load, and prepare data.

    * **Data Modeler:** Creates the data model for analysis.

    # What are the three types of data analytics projects that can be done using SQL?

    The three types of data analytics projects described in the source are:

    * **Data Warehousing:** Focuses on organizing, structuring, and preparing data for analysis, which is foundational for other analytics projects.

    * **Exploratory Data Analysis (EDA):** Involves understanding and uncovering insights from datasets by asking the right questions and finding answers using basic SQL skills.

    * **Advanced Analytics Projects:** Uses advanced SQL techniques to answer business questions, such as identifying trends, comparing performance, segmenting data, and generating reports.

    # What is the Medallion architecture and why is it relevant to designing a data warehouse?

    The Medallion architecture is a layered approach to data warehousing composed of three layers:

    * **Bronze Layer:** Raw data “as is” from source systems.

    * **Silver Layer:** Cleaned and standardized data.

    * **Gold Layer:** Business-ready data with transformed and aggregated information.

    The Medallion architecture enables separation of concerns, allowing a distinct set of tasks for each layer, and helps organize and manage the complexity of data warehousing. It provides a structured approach to data processing, ensuring data quality and consistency.

    # What tools are commonly used in data warehouse projects, and why is creating a project plan important?

    Common tools used in data warehouse projects include:

    * **SQL Server Express:** A local server for the database.

    * **SQL Server Management Studio (SSMS):** A client to interact with the database and run queries.

    * **GitHub:** For version control and collaboration.

    * **draw.io:** A tool for creating diagrams, data models, data architectures and data lineage.

    * **Notion:** A tool for project management, planning, and organizing resources.

    Creating a project plan is essential for success due to the complexity of data warehouse projects. A clear plan helps organize tasks, manage resources, and track progress.

    # What is data lineage, and why is it important in a data warehouse environment?

    Data lineage refers to the data’s journey from its origin in source systems, through various transformations, to its final destination in the data warehouse. It provides visibility into the data’s history, transformations, and dependencies. Data lineage is crucial for troubleshooting data quality issues, understanding data flows, ensuring compliance, and auditing data processes.

    # What are surrogate keys, and why are they used in data modeling?

    Surrogate keys are system-generated unique identifiers assigned to each record in a dimension table. They are used to ensure uniqueness, simplify data relationships, and insulate the data warehouse from changes in source system keys. Surrogate keys provide control over the data model and facilitate efficient data integration and querying.
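As a concrete illustration of a surrogate key, the sketch below lets the database generate the key while keeping the source system's own identifier as an ordinary column. The table and column names are invented; SQLite's `INTEGER PRIMARY KEY` auto-assignment stands in for what would be an `IDENTITY` column or a `ROW_NUMBER()` expression in SQL Server.

```python
import sqlite3

# Surrogate-key sketch: the warehouse generates its own key instead of
# reusing the source system's ID.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customers (
    customer_key       INTEGER PRIMARY KEY,  -- surrogate key, system-generated
    source_customer_id TEXT,                 -- the source system's own key
    name               TEXT
)""")
con.executemany(
    "INSERT INTO dim_customers (source_customer_id, name) VALUES (?, ?)",
    [("CRM-001", "Maria"), ("ERP-xyz", "Omar")])  # keys from two different systems

keys = [row[0] for row in con.execute(
    "SELECT customer_key FROM dim_customers ORDER BY customer_key")]
print(keys)  # [1, 2]
```

Because `customer_key` is owned by the warehouse, a renumbering or format change in either source system never breaks the fact-to-dimension joins.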

    # What are some essential naming conventions for data warehouse projects, and why are they important?

    Essential naming conventions help ensure consistency and clarity across the data warehouse. Examples include:

    * Using prefixes to indicate the type of table (e.g., `dim_` for dimension, `fact_` for fact).

    * Consistent naming of columns (e.g., surrogate keys ending with `_key`, technical columns starting with `dw_`).

    * Standardized naming for stored procedures (e.g., `load_bronze` for bronze layer loading).

    These conventions improve collaboration, code readability, and maintenance, enabling efficient data management and analysis.

    Data Warehousing: Architectures, Models, and Key Concepts

    Data warehousing involves organizing, structuring, and preparing data for analysis and is the foundation for any data analytics project. It focuses on how to consolidate data from various sources into a centralized repository for reporting and analysis.

    Key aspects of data warehousing:

    • A data warehouse is subject-oriented, integrated, time-variant, and a nonvolatile collection of data designed to support management’s decision-making process.
    • Subject-oriented: Focuses on specific business areas like sales, customers, or finance.
    • Integrated: Integrates data from multiple source systems.
    • Time-variant: Keeps historical data.
    • Nonvolatile: Data is not deleted or modified once it’s in the warehouse.
    • ETL (Extract, Transform, Load): A process to extract data from sources, transform it, and load it into the data warehouse, which then becomes the single source of truth for analysis and reporting.
    • Benefits of a data warehouse:
    • Organized data: A data warehouse helps organize data so that the data team is not fighting with the data.
    • Single point of truth: Serves as a single point of truth for analyses and reporting.
    • Automation: Automates the data collection and transformation process, reducing manual errors and processing time.
    • Historical data: Enables access to historical data for trend analysis.
    • Data integration: Integrates data from various sources, making it easier to create integrated reports.
    • Improved decision-making: Provides fresh and reliable reports for making informed decisions.
    • Data management: Proper data management is a prerequisite for making real, well-founded decisions.
    • Data modeling: On top of the warehouse, a new data model is created for analyses.

    Different Approaches to Data Warehouse Architecture:

    • Inmon Model: Uses a three-layer approach (staging, enterprise data warehouse, and data marts) to organize and model data.
    • Kimball Model: Focuses on quickly building data marts, which may lead to inconsistencies over time.
    • Data Vault: Adds more standards and rules to the central data warehouse layer by splitting it into raw and business vaults.
    • Medallion Architecture: Uses three layers: bronze (raw data), silver (cleaned and standardized data), and gold (business-ready data).

    The Medallion architecture consists of the following:

    • Bronze Layer: Stores raw, unprocessed data directly from the sources for traceability and debugging.
    • Data is not transformed in this layer.
    • Typically uses tables as object types.
    • Full load method is applied.
    • Access restricted to data engineers only.
    • Silver Layer: Stores clean and standardized data with basic transformations.
    • Focuses on data cleansing, standardization, and normalization.
    • Uses tables as object types.
    • Full load method is applied.
    • Accessible to data engineers, data analysts, and data scientists.
    • Gold Layer: Contains business-ready data for consumption by business users and analysts.
    • Applies business rules, data integration, and aggregation.
    • Uses views as object types for dynamic access.
    • Suitable for data analysts and business users.
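The bronze/silver/gold layering above can be compressed into one runnable sketch. This is an illustrative toy, not the project's actual DDL: SQLite has no schemas, so name prefixes stand in for the `bronze`/`silver`/`gold` schemas, and note that gold is a *view* while the other layers are tables, matching the object types listed above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Bronze: raw data as-is from the source (here, amounts are messy text).
CREATE TABLE bronze_sales (order_id TEXT, amount TEXT);
INSERT INTO bronze_sales VALUES ('SO1', ' 100 '), ('SO2', '250');

-- Silver: cleaned and standardized (trimmed, cast to a number).
CREATE TABLE silver_sales AS
SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
FROM bronze_sales;

-- Gold: business-ready aggregation, exposed as a view for dynamic access.
CREATE VIEW gold_sales_summary AS
SELECT COUNT(*) AS orders, SUM(amount) AS revenue FROM silver_sales;
""")

summary = con.execute("SELECT * FROM gold_sales_summary").fetchone()
print(summary)  # (2, 350.0)
```

Because gold is a view, a fix in the silver layer shows up in business reports without reloading anything downstream.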

    The ETL Process: Extract, Transform, and Load

    The ETL (Extract, Transform, Load) process is a critical component of data warehousing used to extract data from various sources, transform it into a usable format, and load it into a data warehouse. The data warehouse then becomes the single point of truth for analyses and reporting.

    The ETL process consists of three key stages:

    • Extract: Involves identifying and extracting data from source systems without changing it. The goal is to pull out a subset of data from the source in order to prepare it and load it to the target. This step focuses solely on data retrieval, maintaining a one-to-one correspondence with the source system.
    • Transform: Manipulates and transforms the extracted data into a format suitable for analysis and reporting. This stage may include data cleansing, integration, formatting, and normalization to reshape the data into the required format.
    • Load: Inserts the transformed data into the target data warehouse. The prepared data from the transformation step is moved into its final destination, such as a data warehouse.

    In real-world projects, the data architecture may have multiple layers, and the ETL process can vary between these layers. Depending on the data architecture’s design, it is not always necessary to use the complete ETL process to move data from a source to a target. For example, data can be loaded directly to a layer without transformations or undergo only transformation or loading steps between layers.
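The three stages can be made concrete as three small functions. This is a deliberately minimal sketch with an invented in-memory list standing in for a source system; the point is only that extract changes nothing, transform reshapes, and load inserts.

```python
import sqlite3

def extract(source_rows):
    # Extract: pull the data one-to-one, without changing it.
    return list(source_rows)

def transform(rows):
    # Transform: cleanse (trim the name) and derive a new column (amount).
    return [(cid, name.strip(), qty * price)
            for cid, name, qty, price in rows]

def load(con, rows):
    # Load: insert the prepared rows into the target table.
    con.executemany("INSERT INTO dwh_sales VALUES (?, ?, ?)", rows)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dwh_sales (customer_id INT, name TEXT, amount REAL)")

source = [(1, " Anna ", 2, 10.0), (2, "Ben", 1, 5.0)]  # hypothetical source data
load(con, transform(extract(source)))

loaded = con.execute("SELECT * FROM dwh_sales ORDER BY customer_id").fetchall()
print(loaded)  # [(1, 'Anna', 20.0), (2, 'Ben', 5.0)]
```

Dropping the `transform` call from the pipeline gives exactly the "load as-is" pattern used between the source and the bronze layer.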

    Different techniques and methods exist within each stage of the ETL process:

    Extraction:

    • Methods:
    • Pull: The data warehouse pulls data from the source system.
    • Push: The source system pushes data to the data warehouse.
    • Types:
    • Full Extraction: All records from the source tables are extracted.
    • Incremental Extraction: Only new or changed data is extracted.
    • Techniques:
    • Manual extraction
    • Querying a database
    • Parsing a file
    • Connecting to an API
    • Event-based streaming
    • Change data capture (CDC)
    • Web scraping
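One common way to implement incremental extraction from the list above is a watermark: remember the timestamp of the last successful load and pull only rows modified after it. The sketch below assumes a hypothetical `modified_at` column in the source; real sources need some such change-tracking column (or CDC) for this to work.

```python
import sqlite3

# Toy source table with a modification timestamp per row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src_orders (order_id TEXT, modified_at TEXT);
INSERT INTO src_orders VALUES
  ('SO1', '2024-01-01'), ('SO2', '2024-01-05'), ('SO3', '2024-01-09');
""")

# Watermark = high-water mark of the previous extraction run.
watermark = "2024-01-03"

# Incremental extraction: only rows changed after the watermark.
new_rows = con.execute(
    "SELECT * FROM src_orders WHERE modified_at > ?", (watermark,)
).fetchall()
print(new_rows)  # SO2 and SO3 only; SO1 was already loaded last time
```

After a successful load, the watermark would be advanced to the maximum `modified_at` just extracted, so the next run picks up where this one stopped.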

    Transformation:

    • Data enrichment
    • Data integration
    • Deriving new columns
    • Data normalization
    • Applying business rules and logic
    • Data aggregation
    • Data cleansing:
    • Removing duplicates
    • Data filtering
    • Handling missing data
    • Handling invalid values
    • Removing unwanted spaces
    • Casting data types
    • Detecting outliers

    Load:

    • Processing Types:
    • Batch Processing: Loading the data warehouse in one large batch of data.
    • Stream Processing: Processing changes as soon as they occur in the source system.
    • Methods:
    • Full Load:
    • Truncate and insert
    • Upsert (update and insert)
    • Drop, create, and insert
    • Incremental Load:
    • Upsert
    • Insert (append data)
    • Merge (update, insert, delete)
    • Slowly Changing Dimensions (SCD):
    • SCD0: No historization; no changes are tracked.
    • SCD1: Overwrite; records are updated with new information, losing history.
    • SCD2: Add historization by inserting new records for each change and inactivating old records.
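The SCD2 pattern above, inactivate the old record, insert the new one, is short enough to show end to end. This is an illustrative sketch with made-up columns; a production dimension would also carry validity dates alongside the current-record flag.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customers (
    customer_id INT, city TEXT, is_current INT
)""")
con.execute("INSERT INTO dim_customers VALUES (1, 'Berlin', 1)")

def scd2_update(con, customer_id, new_city):
    # 1) Inactivate the current record, which preserves the history...
    con.execute("""UPDATE dim_customers SET is_current = 0
                   WHERE customer_id = ? AND is_current = 1""", (customer_id,))
    # 2) ...then insert the new version as the active record.
    con.execute("INSERT INTO dim_customers VALUES (?, ?, 1)",
                (customer_id, new_city))

scd2_update(con, 1, "Stuttgart")  # the customer moved

history = con.execute(
    "SELECT city, is_current FROM dim_customers ORDER BY is_current"
).fetchall()
print(history)  # [('Berlin', 0), ('Stuttgart', 1)]
```

Contrast with SCD1, which would simply be the UPDATE alone: the new city overwrites Berlin and the history is gone.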

    Data Modeling for Warehousing and Business Intelligence

    Data modeling is the process of organizing and structuring raw data into a meaningful form that is easy to understand. The data is reshaped into friendly, recognizable objects such as customers, orders, and products; each object focuses on one kind of information, and the relationships between the objects are described. The goal is to create a logical data model.

    For analytics, especially in data warehousing and business intelligence, data models should be optimized for reporting, flexible, scalable, and easy to understand.

    Different Stages of Data Modeling:

    • Conceptual Data Model: Focuses on identifying the main entities (e.g., customers, orders, products) and their relationships without specifying details like columns or attributes.
    • Logical Data Model: Specifies columns, attributes, and primary keys for each entity and defines the relationships between entities.
    • Physical Data Model: Includes technical details like data types, lengths, and database-specific configurations for implementing the data model in a database.

    Data Models for Data Warehousing and Business Intelligence:

    • Star Schema: Features a central fact table surrounded by dimension tables. The fact table contains events or transactions, while dimensions contain descriptive information. The relationship between fact and dimension tables forms a star shape.
    • Snowflake Schema: Similar to the star schema but breaks down dimensions into smaller sub-dimensions, creating a more complex, snowflake-like structure.

    Comparison of Star and Snowflake Schemas:

    • Star Schema:
    • Easier to understand and query.
    • Suitable for reporting and analytics.
    • May contain duplicate data in dimensions.
    • Snowflake Schema:
    • More complex and requires more knowledge to query.
    • Optimizes storage by reducing data redundancy through normalization.
    • In practice, the star schema is the more common choice and is well suited for reporting.

    Types of Tables:

    • Fact Tables: Contain events or transactions and include IDs from multiple dimensions, dates, and measures. They answer questions about “how much” or “how many”.
    • Dimension Tables: Provide descriptive information and context about the data, answering questions about “who,” “what,” and “where”.

    In the gold layer, data modeling involves creating new structures that are easy to consume for business reporting and analyses.

    Data Transformation: ETL Process and Techniques

    Data transformation is a key stage in the ETL (Extract, Transform, Load) process where extracted data is manipulated and converted into a format that is suitable for analysis and reporting. It occurs after data has been extracted from its source and before it is loaded into the target data warehouse. This process is essential for ensuring data quality, consistency, and relevance in the data warehouse.

    Here’s a detailed breakdown of data transformation, drawing from the sources:

    Purpose and Importance

    • Data transformation changes the shape of the original data.
    • It is a heavy, work-intensive stage that can include data cleansing, data integration, and various formatting and normalization techniques.
    • The goal is to reshape and reformat original data to meet specific analytical and reporting needs.

    Types of Transformations There are various types of transformations that can be performed:

    • Data Cleansing:
    • Removing duplicates to ensure each primary key has only one record.
    • Filtering data to retain relevant information.
    • Handling missing data by filling in blanks with default values.
    • Handling invalid values to ensure data accuracy.
    • Removing unwanted spaces or characters to ensure consistency.
    • Casting data types to ensure compatibility and correctness.
    • Detecting outliers to identify and manage anomalous data points.
    • Data Enrichment: Adding value to data sets by including relevant information.
    • Data Integration: Bringing multiple sources together into a unified data model.
    • Deriving New Columns: Creating new columns based on calculations or transformations of existing ones.
    • Data Normalization: Mapping coded values to user-friendly descriptions.
    • Applying Business Rules and Logic: Implementing criteria to build new columns based on business requirements.
    • Data Aggregation: Aggregating data to different granularities.
    • Data Type Casting: Converting data from one data type to another.

    Data Transformation in the Medallion Architecture In the Medallion architecture, data transformation is strategically applied across different layers:

    • Bronze Layer: No transformations are applied. The data remains in its raw, unprocessed state.
    • Silver Layer: Focuses on basic transformations to clean and standardize data. This includes data cleansing, standardization, and normalization.
    • Gold Layer: Focuses on business-related transformations needed for the consumers, such as data integration, data aggregation, and the application of business logic and rules. The goal is to provide business-ready data that can be used for reporting and analytics.

    SQL Server for Data Warehousing

    The sources mention SQL Server as a tool used for building data warehouses. It is a platform that can run locally on a PC where a database can reside.

    Here’s what the sources indicate about using SQL Server in the context of data warehousing:

    • Building a data warehouse: SQL Server can be used to develop a modern data warehouse.
    • Project platform: In at least one of the projects described in the sources, the data warehouse was built completely in SQL Server.
    • Data loading: SQL Server is used to load data from source files, such as CSV files, into database tables. The BULK INSERT command is used to load data quickly from a file into a table.
    • Database and schema creation: SQL scripts are used to create a database and schemas within SQL Server to organize data.
    • SQL Server Management Studio: SQL Server Management Studio is a client tool used to interact with the database and run queries.
    • Three-layer architecture: The SQL Server database is organized into three schemas corresponding to the bronze, silver, and gold layers of a data warehouse.
    • DDL scripts: DDL (Data Definition Language) scripts are created and executed in SQL Server to define the structure of tables in each layer of the data warehouse.
    • Stored procedures: Stored procedures are created in SQL Server to encapsulate ETL processes, such as loading data from CSV files into the bronze layer.
    • Data quality checks: SQL queries are written and executed in SQL Server to validate data quality, such as checking for duplicates or null values.
    • Views in the gold layer: Views are created in the gold layer of the data warehouse within SQL Server to provide a business-ready, integrated view of the data.
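The bronze-layer stored procedure described above boils down to a rerunnable "truncate and insert" from a file. Here is a hedged Python stand-in for such a `load_bronze` procedure (in the project itself this is T-SQL using `BULK INSERT`); the CSV content and table name are invented, and the "file" is an in-memory string so the sketch stays self-contained.

```python
import csv
import io
import sqlite3

csv_data = "customer_id,first_name\n1,Maria\n2,Omar\n"  # stand-in for a source CSV file

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bronze_crm_customers (customer_id INT, first_name TEXT)")

def load_bronze(con, fileobj):
    # Full load, "truncate and insert" style: wipe the table first so the
    # procedure can be rerun safely, then bulk-insert every data row.
    con.execute("DELETE FROM bronze_crm_customers")
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO bronze_crm_customers VALUES (?, ?)", reader)

load_bronze(con, io.StringIO(csv_data))
count = con.execute("SELECT COUNT(*) FROM bronze_crm_customers").fetchone()[0]
print(count)  # 2
```

Running the procedure twice leaves the table with the same two rows, which is exactly the idempotence a scheduled bronze load needs.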

    SQL Data Warehouse from Scratch | Full Hands-On Data Engineering Project

    The Original Text

    Hey friends, so today we are diving into something very exciting: building together a modern SQL data warehouse project. But this is not just any project; this one is special. Not only will you learn how to build a modern data warehouse from scratch, but you will also learn how I implement this kind of project in real-world companies. I'm Baraa Salkini, I have built more than five successful data warehouse projects in different companies, and right now I'm leading big data and BI projects at Mercedes-Benz. So that's me; I'm sharing with you real skills and real knowledge from complex projects. Here's what you will get out of this project: as a data architect, we will be designing a modern data architecture following best practices; as a data engineer, you will be writing code to clean, transform, load, and prepare the data for analysis; and as a data modeler, you will learn the basics of data modeling, and we will create from scratch a new data model for analysis. By the end of this project, my friends, you will have a professional portfolio project to showcase your new skills, for example on LinkedIn. So feel free to take the project, modify it, and share it with others; it would mean the world to me if you share my content. And guess what: everything is for free, there are no hidden costs at all. In this project we will be using SQL Server, but if you prefer other databases, like MySQL or Postgres, don't worry, you can follow along just fine. All right my friends, if you want to do data analytics projects using SQL, there are three different types. The first type of project you can do is data warehousing: it's all about how to organize, structure, and prepare your data for data analysis, and it is the foundation of any data analytics project. In the next step you can do exploratory data analysis (EDA), where all you have to do is understand and uncover insights about your data sets. In this kind of project you can learn how to ask the right questions
and how to find the answers using just basic SQL skills. Now, moving on to the last stage, you can do advanced analytics projects, where you use advanced SQL techniques in order to answer business questions: finding trends over time, comparing performance, segmenting your data into different sections, and generating reports for your stakeholders. So here you will be solving real business questions using advanced SQL techniques. Now, what we're going to do is start with the first type of project, SQL data warehousing, where you will gain the following skills. First, you will learn how to do ETL/ELT processing using SQL in order to prepare the data; you will also learn how to build a data architecture, how to do data integration, where we merge multiple sources together, and how to do data loading and data modeling. So if I got you interested, grab your coffee and let's jump into the project. All right my friends, before we deep dive into the tools and the cool stuff, we first have to get a good understanding of what exactly a data warehouse is and why companies build such a data management system. So the question is: what is a data warehouse? I will just use the definition of the father of the data warehouse, Bill Inmon: a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data designed to support management's decision-making process. Okay, I know that might be confusing. Subject-oriented means a data warehouse is always focused on a business area, like sales, customers, finance, and so on. Integrated, because it integrates multiple source systems; usually you build a warehouse not only for one source but for multiple sources. Time-variant means you can keep historical data inside the data warehouse. Nonvolatile means that once the data enters the data warehouse, it is not deleted or modified. So this is how Bill Inmon defined the data warehouse. Okay, so now I'm
going to show you the scenario where your company doesn't have real data management. Let's say you have one system, and one data analyst has to go to this system and start collecting and extracting the data; then he spends days, and sometimes weeks, transforming the raw data into something meaningful. Once he has the report, he goes and shares it, and this data analyst shares the report as an Excel file. Then you have another source of data and another data analyst, and she is doing maybe the same steps: collecting the data, spending a lot of time transforming it, and then sharing a report at the end, this time as a PowerPoint. And a third system, same story, but this time he shares the data maybe using Power BI. Now, if the company works like this, there are a lot of issues. First, this process takes way too long; I have seen scenarios where it sometimes takes weeks and even months for the employees to manually generate those reports. And what happens for the users? They are consuming multiple reports with multiple states of the data: one report is 40 days old, another is 10 days, and a third is maybe 5 days old. It's going to be really hard to make a real decision based on this structure. A manual process is always slow and stressful, and the more employees you involve in the process, the more you open the door for human errors, and errors in reports of course lead to bad decisions. Another issue is handling big data: if one of your sources generates a massive amount of data, the data analyst is going to struggle to collect it, and in some scenarios it will no longer be possible to get the data at all, so the whole process can break and you cannot generate fresh data for specific reports anymore. And one last, very big issue with that: if one of your stakeholders asks for an integrated report from multiple
sources, well, good luck with that, because merging all that data manually is very chaotic, time-consuming, and full of risk. So this is the picture if a company works without proper data management, without a data lake, a data warehouse, or a data lakehouse. In order to make real, good decisions, you need data management. So now let's talk about the scenario with a data warehouse. The first thing that changes is that you will not have your data team collecting the data manually; you are going to have a very important component called ETL. ETL stands for extract, transform, and load: a process that extracts the data from the sources, applies multiple transformations to those sources, and at the end loads the data into the data warehouse. This is going to be the single point of truth for analysis and reporting, and it is called the data warehouse. So what happens now? All your reports consume this single point of truth. With that, you create your multiple reports, and you can also create integrated reports from multiple sources, not only from one single source. Now, looking at the right side, it already looks organized, right? And the whole process is completely automated: there are no more manual steps, which of course reduces human error, and it is pretty fast. Usually you can move the data from the sources to the reports in a matter of hours, sometimes minutes, so there is no need to wait weeks or months to refresh anything. And of course the big advantage is that the data warehouse itself is completely integrated: it brings all those sources together in one place, which makes reporting much easier. And not only integration: you can also build history in the data warehouse, so we now have the possibility to access historical data. What is also amazing is that all those reports have the same data status, maybe
sometimes one day old or so. And of course, if you have a modern data warehouse on a cloud platform, you can really easily handle any big data source, so no need to panic if one of your sources delivers a massive amount of data. Of course, in order to build the data warehouse you need different types of developers. Usually the one who builds the ETL component and the data warehouse is the data engineer; they are the ones accessing the sources, scripting the ETLs, and building the database for the data warehouse. The other part is the responsibility of the data analyst: they are the ones consuming the data warehouse, building different data models and reports, and sharing them with the stakeholders. They are usually contacting the stakeholders, understanding the requirements, and building multiple reports based on the data warehouse. So now, if you look at those two scenarios, this is exactly why we need data management: your data team is not wasting time fighting with the data, they are more organized and more focused, and with a data warehouse you are delivering professional, fresh reports that your company can count on in order to make good, fast decisions. So this is why you need data management like a data warehouse. Think about a data warehouse as a busy restaurant. Every day, different suppliers bring in fresh ingredients: vegetables, spices, meat, you name it. They don't just use everything immediately and throw it all in one pot, right? They clean it, chop it, organize everything, and store each ingredient in the right place, fridge or freezer. This is the preparing phase. And when an order comes in, they quickly grab the prepared ingredients, create a perfect dish, and serve it to the customers of the restaurant. This process is exactly like the data warehouse process: it is like the kitchen where the raw ingredients, your data, are cleaned, sorted, and stored, and when you need a report or
analysis, it is ready to serve up exactly what you need. Okay, so now we're going to zoom in and focus on the ETL component. If you are building such a project, you're going to spend almost 90% of your time just building this component, the ETL, so it is the core element of the data warehouse, and I want you to have a clear understanding of what exactly an ETL is. Our data exists in a source system, and what we want to do is get our data from the source and move it to the target; source and target could be, for example, database tables. Now, the first step we have to do is specify which data we have to load from the source. Of course we could say we want to load everything, but let's say we are doing incremental loads; then we go and specify a subset of the data from the source in order to prepare it and load it later to the target. This step in the ETL process is called extract: we are just identifying the data that we need, we pull it out, and we don't change anything; it stays one-to-one like the source system. So the extract has only one task: to identify the data you have to pull out from the source, and not to change anything. We will not manipulate the data at all; it can stay as it is. This is the first step in the ETL process, the extract. Now, moving on to stage number two: we take this extracted data and do some manipulations and transformations, changing the shape of the data. This process is really heavy work; we can do a lot of stuff, like data cleansing, data integration, and a lot of formatting and data normalization. This is the second step in the ETL process, the transformation: we take the original data and reshape and reformat it into exactly the new format and shape that we need for analysis and reporting. Now, finally, we get to the last step in the ETL process: the load. In this step we take
this new data and insert it into the target. It is very simple: we take the prepared data from the transformation step and move it into its final destination, the target, for example the data warehouse. So that's ETL in a nutshell: first extract the raw data, then transform it into something meaningful, and finally load it to a target where it's going to make a difference. That's it, this is what we mean by the ETL process. Now, in real projects we don't have only a source and a target; our data architecture is going to have multiple layers, depending on your design, whether you are building a data warehouse, a data lake, or a lakehouse, and there are usually different ways to load the data between all those layers. In order to load the data from one layer to another, there are multiple ways of using the ETL process. Usually, when loading the data from the source to layer number one, you take only the data from the source and load it directly into layer one without doing any transformations, because you want to see the data as it is in the first layer. Now, between layer number one and layer number two you might use the full ETL: we extract from layer one, transform it, and then load it to layer number two, so with that we are using the whole process. Between layer two and layer three we can do only transformation and then load: we don't have to deal with how to extract the data, because it is maybe using the same technology and we are taking all the data from layer two to layer three, so we transform the whole of layer two and then load it to layer three. And now, between layers three and four, you could use only the L: maybe it's something like duplicating or replicating the data, and then you do the transformation afterwards, so you load to the new layer and then transform it. Of course this is not a real scenario; I'm just showing you that in
To move data from a source to a target, you don't always have to use the complete ETL; depending on the design of your data architecture, you might use only a few of its components. Okay, so this is how ETL looks in real projects. Now I would like to show you an overview of the different techniques and methods in ETL. We have a wide range of possibilities, and you have to decide which ones you want to apply in your projects. Let's start with the extraction. The first thing I want to show you is that we have two main methods of extraction: either you go to the source system and pull the data, or the source system pushes the data to the data warehouse. Then in extraction we have two types. There is the full extraction, where we take everything, all the records from the tables, and load all the data to the data warehouse every day; or we do something smarter, an incremental extraction, where every day we identify only the new and changed data, so we don't have to load the whole thing: we extract only the new data and load it to the data warehouse. And in data extraction we have different techniques. The first one is manual, where someone has to access a source system and extract the data by hand; or we connect to a database and use a query to extract the data; or we have a file that we parse into the data warehouse. Another technique is to connect to an API and make calls in order to extract the data; or, if the data is available in streaming, for example in Kafka, we can do event-based streaming to extract it. Another way is change data capture (CDC), which is quite similar to streaming; and finally there is web scraping, where you have code that runs and extracts information from the web. So those are the different techniques and types of extraction.
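As a rough sketch of the full-versus-incremental distinction (the column names and the watermark mechanism here are illustrative assumptions, not part of the project):

```python
# Full vs. incremental extraction, sketched with a simple watermark.
# Rows and the "updated_at" column are made up for illustration.

rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

def full_extract(source_rows):
    """Full extraction: take everything, every run."""
    return list(source_rows)

def incremental_extract(source_rows, last_watermark):
    """Incremental extraction: take only rows changed since the last run."""
    return [r for r in source_rows if r["updated_at"] > last_watermark]

everything = full_extract(rows)                      # all 3 rows, every day
new_only = incremental_extract(rows, "2024-01-05")   # only the row for id 3
```

A real incremental extraction would persist the watermark (the highest `updated_at` seen so far) between runs; CDC tools do essentially the same thing at the level of the database log.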
Now if we talk about the transformation, there is a wide range of different transformations we can apply to our data. For example, data enrichment, where we add values to our data sets; or data integration, where we have multiple sources and bring everything into one data model; or deriving new columns based on already existing ones. Another type of transformation is data normalization: the sources have values that are like codes, and you map them to friendlier values that are easier for analysts to understand and use. Then we have business rules and logic: depending on the business, you can define different criteria in order to build new columns. Data aggregation also belongs to transformations; here we aggregate the data to a different granularity. And then we have a type of transformation called data cleansing. There are many different ways to clean our data, for example removing duplicates, data filtering, handling missing data, handling invalid values, removing unwanted spaces, casting data types, detecting outliers, and many more. So there are different kinds of data cleansing we can do in our data warehouse, and this is a very important transformation. As you can see, we have many transformations available in a data warehouse. Now moving on to the load: what do we have here? We have different processing types, either batch processing or stream processing. Batch processing means we are loading the data warehouse in one big batch of data; it is a one-time job that refreshes the content of the data warehouse, and with it the reports. That means we schedule the data warehouse to be loaded once or twice a day. The other type is stream processing.
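Going back to the cleansing transformations for a moment: a few of the listed steps (removing duplicates, trimming spaces, casting types, handling missing values) can be shown in one short, entirely hypothetical example.

```python
# A handful of cleansing steps on a toy record set; the column names and
# the cleansing choices (e.g. missing revenue -> 0.0) are illustrative.

raw = [
    {"id": "1", "country": " DE ", "revenue": "100"},
    {"id": "1", "country": " DE ", "revenue": "100"},   # duplicate record
    {"id": "2", "country": "us",   "revenue": None},    # missing value
]

def cleanse(records):
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:              # removing duplicates
            continue
        seen.add(r["id"])
        out.append({
            "id": int(r["id"]),                         # casting data types
            "country": r["country"].strip().upper(),    # unwanted spaces, standardize
            "revenue": float(r["revenue"]) if r["revenue"] is not None else 0.0,
        })                                              # handling missing data
    return out

clean = cleanse(raw)
```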
Stream processing means that if there is a change in the source system, we process that change as soon as possible, pushing it through all the layers of the data warehouse the moment something changes at the source. So we are streaming the data in order to have a real-time data warehouse, which is a very challenging thing to do in data warehousing. And if we talk about the load itself, we have two methods: either a full load or an incremental load, the same idea as with extraction. For the full load in databases there are different methods. For example, truncate and insert: we make the table completely empty and then insert everything from scratch. Another one is update-insert, which we call upsert: we update the existing records and then insert the new ones. Another way is drop, create, and insert: we drop the whole table, create it again from scratch, and then insert the data; it is very similar to the truncate method, but here we are also removing the table itself. Those are the different methods of full loads. For the incremental load, we can also use upserts, so update and insert statements on our tables; or, if the source is something like a log, we can do inserts only, always appending data to the table without having to update anything. Another way to do an incremental load is a merge, which is very similar to the upsert but also includes a delete: update, insert, delete. So those are the different methods for loading the data into your tables. One more thing: in data warehousing we have something called slowly changing dimensions (SCD). Here it's all about the history of your table, and there are different ways to handle that history. The first type is SCD 0: no historization, nothing should change at all, which means you never update anything.
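Stepping back to the load methods for a moment, here is a minimal sketch of two of them, truncate-and-insert and upsert, using SQLite purely for illustration. The project itself uses SQL Server, where an upsert is typically written as a `MERGE` statement, and the table here is made up.

```python
import sqlite3

# Full load (truncate + insert) vs. incremental load (upsert), sketched
# in SQLite; table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def full_load(rows):
    """Truncate and insert: empty the table, then reload everything."""
    con.execute("DELETE FROM customers")   # SQLite has no TRUNCATE keyword
    con.executemany("INSERT INTO customers VALUES (?, ?)", rows)

def upsert(rows):
    """Incremental load: update existing ids, insert the new ones."""
    con.executemany(
        "INSERT INTO customers VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )

full_load([(1, "Alice"), (2, "Bob")])
upsert([(2, "Bobby"), (3, "Cara")])        # id 2 updated, id 3 appended
result = dict(con.execute("SELECT id, name FROM customers ORDER BY id"))
```

A merge would add the third branch on top of this: deleting target rows that no longer exist in the source.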
The second and more common type is SCD 1: overwrite. You update the records with the new information from the source system, overwriting the old values; so we do something like an upsert, update and insert, but of course you lose the history. Then we have SCD 2, where you want to add historization to your table. What do we do here? For each change we get from the source system, we insert a new record; we are not going to overwrite or delete the old data, we just mark it inactive, and the new record becomes the active one. So there are also different methods of historization you can apply while loading the data into the data warehouse. All right, those are the different types and techniques you might encounter in data management projects. Now let me quickly show you which of them we will be using in our project. For the extraction, we will be doing a pull extraction; as for full or incremental, it's going to be a full extraction; and the technique will be parsing files into the data warehouse. For the data transformation, we will cover everything: all the types of transformations I've shown you will be part of the project, because I believe you will face these transformations in every data project. If we look at the load, our project is going to use batch processing, and for the load method we will do a full load, since we have a full extraction, using truncate and insert. As for historization, we will be doing SCD 1, which means we will be updating the content of the data warehouse. So those are the techniques and types we will use in our ETL process for this project. All right, with that we now have a clear understanding of what a data warehouse is, and we are done with the theory part.
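To make the SCD types above concrete, here is a small sketch on plain Python records. The `is_active` flag and the column names are my own illustration; in a real warehouse an SCD 2 row usually also carries validity dates.

```python
# SCD 1 vs. SCD 2, sketched on a plain list of dict records.

def scd1_apply(table, key, new_value):
    """SCD 1: overwrite in place; the old value (history) is lost."""
    for row in table:
        if row["id"] == key:
            row["city"] = new_value
            return
    table.append({"id": key, "city": new_value})

def scd2_apply(table, key, new_value):
    """SCD 2: keep history; deactivate the old row, insert a new active one."""
    for row in table:
        if row["id"] == key and row["is_active"]:
            row["is_active"] = False
    table.append({"id": key, "city": new_value, "is_active": True})

t1 = [{"id": 7, "city": "Berlin"}]
scd1_apply(t1, 7, "Munich")              # still one row, old value gone

t2 = [{"id": 7, "city": "Berlin", "is_active": True}]
scd2_apply(t2, 7, "Munich")              # two rows, the old one inactive
```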
So now we're going to start with the project, and the first thing you have to do is prepare your environment for developing it. Let's start with that. All right, we go to the link in the description, and from there to the downloads; here you can find all the materials for all courses and projects, but the one we need now is the SQL data warehouse project. So let's go to the link. Here we have a bunch of links that we need for the project, but the most important one, to get all the data and files, is this one: download all project files. After you do that, you're going to get a zip file with a lot of stuff inside, so let's extract it. Inside, if you go over here, you will find the repository structure from git, and the most important part is the data assets: you have two sources, the CRM and the ERP, and in each of them there are three CSV files. Those are the data sets for the project; don't worry about the other files, we will explain them during the project. So go get the data and put it somewhere on your PC where you won't lose it. Okay, what else do we have? We have a link to the git repository; this is the link to my repository that I created for the project, so you can go and access it, but don't worry, we're going to explain its whole structure during the project, and you will be creating your own repository. We also have the link to Notion, where we do the project management; there you will find the main steps, the main phases of the SQL project, as well as all the tasks we will be doing together. And then we have links to the project tools: if you don't have it already, go and download SQL Server Express; it's a server that is going to run locally on your PC, where your database is going to live. Another one you have to download is SQL Server Management Studio; it is
just a client for interacting with the database; that's where we're going to run all our queries. Then there is a link to GitHub, and also a link to draw.io: if you don't have it already, go and download it, it is a free and amazing tool for drawing diagrams. Throughout the project we will be drawing data models, the data architecture, the data lineage, so we'll be doing a lot with this tool; go and download it. And the last thing, which is nice to have: you have a link to Notion, where you can create a free account if you want to build the project plan and follow along with me by creating the project steps and tasks. Okay, those are all the links for the project, so go and download all those tools, create the accounts, and once you are ready we continue with the project. All right, I hope you have downloaded everything and created the accounts. Now it's time to move to a very important step that almost everyone skips when doing projects, and that is creating the project plan. For that we will be using the tool Notion. Notion is a free tool that can help you organize your ideas, plans, and resources all in one place; I use it intensively for my private projects, for example for creating this course, and I can tell you that creating a project plan is the key to success. A data warehouse project is usually very complex, and according to Gartner reports over 50% of data warehouse projects fail; in my opinion, for any complex project the key to success is a clear project plan. At this phase we are going to create a rough project plan, because at the moment we don't yet have a clear understanding of the data architecture. So let's go. Okay, now let's create a new page and call it data warehouse project. The first thing is to create the main phases and stages of the project, and for that we need a table, so in order
to do that, hit slash and then type "database inline". Let's call it something like "data warehouse epics", and we're going to hide that title because I don't like it; then on the table we can rename it to, for example, "project epics". Now we're going to list all the big tasks of the project. An epic is usually a large task that needs a lot of effort to solve; you can call them epics, stages, phases of the project, whatever you want. So let's list our project steps: it starts with the requirements analysis, then designing the data architecture, and another one is the project initialization. Those are the first three big tasks in the project. Next, we need another table for the smaller chunks of work, the subtasks, and we do the same thing: hit slash, search for the inline database, call it "data warehouse tasks", hide the title, and rename the table to "project tasks". Now go to the plus icon and search for "relation", the one with the arrow, then search for the name of the first table, "data warehouse epics", click it, and choose "two-way relation"; let's add the relation. With that we get a field in the new table called "data warehouse epics" that comes from the first table, and the first table gets a "data warehouse tasks" field that comes from the table below, so as you can see we have linked them together. Now I'm going to move this to the left side, and then we're going to select one of those epics, for example "design the data architecture", and break down this
epic into multiple tasks, for example "choose data management approach". Then we add another task, selecting the same epic: maybe the next step is "brainstorm and design the layers". Then let's go to another epic, for example the project initialization, and add "create git repo and prepare the structure"; we can make another one in the same epic, say "create the database and the schemas". As you can see, I'm just defining the subtasks of those epics. Next we're going to add a checkbox so we can track whether a task is done or not: go to the plus, search for "checkbox", and make the column really small; each time we finish a task we'll click it to mark it as done. Now there is one more thing that doesn't work nicely: we're going to end up with a long list of tasks here, and it's really annoying. So go to the plus again, search for "rollup", and select it. Now we have to choose the relationship, which is the data warehouse tasks, and then set the property to the checkbox. As you can see, the first table now shows how many tasks are closed, but I don't want to display it like this: go to the calculation, choose percent, then "percent checked", and with that we can see the progress of our project. And instead of the numbers we can show a really nice bar. Great. We can also give the column a name, like "progress". That's it, and we can hide the "data warehouse tasks" column. Now we have a really nice progress bar for each epic, and if we close all the tasks of an epic we can see that we have
reached 100%. So this is the main structure; now we can add some cosmetics and rename things to make everything look nicer. For example, I can go to the tasks table, call it "tasks", and change its icon to something like this. And if you'd like an icon for each of the epics, go to the epic, for example "design data architecture", hover over the title, and you'll see "add icon"; pick any icon you want, for example this one. Now it is defined at the top, and the icon also shows up in the table below. One more thing we can do for the project tasks is to group them by the epics: go to the three dots, then "group", and group by the epics. As you can see, we now have a section for each epic, and you can sort the epics if you want: go over here, sort, then manual, and arrange the epics as you like. With that you can expand and collapse each section, so you don't always have to see all tasks at once. This is a really nice way to build lightweight project management for your projects. Of course, in companies we use professional tools for this, for example Jira, but for private projects I always do it like this, and I really recommend it, not only for this project but for any project you do: if you see the whole project in one view you can see the big picture, and closing tasks like this, these small things, can make you really satisfied, keep you motivated to finish the whole project, and make you proud. Okay friends, I just went and added a few icons, renamed some things, and added more tasks for each epic; this is going to be our starting point in the project, and once we have more information we will add more details on how exactly
we're going to build the data warehouse. At the start we're going to analyze and understand the requirements, and only after that will we start designing the data architecture. Here we have three tasks: first we have to choose the data management approach, after that we do the brainstorming and design the layers of the data warehouse, and at the end we draw the data architecture, so that we have a clear understanding of how it looks. After that we move to the next epic, where we start preparing the project. Once we have a clear understanding of the data architecture, the first task is to create detailed project tasks, so we'll add more epics and more tasks. Once that is done, we create the naming conventions for the project, to make sure we have rules and standards throughout; next we create a repository in git and prepare its structure, so that we always commit our work there; and then we can start with the first script, where we create the database and schemas. So my friends, this is the initial plan for the project. Now let's start with the first epic: the requirements analysis. Analyzing the requirements is very important in order to understand which type of data warehouse you're going to build, because there is not only one standard way to build it, and if you go implementing a data warehouse blindly, you might do a lot of work that is totally unnecessary and burn a lot of time. That's why you have to sit with the stakeholders, with the departments, and understand what exactly has to be built; depending on the requirements, you design the shape of the data warehouse. So now let's analyze the requirements of this project. The whole project is split into two main sections: in the first section we
have to build a data warehouse; this is a data engineering task, where we will develop the ETL and the data warehouse itself. Once that is done, we have to build the analytics and reporting, the business intelligence, so we'll be doing data analysis. But first we will focus on the first part, building the data warehouse. So what do we have here? The statement is very simple: develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision-making. That's the main statement, and then we have specifications. The first one is about the data sources: import data from two source systems, ERP and CRM, provided as CSV files. The second is about data quality: we have to clean and fix data quality issues before we do the data analysis, because let's be real, no raw data is perfect, something is always missing, and we have to clean that up. The next one is about integration: we have to combine both sources into one single, user-friendly data model that is designed for analytics and reporting; that means we have to merge those two sources into a single data model. Then we have another specification: focus on the latest data set, so there is no need for historization, which means we don't have to build histories in the database. And the final requirement is about documentation: provide clear documentation of the data model to support both business users and analytics teams, which means we have to produce a manual that helps the users and makes life easier for the consumers of our data. So as you can see, these are maybe very generic requirements, but they already contain a lot of information for you: we have to use the SQL Server platform, we have two source systems using
CSV files, and it sounds like we really have bad data quality in the sources. It also wants us to focus on building a completely new data model designed for reporting, it says we don't have to do historization, and it expects us to produce documentation of the system. These are the requirements for the data engineering part, where we're going to build a data warehouse that fulfills them. All right, with that we have analyzed the requirements, and we have closed the first and easiest epic; we're done with it, so let's close it and open the next one. Here we have to design the data architecture, and the first task is to choose the data management approach, so let's go. Designing a data architecture is exactly like building a house: before construction starts, an architect designs a plan, a blueprint for the house: how the rooms will be connected, how to make the house functional, safe, and pleasant to live in. Without this blueprint from the architect, the builders might create something unstable, inefficient, or maybe unlivable. The same goes for data projects: a data architect is like a house architect, designing how your data will flow, integrate, and be accessed. As data architects we make sure the data warehouse is not only functional but also scalable and easy to maintain, and this is exactly what we will do now: we will play the role of the data architect and start brainstorming and designing the architecture of the data warehouse. I'm going to show you a sketch to explain the different approaches for designing a data architecture. This phase of a project is usually very exciting for me, because this is my main role in data projects: I am a data architect, and I discuss many different projects where we try to find the best design. All right, let's go. The first step of building a data architecture is to make
a very important decision: to choose between four major types. The first approach is to build a data warehouse; it is very suitable if you have only structured data and your business wants a solid foundation for reporting and business intelligence. Another approach is to build a data lake; this one is way more flexible than a data warehouse, because you can store not only structured data but also semi-structured and unstructured data. We usually use this approach if you have mixed types of data, like database tables, logs, images, videos, and your business wants to focus not only on reporting but also on advanced analytics or machine learning. But it is not as organized as a data warehouse, and a data lake that is too disorganized can turn into a data swamp, which is where the next approach comes in. The next option is to build a data lakehouse; it is like a mix between a data warehouse and a data lake: you get the flexibility of having different types of data, from the data lake world, but you still structure and organize your data like we do in a data warehouse. You mix those two worlds into one, and this is a very modern way of building data architectures; it is currently my favorite way of building a data management system. The last and most recent approach is to build a data mesh. This one is a little different: instead of having a centralized data management system, the idea of the data mesh is to make it decentralized, because centralized always means a potential bottleneck. Instead, you have multiple departments and domains, where each one builds a data product and shares it with the others. So now you have to pick one of those approaches, and in this project we will be focusing on the data warehouse. The next question is how to build the data warehouse; there are likewise four different approaches. The first one is the Inmon approach: again,
you have your sources, and in the first layer you start with the staging, where the raw data lands. In the next layer you organize your data in something called the enterprise data warehouse, where you model the data using the third normal form; it's about how to structure and normalize your tables, so you are building a new, integrated data model from the multiple sources. Then we go to the third layer, called the data marts, where you take a small subset of the data warehouse and design it to be ready for consumption by reporting; each mart focuses on only one topic, for example customers, sales, or products. After that you connect your BI tool, like Power BI or Tableau, to the data marts. So with Inmon you have three layers to prepare the data before reporting. Moving on to the next one, we have the Kimball approach. Kimball says: building this enterprise data warehouse wastes a lot of time, so we can jump immediately from the stage layer to the final data marts, because building the enterprise data warehouse is a big struggle. He wants you to focus on building the data marts as quickly as possible. It is a faster approach than Inmon, but over time you might end up with chaos in the data marts, because you are not always focusing on the big picture and you might be repeating the same transformations and integrations in different data marts; so there is a trade-off between speed and a consistent data warehouse. Moving on to the third approach, we have the Data Vault. We still have the stage and the data marts, and it says we still need this central data warehouse in the middle, but it brings more standards and rules to that middle layer: it tells you to split it into two layers, the raw vault and the business vault. In the raw vault you have the original data, and in the business vault you have all the business rules and transformations
that prepare the data for the data marts. So the Data Vault is very similar to Inmon, but it brings more standards and rules to the middle layer. Now I'm going to add a fourth one, which I'll call the Medallion architecture, and this one is my favorite because it is very easy to understand and to build. It says you build three layers: bronze, silver, and gold. The bronze layer is very similar to the stage, but we have understood over time that this layer is very important, because having the original data as it is helps a lot with traceability and finding issues. The next layer is the silver layer, where we do transformations and data cleansing, but we don't apply any business rules yet. The last layer, the gold layer, is very similar to the data marts, but there we can build different types of objects, not only for reporting but also for machine learning, for AI, and for many other purposes; they are business-ready objects that you want to share as data products. So those are the four approaches you can use to build a data warehouse. Again: if you are building a data architecture, you first have to specify which approach you want to follow. At the start we said we want to build a data warehouse, and then we had to decide between these four approaches for building it; in this project we will be using the Medallion architecture. This is a very important question you have to answer as the first step of building a data architecture. All right, with that we have decided on the approach, so we can mark the task as done. The next step is to design the layers of the data warehouse. Now, there is no 100% standard set of rules for each layer; what you have to do as a data architect is define exactly what the purpose of each layer is. We start with the bronze layer: we say it's going to store raw and
unprocessed data, exactly as it comes from the sources. Why are we doing that? For traceability and debugging: it is very important to have a layer that keeps the raw data as delivered, because we can always go back to the bronze layer and investigate the data of a specific source if something goes wrong. So the main objective is to have raw, untouched data that helps you as a data engineer analyze the root cause of issues. Moving on to the silver layer: this is where we store clean and standardized data, and it is the place where we do basic transformations in order to prepare the data for the final layer. As for the gold layer, it is going to contain business-ready data; the main goal here is to provide data that can be consumed by business users and analysts to build reporting and analytics. With that, we have defined the main goal of each layer. Next I would like to define the object types: since we are talking about a data warehouse in a database, we generally have two types here, either a table or a view. For the bronze layer and the silver layer we go with tables, but for the gold layer we go with views. Best practice says: make the last layer of your data warehouse virtual, using views; it gives you a lot of flexibility and of course speed in building it, since we don't need a load process for it. The next step is to define the load method. In this project I have decided to go with the full load, using the truncate-and-insert method, because it is just faster and way easier. So for the bronze layer we go with a full load, and for the silver layer we also go with a full load; for the views, of course, we don't need any load process. Each time you decide to go with tables, you have to define the load method: full load, incremental load, and so on.
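A tiny sketch of these decisions: tables for bronze and silver with a truncate-and-insert full load, and a view for gold. This uses SQLite instead of SQL Server, and the object and column names are made up for illustration.

```python
import sqlite3

# Tables for the loaded layers, a view for the gold layer.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE silver_sales (order_id INTEGER, amount REAL);
    -- gold layer: virtual, no load process of its own
    CREATE VIEW gold_sales_summary AS
        SELECT COUNT(*) AS orders, SUM(amount) AS total FROM silver_sales;
""")

def full_load_silver(rows):
    """Truncate-and-insert full load into the silver table."""
    con.execute("DELETE FROM silver_sales")
    con.executemany("INSERT INTO silver_sales VALUES (?, ?)", rows)

full_load_silver([(1, 10.0), (2, 5.5)])
summary = con.execute("SELECT orders, total FROM gold_sales_summary").fetchone()
```

Because the gold object is a view, it needs no load step: it always reflects whatever the last full load put into the silver table.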
Now we come to a very interesting part: the data transformations. For the bronze layer this topic is the easiest, because we don't do any transformations; we have to commit to not touching the data: do not manipulate it, don't change anything, it stays as it is. If it comes in bad, it stays bad in the bronze layer. Then we come to the silver layer, where we do the heavy lifting: as we committed in its objective, we have to produce clean and standardized data, and for that we apply different types of transformations. We do data cleansing, data standardization, data normalization, we derive new columns, and we do data enrichment; there is a whole bunch of transformations we have to apply in order to prepare the data. Our focus here is to transform the data so that it is clean and follows standards, and to push all business transformations to the next layer. That means in the gold layer we focus on the business transformations that the consumers and use cases need: we do data integration between source systems, data aggregations, we apply a lot of business logic and rules, and we build a data model that is ready, for example, for business intelligence. So in the gold layer we do a lot of business transformations, and in the silver layer we do basic data transformations. It is really important here to decide precisely which types of transformations are done in each layer, and to make sure you commit to those rules. The next aspect is data modeling. In the bronze layer and the silver layer we will not break the data model that comes from the source system: if the source system delivers five tables, we will have five tables here, and in the silver layer as well we will not denormalize or normalize or build something new; we leave it exactly as it comes from the source system, because
What we are going to do is build the data model in the gold layer, and here you have to define which model to follow: the star schema, the snowflake schema, or just aggregated objects. So make a list of all the data-model types you will follow in the gold layer.

Finally, for each layer you can specify the target audience, and this is a very important decision. In the bronze layer you don't want to give access to any end user; it is really important that only data engineers access the bronze layer. It makes no sense for data analysts or data scientists to work with the bad data when a better version exists in the silver layer. In the silver layer, data engineers need access, and so do the data analysts, data scientists, and so on; but you still don't give it to business users who can't deal with the raw data model from the sources, because for the business users you are going to build a better layer: the gold layer. The gold layer suits data analysts as well as business users, since business users usually don't have deep knowledge of the technicalities of the silver layer. So if you are designing multiple layers, you have to discuss all these topics and make a clear decision for each layer.

All right my friends, before we proceed with the design, I want to tell you a principle that every data architect must know: separation of concerns. As you design an architecture, you have to break the complex system down into smaller, independent parts, each responsible for a specific task, and here comes the magic: the components of your architecture must not be duplicated. You cannot have two parts doing the same thing. The idea is to not mix everything; this is one of the biggest mistakes in big projects, and I have seen it almost everywhere. A good data architect follows this principle. If you look at our data architecture, we have already applied it: we defined a unique set of tasks for each layer. We said the silver layer does data cleansing while the gold layer does business transformations, so you are not allowed to do any business transformations in the silver layer, and likewise no data cleansing in the gold layer. The same goes for the bronze and silver layers: you are not allowed to load data from the source systems directly into the silver layer, because we decided the landing layer is the bronze layer. Otherwise you would have one set of source systems landing first in bronze while another set skips it and goes straight to silver, and with that you have overlap: data ingestion happening in two different layers. My friends, if you adopt this separation-of-concerns mindset, I promise you, you are going to be a data architect. Think about it.

All right, with that we have designed the layers of the data warehouse and can close this step. Next we go to draw.io and start drawing the data architecture. There is no single standard for how to draw a data architecture; you can add your own style. The first thing to show is the different layers we have. The first layer is the source-system layer: take a box, make it a bit bigger, remove the fill, make the line dotted, and change the color to gray; now we have a container for the first layer. Then we add a text label on top: another box with "Sources" typed inside, font size 24, lines removed, made a little smaller and placed on top. This is the first layer, where the data comes from. The data then flows into a data warehouse, so I duplicate the container and label it "Data Warehouse". The third layer is the consumers of the data warehouse, so another box labeled "Consume". Those are the three containers. Inside the data warehouse we decided to use the medallion architecture, so we need three layers inside it: take another box, label it "Bronze", give it a bronze color, font size 20, make it smaller, and place the title above an empty, unfilled container. Duplicate it for the silver layer, change the coloring and lines to gray since it is silver, remove the fill, and maybe make the font bold. The third is the gold layer, styled with a yellow color and the same unfilled container. With that we are showing the different layers inside our data warehouse. These containers are still empty, so what we'll do next is go inside each one of them and
start adding content. In the sources container, it is very important to make clear which types of source systems connect to the data warehouse, because in a real project there are many: you might have a database, an API, files, Kafka. In our project we have folders, and inside those folders we have CSV files, so we have to make clear that the input for our project is CSV files. How you show that is up to you: I add a label "Folder", place a folder icon inside the container, then search for a file icon, pick one, shrink it, and put it on top of the folder. With that, everyone seeing the architecture understands the source is not a database and not an API; it is files inside folders. It is also very important to show which source systems are involved in the project. We have one source called CRM, so I give it a name and an icon, and another source called ERP, so I duplicate the first and rename it. Now it is clear to everyone that we have two sources and the technology used is simply files. We can also add descriptions inside the box to make it clearer: I draw a gray line to split the description from the icons, then add text such as "CSV files" and "Interface: files in folders". Of course you can add any specification or explanation about the sources; for a database you would note the database type, and so on.

With that, the data architecture makes the sources of our data warehouse clear. The next step is to design the content of the bronze, silver, and gold containers. I start by adding a database icon in each container to show that we are talking about a database: search for "database", pick an icon, make it bigger, adjust the color, and place one in the bronze, the silver, and the gold. Then we add arrows between those layers: search for "arrow", pick one, give it a color, adjust it, and place these arrows between all the layers, and between the gold layer and the consume layer, to explain the direction of the architecture: we read it from left to right. Next we add one statement about each layer's main objective: grab a text, put it beneath each database icon, maybe make it bigger, and write "Raw data" for the bronze layer, "Clean, standardized data" for the silver, and "Business-ready data" for the gold. With that, the objective of each layer is clear. Below those icons we add a separator in each layer, colored, and beneath it the most important specifications of that layer.

So, what is the object type of the bronze layer? A table. We add the load method: batch processing, since we are not doing streaming; a full load, not an incremental load; so truncate and insert. Then one more section about the transformations: no transformations; and one about the data model: none (as-is). I add the same specifications for the silver and the gold: the object type, the load process, the transformations, and whether we break the data model or not. With that we have a really nice layering of the data warehouse. What we are left with is the consumers: there you can add the different use cases and tools that will access your data warehouse. I am adding business intelligence and reporting, maybe using Power BI or Tableau; ad-hoc analysis using SQL queries, which is what we will focus on in this project after we build the warehouse; and machine learning purposes. It is also really nice to add some icons to your architecture; I usually use a website called Flaticon, which has amazing icons you can use. Of course we could keep adding icons and details to explain the architecture and the systems, for example which tools you are using to build this data warehouse: is it in the cloud, are you using Azure, Databricks, or maybe Snowflake? For our project I add the SQL Server icon, since we are building this data warehouse entirely in SQL Server. For now I am really happy with it; as you can see, we now have a plan.
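To make the table-versus-view decision above concrete: a gold-layer object would be defined as a view on top of silver tables, roughly like this. The view name follows the gold naming convention used later in this project, but the column list is an illustrative assumption:

```sql
-- Gold layer as a view: no load process needed; it always
-- reflects the current state of the silver layer beneath it.
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key, -- surrogate key
    ci.cst_id        AS customer_id,
    ci.cst_firstname AS first_name,
    ci.cst_lastname  AS last_name
FROM silver.crm_cust_info AS ci;
```

This is why the gold layer is "virtual": dropping and recreating the view is cheap, and there is nothing to reload when silver changes.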
All right guys, with that we have designed the data architecture in draw.io, which was the last step in this epic, so we can close it. Now let's go to the next one, where we start preparing the project; the first task is to create a detailed project plan. It is now clear that we have three layers to build, so our big epics follow the layers: I have added three more epics, build bronze layer, build silver layer, and build gold layer, and under each I have defined the tasks to follow. First comes analyzing, then coding, then testing; once everything is ready we document, and at the end we commit our work to the Git repo. All these epics follow the same task pattern, so we now have a very detailed project structure and a much clearer picture of how we are going to build the data warehouse. With that, this task is done; the next one is to define the naming conventions of the project.

At this phase of a project we usually define the naming conventions: a set of rules for naming everything in the project, whether it is a database, schema, table, stored procedure, or folder. If you don't do this at an early phase, I promise you chaos can happen, because you will have different developers in your project, and each of them has their own style. One developer might name a table dimension_customers, everything lowercase with underscores between the words; another might create DimensionProducts in Pascal case, with no separation between the words and the first character of each word capitalized; and maybe another uses prefixes and abbreviations like dim_categories, a shortcut for "dimension". As you can see, there are different designs and styles, and if you leave the door open, in the middle of the project you will notice everything looks inconsistent and you will have to define a big task to rename everything to a specific rule. Instead of wasting all that time, define the naming conventions at this phase. So let's do that.

We start with a very important decision: which case convention to follow across the whole project. There are different cases: camel case, Pascal case, kebab case, and snake case. For this project we go with snake_case, where all letters of a word are lowercase and words are separated by an underscore, for example a table named customer_info: "customer" lowercase, "info" lowercase, an underscore between them. This is always the first thing to decide for your data project. The second is to decide the language: I work in Germany, for example, and there is always a decision to make between German and English, so we have to decide which language the project will use. And a very important general rule: avoid reserved words. Don't use a SQL reserved word as an object name; for example, don't name a table "table". Those are the general rules to follow across the whole project; they apply to everything: tables, columns, stored procedures, any name you write in your scripts. Moving on, we have specifications for table names, with a different set of rules for each layer. For the bronze layer the rule says sourcesystem_entity, meaning all the
tables in the bronze layer should start with the source-system name, for example crm or erp, then an underscore, then the entity (table) name at the end. For example, the table name crm_cust_info means the table comes from the CRM source system and the entity is the customer info. This is the rule we follow for naming all tables in the bronze layer. Moving on to the silver layer: it is exactly like the bronze, because we are not renaming anything and not building any new data model, so the naming is one-to-one with the bronze, following the same rules. In the gold layer, however, we are building a new data model, so we do rename things; and since we are also integrating multiple sources, we will not use the source-system name in the tables, because one table could contain multiple sources. The rule says: all names must be meaningful, business-aligned names, starting with a category prefix; so the pattern is category_entity. Now, what is a category? In the gold layer we have different types of tables: one could be a fact table, another a dimension, a third an aggregation or report, and we specify that type as a prefix at the start. For example, in fact_sales the category is fact and the table name is sales. I have made a small table of the different patterns: dimensions start with dim_, like dim_customers or dim_products; fact tables start with fact_; and aggregated tables use the agg_ prefix, like agg_customers or agg_sales_monthly.
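Put together, the table-naming rules above yield names like the following. The gold names come from the patterns just described; the second bronze entity is an illustrative assumption. The query is one quick way to audit existing names against the convention:

```sql
-- Bronze and silver: <sourcesystem>_<entity>, unchanged from the source:
--   bronze.crm_cust_info, silver.crm_cust_info
-- Gold: <category>_<entity>, business-aligned:
--   gold.dim_customers, gold.fact_sales, gold.agg_sales_monthly

-- List all tables per schema to check them against the naming rules.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME;
```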
As you can see, when you create a naming convention you first have to state the rule clearly, describe each part of it, and give examples; with that you make it clear to the whole team which names to follow. We talked here about the table naming convention; you can do the same for columns. For example, in the gold layer we are going to have surrogate keys, and we can define the rule like this: a surrogate key is the table name followed by _key, for example customer_key, the surrogate key in the customers dimension. The same goes for technical columns. As data engineers, we might add our own columns to the tables, columns that don't come from the source system; these are the technical columns, sometimes called metadata columns. To separate them from the original columns that come from the source system, we give them a prefix: the rule says any technical or metadata column starts with dwh_ followed by the column name. For example, the load-date metadata column becomes dwh_load_date. With that, anyone who sees a column starting with dwh_ understands the data comes from a data engineer, not from the source. And we can keep adding rules, for example for stored procedures: an ETL script should start with the prefix load_ followed by the layer, so the stored procedure responsible for loading the bronze layer is called load_bronze, and for the silver it's load_silver. Those are the current rules for stored procedures, and this is how I usually do it in my projects. All right my friends, with that we have solid naming conventions for our project, so this step is done. Next we'll go to Git, create a brand-new repository, and prepare its structure. Let's go.
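As a small sketch of the column and stored-procedure rules in practice (the table it alters is the project's silver table; the procedure body is a placeholder assumption):

```sql
-- Technical/metadata column: the dwh_ prefix marks it as engineer-added.
ALTER TABLE silver.crm_cust_info
ADD dwh_load_date DATETIME2 DEFAULT GETDATE();
GO

-- ETL stored procedure: named load_<layer> per the convention.
CREATE PROCEDURE load_bronze AS
BEGIN
    -- full load of each bronze table would go here, e.g.:
    TRUNCATE TABLE bronze.crm_cust_info;
    -- ... BULK INSERT statements per table ...
END;
GO
```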
All right, now we come to another important step in any project: creating the Git repository. If you are new to Git, don't worry about it; it is simpler than it sounds. It's all about having a safe place for the code you are developing, with the ability to track everything that happens to it; you can use it to collaborate with your team, and if something goes wrong you can always roll back. And the best part: once you are done with the project, you can share your repository as part of your portfolio. It is a really amazing thing when applying for a job to showcase your skills with a well-documented Git repository showing that you have built a data warehouse. So let's create the repository for the project. We are at the overview of our account: go to Repositories, then click the green New button. The first thing is to give the repository a name: let's call it "SQL data warehouse project". Then give it a description, for example "Building a modern data warehouse with SQL Server". For the public/private option I'll leave it public; then add a README file; and for the license, select MIT: the MIT license gives everyone the freedom to use and modify your code. I'm happy with the setup, so let's create the repository; with that we have our brand-new repo. The next step I usually take is creating the structure of the repository, and I always follow the same pattern in any project: we need a few folders to put our files in. So I go to Add file, Create new file, and start creating the structure. The first thing we need is "datasets", then a slash, and with that the repo can
understand that this is a folder, not a file. Then you can add anything inside it, like "placeholder", just an empty file whose only job is to let the folder exist. Commit the changes, and back on the main project page you can see we now have a folder called datasets. I keep creating: the documents folder with a placeholder, commit; then scripts with a placeholder; and the final one I usually add is tests. With that we have the main folders of our repository. The next thing I usually do is edit the main README, which you can see here as well: go into the README, hit the edit button, and start writing the main information about the project. This really depends on your style; you can add whatever you want, since this is the main page of your repository. Notice the file name ends in .md, which stands for Markdown: an easy, friendly format for writing text. If you are writing documentation, it is a really nice format for organizing and structuring it. At the start I give a short description of the project: a main title, then a welcome message and what the repository is about; in the next section we can start with the project requirements; and at the end a few words about the licensing and a few words about you. As you can see, it's like the homepage of the project and the repository. Once you are done, commit the changes; on the main page of the repository you always see the folders and files at the top, and below them the information from the README.
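The resulting layout, sketched as a tree (folder names as described above; the placeholder files exist only so Git tracks the otherwise-empty folders):

```text
sql-data-warehouse-project/
├── datasets/      # source CSV files
├── documents/     # project documentation
├── scripts/       # SQL scripts (DDL, ETL)
├── tests/         # quality checks
└── README.md      # project homepage
```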
Again, here we have the welcome statement, then the project requirements, and at the end the licensing and the about-me section. So my friends, that's it: we now have a repository with its main structure, and throughout the project, as we build the data warehouse, we will commit all our work to it. Nice, right? With that, this step is done, and now the last step: we finally go to SQL Server and write our first script, where we create a database and schemas. All right, the first step is to create a brand-new database. To do that, we first switch to the master database: type "use master" and a semicolon, and if you execute it, we are switched to master. It is a system database in SQL Server from which you can create other databases, and you can see in the toolbar that we are now logged into the master database. Next we create our new database: "create database", and you can call it whatever you want; I'll go with DataWarehouse, then a semicolon, and execute. With that we have created our database. Let's check it in the Object Explorer: refresh, and you can see our new DataWarehouse database. Awesome, right? Next we switch to the new database: "use DataWarehouse" and a semicolon; execute, and now we are logged into the DataWarehouse database and can start building things inside it. The first thing I usually do is create the schemas. What is a schema? Think of it like a folder or a container that helps you keep things organized, so now
as we decided in the architecture, we have three layers, bronze, silver, and gold, and we are going to create a schema for each. Start with the first one: "create schema bronze", a semicolon, and execute. Nice, we have a new schema. To check it, go to our database, then Security, then Schemas, and you can see the bronze; if you don't find it, refresh the schemas and you will. Great, the first schema is there. Now we create the other two: I duplicate the statement, the next one is silver, the third is gold. If you execute those two together you will get an error, and that's because there is no GO in between; so after each command put a GO, and now if I highlight silver and gold and execute, it works. GO in SQL Server is a batch separator: it tells SQL Server to completely execute the first command before going to the next one. Refresh the schemas, and now we can see the gold and the silver as well. With this we have a database, the three layers, and we can start developing each layer individually. Okay, now let's commit our work to Git. Since it is a script, go to the scripts folder, add a new file, call it init_database.sql, and paste our code. I have made a few modifications, for example: before we create the database, we check whether it already exists. This is an important step if you are recreating the database; otherwise you will get an error saying the database already exists.
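The script built up in this section, as described: switch to master, create the database, switch into it, and create the three schemas with GO batch separators between the commands:

```sql
USE master;
GO

CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- one schema per medallion layer; GO separates the batches
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO
```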
So first the script checks whether the database exists, then drops it. I have also added a few comments, like "creating the data warehouse" and "creating the schemas", and now we have a very important step: a header comment at the start of every script. To be honest, three months from now you will not remember all the details of these scripts, and a comment like this is a sticky note for yourself the next time you visit the script. It is just as important for the other developers on the team, because every time you or anyone else opens a script, the first question is: what is the purpose of this script? why are we doing this? As you can see, our header comment says: this script creates a new data warehouse after checking whether it already exists; if the database exists, it drops and recreates it; additionally, it creates three schemas: bronze, silver, gold. That gives clarity about what this script is for and makes everyone's life easier. The second reason the header is very important is that you can add warnings, and especially for this script you should, because if you run it, it is going to destroy the whole database. Imagine someone, say an admin, opens the script and runs it against your database: everything is destroyed and all the data is lost, a disaster if you don't have any backup. So with that we have a nice header comment, plus a few comments in the code, and we are ready to commit. Let's commit it, and now our script is in Git as well; and of course, if you make any modifications, make sure to update the changes in Git too. Okay my friends: with that, we have an empty database and its schemas.
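A sketch of the header comment and the drop-and-recreate guard just described. The existence check and warning are as stated above; the SINGLE_USER step is an extra safety detail I am adding as an assumption (it forces open connections off before the drop):

```sql
/*
=====================================================================
Create Database and Schemas
Purpose: creates the 'DataWarehouse' database after checking whether
it already exists; if it does, it is dropped and recreated, and the
bronze, silver, and gold schemas are then created.
WARNING: running this script DROPS the entire 'DataWarehouse'
database. All data in it will be permanently deleted; make sure you
have backups before running it.
=====================================================================
*/
USE master;
GO

IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    -- kick out open sessions, then drop the database
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO
```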
We are done with this task and, with it, the whole epic: the project initialization is complete. Now we get to the interesting stuff: building the bronze layer, and the first task is to analyze the source systems. So let's go. The big question is how to build the bronze layer. First things first, we analyze: as with developing anything, you don't immediately start writing code. Before coding the bronze layer, we have to understand the source system, so what I usually do is interview the source-system experts and ask them many, many questions to understand the nature of the system I am connecting to the data warehouse. Once we know the source systems, we can start coding, and the main focus here is data ingestion: we have to find a way to load the data from the source into the data warehouse, like building a bridge between the source and our target system. Once the code is ready, the next step is data validation, and here comes the quality control. In the bronze layer it is very important to check data completeness: compare the number of records between the source system and the bronze layer, just to make sure we are not losing any data in between. Another check we will be doing is the schema check, to make sure the data lands in the right position. And finally, we must not forget documentation and committing our work in Git. This is the process we will follow to build the bronze layer. All right my friends, before connecting any source system to our data warehouse, there is one very important step: understanding the sources. The way I usually do it is to set up a meeting with the source-system experts and interview them, asking a lot of questions about the source.
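The completeness and schema checks mentioned above can be sketched like this. The table name is the project's first bronze table; the expected row count would come from the source side, for example the CSV's record count:

```sql
-- Completeness: the bronze row count should match the source extract.
SELECT COUNT(*) AS bronze_row_count
FROM bronze.crm_cust_info;

-- Schema check: verify the columns landed with the expected names/types.
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'bronze'
  AND TABLE_NAME   = 'crm_cust_info';
```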
Gaining this knowledge is very important, because asking the right questions helps you design the correct extraction scripts and avoid a lot of mistakes and challenges. Now I will show you the most common questions I usually ask before connecting anything. We start with the business context and the ownership: I want to understand the story behind the data, who is responsible for it, which IT department, and so on. It is also nice to understand which business process it supports: customer transactions, supply-chain logistics, or maybe finance reporting; with that you understand the importance of your data. Then I ask about system and data documentation: documentation from the source is your learning material about your data, and it saves you a lot of time later when you are working on and designing new data models. I also always want to understand the data model of the source system, and if they have descriptions of the columns and tables, a data catalog, that helps me a lot in the data warehouse when deciding how to join the tables together. With that you get a solid foundation on the business context, the processes, and the ownership of the data. In the next step we start talking about the technicalities: I want to understand the architecture and the technology stack. The first question I usually ask is how the source system stores the data: is it on-prem, like a SQL Server or Oracle, or in the cloud, like Azure or AWS? Once we understand that, we can discuss the integration capabilities: how am I going to get the data? Does the source system offer APIs, maybe Kafka, do they only have file extracts, or are they going
to give you a direct connection to the database? Once you understand the technology you will use to extract the data, we dive into more technical questions: how to extract the data from the source system and load it into the data warehouse.

The first thing to discuss with the experts is whether we can do an incremental load or a full load. After that we discuss the data scope and historization: do we need all the data, or maybe only ten years of it? Is history already kept in the source system, or should we build it in the data warehouse? Then we discuss the expected size of the extracts: are we talking about megabytes, gigabytes, or terabytes? This is very important for judging whether we have the right tools and platform to connect to the source system. I also try to find out whether there are any data volume limitations; old source systems can struggle with performance, so an ETL that extracts a large amount of data might bring the source system's performance down. That's why you have to understand any limitations on your extracts, as well as other aspects that might impact the performance of the source system. This is very important: if they give you access to the database, you are responsible for not bringing its performance down. And of course a very important question is about authentication and authorization: how are you going to access the data in the source system? Do you need tokens, keys, passwords, and so on?

Those are the questions you have to ask when connecting a new source system to the data warehouse, and once you have the answers, you can proceed with the next steps to connect the sources to the data warehouse.
All right my friends, with that you have learned how to analyze a new source system that you want to connect to your data warehouse, so this step is done. Now we go back to coding, where we will write the scripts that ingest the data from the CSV files into the bronze layer.

Let's have a quick look again at our bronze layer specifications: we just load the data from the sources into the data warehouse; we build tables in the bronze layer; we do a full load, which means truncating and then inserting the data; there are no data transformations at all in the bronze layer; and we do not create any data model. Those are the specifications of the bronze layer.

Now, in order to write the DDL script that creates the bronze tables, we have to understand the metadata, the structure, the schema of the incoming data. Either you ask the technical experts of the source system for this information, or you explore the incoming data and define the structure of the tables yourself. We start with the first source system, the CRM, and with its first table, the customer info. If you open the file and check the data inside, you see it has a header row, which is very good: now we have the names of the columns coming from the source, and from the content you can of course infer the data types. So let's do that. First we say CREATE TABLE, then we define the layer, which is bronze. Now, very important, we follow the naming convention: we start with the name of the source system, crm, then an underscore, then the table name from the source system, cust_info. That is the name of our first table in the bronze layer. In the next step we have to define, of
course, the columns, and here the column names in the bronze layer are one-to-one exactly like the source system. The first one is the ID, and I will go with the data type INT; the next one is the key, NVARCHAR, with a length of 50; and the last one is the create date, with the data type DATE. With that we have covered all the columns available from the source system; let's check, and yes, the last one is the create date, so that's it for the first table. A semicolon at the end, of course. Let's execute it, then go to the Object Explorer, refresh, and we can see the first table inside our data warehouse. Amazing, right?

Next, you have to create a DDL statement for each file of the two systems: for the CRM we need three DDLs, and for the other system, the ERP, we also need three DDLs for its three files, so in the end we will have six tables and six DDLs in the bronze layer. Now pause the video and go create those DDLs; I will be doing the same, and we will see you soon.

All right, I hope you have created all those DDLs; let me show you what I created. The second table in the CRM source is the product info, and the third one is the sales details. Then we go to the second system, and here we make sure we follow the naming convention: first the source system, erp, then the table name. The second system was really easy: one table has only two columns, the customers table only three, and the categories table only four. After defining all of this we of course execute it, then go to the Object Explorer and refresh the tables, and you can see six empty tables in the bronze layer. With that we have all the tables from the two source systems inside our database.
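As a minimal sketch of the DDL pattern described above (the column names cst_id, cst_key, and cst_create_date are illustrative stand-ins; take the real names from your CSV header):

```sql
-- Bronze table for the CRM customer info file.
-- Naming convention: <source_system>_<source_table> inside the bronze schema.
-- Column names and types below are illustrative; mirror your source file one-to-one.
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,            -- the ID column from the file
    cst_key         NVARCHAR(50),   -- the key column
    cst_create_date DATE            -- the create date column
);
```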
But still we don't have any data. You can see our naming convention is really nice: the first three tables come from the CRM source system and the other three come from the ERP, so things in the bronze layer are split nicely and you can quickly identify which table belongs to which source system.

Now, there is something else I usually add to the DDL script: a check for whether the table exists before creating it. For example, say you are renaming a column or changing the data type of a specific field; if you just run this query, you will get an error, because the database will say the table already exists. In other databases you can say CREATE OR REPLACE TABLE, but in SQL Server you have to build this with T-SQL logic. It is very simple. First we check whether the object exists in the database: we say IF OBJECT_ID, then specify the table name, so copy the whole name over here and make sure it is exactly the same as the table name (there is a space, which I will remove). Then we define the object type: 'U', which stands for user-defined table. If this is not NULL, it means the database found the object, so we say drop the table, with the full name again and a semicolon. So again: if the table exists in the database, drop it, and after that create it. Now if you highlight the whole thing and execute it, it works: first drop the table if it exists, then create it from scratch. What you have to do now is add this check before creating any table in our database; it's the same for the next table and so on. I went ahead and added those checks for each table, and if I execute the whole thing, it works.
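The drop-and-recreate guard described above can be sketched like this (using the same illustrative table and column names as before):

```sql
-- SQL Server has no CREATE OR REPLACE TABLE, so check the catalog first.
-- 'U' restricts the lookup to user-defined tables.
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;

CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,
    cst_key         NVARCHAR(50),
    cst_create_date DATE
);
```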
With that I am recreating all the tables in the bronze layer from scratch. Now, the method we are going to use to load the data from the source into the data warehouse is BULK INSERT. BULK INSERT is a method for loading massive amounts of data very quickly from files, like CSV or text files, directly into a database. It is not like the classic INSERT, which writes the data row by row; instead, BULK INSERT is a single operation that loads all the data into the database in one go, and that is what makes it so fast. So let's use this method.

Okay, now let's write the script that loads the first table of the CRM source: we are going to load the customer info table from the CSV file into the database table. The syntax is very simple. We start by saying BULK INSERT; with that, SQL understands we are not doing a normal insert but a bulk insert. Then we specify the table name: bronze.crm_cust_info. Next we have to specify the full location of the file we are loading into this table, so we go and get the path where the file is stored. I am going to copy the whole path and add it to the BULK INSERT, exactly where the data exists; for me it is in the SQL data warehouse project, in the datasets folder, under the source CRM, and then I have to specify the file name, cust_info.
csv. You have to get the path of your file exactly right, otherwise it will not work. After the path we come to the WITH clause, where we tell SQL Server how to handle our file; here come the specifications, and there is a lot we can define.

Let's start with a very important one, the header row. If you check the content of our files, you can see the first row always contains the header information. That row is not data, it is just the column names; the actual data starts from the second row, and we have to tell the database about this. So we say the first row is actually the second row: FIRSTROW = 2. With that we are telling SQL to skip the first row in the file; we don't need to load it, because we have already defined the structure of our table. That is the first specification. The next one, which is also very important when loading any CSV file, is the separator between fields, the delimiter. It really depends on the file structure you get from the source; as you can see, all the values here are split by a comma, and we call this comma the field separator or delimiter. I have seen a lot of different CSVs: sometimes they use a semicolon, a pipe, or a special character like a hash. You have to understand how the values are split, and in this file it is the comma, so we tell SQL about it: we say FIELDTERMINATOR, and it is the comma. Those two pieces of information are essential for SQL to be able to read your CSV file. There are many other options you can add, for example TABLOCK, an option that improves performance by locking the entire table during the load: as SQL loads the data, it locks the whole table instead of taking many smaller locks. That's it for now.
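Putting the pieces together, the BULK INSERT described above looks roughly like this (the file path is an assumption; substitute wherever your dataset folder actually lives):

```sql
-- Load the whole CSV into the bronze table in one operation.
BULK INSERT bronze.crm_cust_info
FROM 'C:\datasets\source_crm\cust_info.csv'  -- assumed path; adjust to your machine
WITH (
    FIRSTROW = 2,           -- skip the header row; data starts on row 2
    FIELDTERMINATOR = ',',  -- fields in this file are comma-separated
    TABLOCK                 -- lock the whole table during the load for speed
);
```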
I am just going to add the semicolon, and then let's insert the data from the file into our bronze table. Execute it, and you can see SQL inserted around 18,000 rows into our table, so it is working: we have just loaded the file into our database.

But it is not enough to just write the script; you have to test the quality of your bronze table, especially when you are working with files. So let's do a simple SELECT from our new table and run it. The first thing I check is whether we have data in each column, and yes, as you can see, we do. The second thing is whether the data is in the correct column, and this is very critical when loading data from a file into a database. For example, here we have the first name, which of course makes sense, and here the last name. But what could happen, and this mistake happens a lot, is that you find the first name information inside the key, the last name inside the first name, and the status inside the last name: the data is shifted. This data engineering mistake is very common when working with CSV files, and there are different reasons for it: maybe the definition of your table is wrong, or the field separator is wrong (maybe it is not a comma but something else), or the separator is a bad choice, because sometimes there is a comma inside the keys or the first names and SQL cannot split the data correctly, so the quality of the CSV file itself is not good. There are many reasons why the data might not land in the correct column, but for now everything looks fine for us. The next step is to count the rows in the table: we can see we have 18,490. Now we can go to our CSV file and check how many rows it has, and we are almost there; there is
one extra row in the file, and that is because of the header: the header row is not loaded into our table, which is why our tables will always have one row less than the original files. So everything looks good and we have done this step correctly.

Now, if I run the script again, what happens? We get duplicates in the bronze layer: we have loaded the file twice into the same table, which is not correct. The method we discussed is to first make the table empty and then load: truncate, then insert. To do that, before the BULK INSERT we say TRUNCATE TABLE, then our table name, and that's it, with a semicolon. Now we first empty the table and then load the whole content of the file into it from scratch, and this is what we call a full load. Let's highlight everything together and execute, and if you check the content of the table again, you see we have only the 18,000 rows; run the count again, and we still have 18,000. Each time you run this script now, we are refreshing the bronze table customer info from the file, so if there are any changes in the file, they will be loaded into the table. This is how you do a full load in the bronze layer: truncate the table, then do the insert.

Now, of course, pause the video and write the same script for all six files; let's go and do that. Okay, I am back, and I hope you have written all those scripts too. I have three sections to load the first source system and three sections to load the second one. As you write those scripts, make sure to use the correct paths; for the second source system you have to change the path to the other folder.
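The full-load pattern plus the quality checks discussed above can be sketched as follows (same assumed path as before):

```sql
-- Full load: empty the table first so re-running refreshes instead of duplicating.
TRUNCATE TABLE bronze.crm_cust_info;

BULK INSERT bronze.crm_cust_info
FROM 'C:\datasets\source_crm\cust_info.csv'  -- assumed path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

-- Quick checks after loading:
SELECT TOP 10 * FROM bronze.crm_cust_info;   -- is the data in the right columns?
SELECT COUNT(*) FROM bronze.crm_cust_info;   -- count should be file rows minus header
```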
Also don't forget that the table names in the bronze layer are different from the file names, because we always start with the source system name, which the files don't have. Now I think everything is ready, so let's execute the whole thing. Perfect, awesome, everything is working. Let me check the messages: we can see how many rows were inserted into each table, and now of course the task is to go through each table and check its content.

So we now have a really nice script to load the bronze layer, and we will use it on a daily basis: every day we run it to get fresh content into the data warehouse. And as you learned before, if you have a SQL script that is frequently used, we can turn it into a stored procedure. Let's do that; it is very simple. We go over here and say CREATE OR ALTER PROCEDURE, and then we define the name of the stored procedure. I am going to put it in the bronze schema, because it belongs to the bronze layer, and then we follow the naming convention: the stored procedure starts with load_ and then the layer name. That's it for the name. Then, very importantly, we define the BEGIN and the END around our SQL statements: here is the beginning, and at the very end we say END. Then highlight everything in between and give it one push with Tab, so it is easier to read. Next we execute this to create the stored procedure, and if you want to check it, you go to the database, then to the folder called Programmability, and inside it, Stored Procedures; refresh, and you will see our new stored procedure. Let's test it: I am going to open a new query, and what we're going to do
is say EXEC bronze.load_bronze. Execute it, and with that we have loaded the complete bronze layer: as you can see, SQL inserted all the data from the files into the bronze layer. It is way easier than running those scripts by hand each time.

All right, the next step. As you can see, the output does not carry a lot of information; the messages of an ETL stored procedure are not clear by default, so when you write an ETL script, always take care of the messaging of your code. Let me show you a nice design. Back in our stored procedure, we can divide the message output to match the sections of our code. We start with a message at the top: PRINT, saying what this stored procedure does, 'Loading Bronze Layer'. This is the main message, the most important one, and we can play with separators: PRINT a row of equals signs at the start and at the end, just to create a section. That is a nice banner at the start. Now, looking at our code, we see it is split into two sections: the first loads all the tables from the CRM source system, and the second loads the tables from the ERP. So we can split the prints by source system: we say PRINT 'Loading CRM Tables' for the first section, then add some separators, this time minus signs, and of course don't forget the semicolons, one for each PRINT. Same thing over here; I will copy the whole thing, because we need it at the start and at the end. Let's copy it for the second section: the ERP starts over here, and we're going
to call it 'Loading ERP Tables'. With that, the output shows a nice separation between loading each source system. Next, we add a PRINT for each action. For example, here we are truncating the table, so we say PRINT, add two arrows ('>>'), and state what we are doing, 'Truncating Table', followed by the table name. That is the first action, and we add another PRINT for inserting the data: 'Inserting Data Into' plus the table name. With that, we can follow in the output what SQL is doing. Let's repeat this for all the other tables.

Okay, I have added all those prints (don't forget the semicolons at the end), so let's execute it and check the output; at the start, just to get quick output, execute our stored procedure. Now, if you check the output, things are much more organized than before: at the start we read that we are loading the bronze layer, then the first section loads the CRM source system and the second section the ERP, and we can see the actions, truncating and inserting for each table, the same for the second source. It is nice cosmetics, but it becomes very important when you are debugging errors.

And speaking of errors, we have to handle errors in our stored procedure, so let's do that. First we say BEGIN TRY, then we go to the end of our script and, before the last END, we say END TRY. Next we add the catch: BEGIN CATCH and END CATCH. Then let's organize the code: I take the whole body and give it one more push with Tab, along with the BEGIN TRY, so it is more organized.
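A condensed sketch of the stored procedure with the print sections described above (only one table shown; the other five follow the same truncate/insert pair, and the separators are purely cosmetic):

```sql
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    PRINT '==============================================';
    PRINT 'Loading Bronze Layer';
    PRINT '==============================================';

    PRINT '----------------------------------------------';
    PRINT 'Loading CRM Tables';
    PRINT '----------------------------------------------';

    PRINT '>> Truncating Table: bronze.crm_cust_info';
    TRUNCATE TABLE bronze.crm_cust_info;
    PRINT '>> Inserting Data Into: bronze.crm_cust_info';
    BULK INSERT bronze.crm_cust_info
    FROM 'C:\datasets\source_crm\cust_info.csv'  -- assumed path
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

    -- ...repeat the pair for the remaining CRM tables,
    -- then a 'Loading ERP Tables' section for the ERP loads.
END
```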
As you know, TRY...CATCH executes the TRY block, and if any error occurs while running that script, the CATCH section is executed; the CATCH runs only if SQL failed to run the TRY. So now we have to define for SQL what to do if there is an error in the code. Here we can do multiple things, like creating a logging table and writing the messages into it, or adding some clear messaging to the output. For example, we can add a banner again, equals signs at the top and bottom, with some content in between. We can start with something like 'Error Occurred During Loading Bronze Layer', and then add more details: the error message, by calling the function ERROR_MESSAGE(), and also the error number, ERROR_NUMBER(). The output of that function is a number, while our message is text, so we have to convert the data type with a CAST AS NVARCHAR. There are many more functions you can add to the output, like ERROR_STATE() and so on; you can design exactly what happens when there is an error in the ETL.

Now, something else that is very important in any ETL process is recording the duration of each step. For example, I would like to understand how long it takes to load this table over here, but looking at the output I have no information about how long my tables take to load. This matters because, as you build a big data warehouse, the ETL process will take a long time, and you want to know where the issue is, where the bottleneck is, which table consumes the most load time. That is why we have to add this information to the output, or even log it in a table. So let's add this step as well.
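The error-handling shape described above, as a sketch (the messages are illustrative; ERROR_MESSAGE, ERROR_NUMBER, and ERROR_STATE are built-in SQL Server functions):

```sql
BEGIN TRY
    -- ...all the truncate + bulk insert statements go here...
    PRINT 'Bronze layer loaded';
END TRY
BEGIN CATCH
    PRINT '==========================================';
    PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    PRINT 'Error Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
    PRINT 'Error State  : ' + CAST(ERROR_STATE()  AS NVARCHAR);
    PRINT '==========================================';
END CATCH
```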
We go to the start, and in order to calculate the duration we need the start time and the end time: we have to know when we started loading the table and when we finished. The first thing is to declare the variables: we say DECLARE, then one variable called @start_time with the data type DATETIME, because I need the exact second when it started, and another one, @end_time, also DATETIME. With that we have declared the variables, and the next step is to use them. Let's go to the first table, the customer info, and at the start say SET @start_time = GETDATE(), so we capture the exact time when we begin loading this table. Then copy the whole thing, go to the end of the load, and say SET @end_time = GETDATE(). Now we have the values of when we started and when we completed loading the table, and the next step is to print the duration. Over here we say PRINT, with the same design as before, two arrows, and we say very simply 'Load Duration: ' with a space. Then we have to calculate the duration, and we can do that with the date and time function DATEDIFF, which finds the interval between two dates. So we say plus over here, then use DATEDIFF with three arguments: the first is the unit (you can use second, minute, hour, and so on; we go with second), the second argument is the start of the interval, @start_time, and the last argument is the end of the interval, @end_time. Of course the output of this is a number, so we have to cast it: we say CAST
AS NVARCHAR, close it, and maybe at the end add plus ' seconds' to make a nice message. So, again, what have we done? We declared two variables; at the start of loading the table we capture the current date and time, at the end we capture it again, and then we take the difference between them to get the load duration, which in this case we simply print. We can of course add a small separator between each table, just a few minus signs, nothing much. Now we have to add this mechanism to each table in order to measure the speed of the ETL for each one of them.

Okay, I have added those measurements for each table, so let's run the whole thing: first alter the stored procedure, then execute it. Now, as you can see, we have one more piece of information, the load duration, and everywhere I see 0 seconds. That is because loading this data is super fast: we are doing everything locally on one PC, so loading from files into the database is mega fast. In real projects, of course, you have different servers with networking between them and millions of rows in the tables, so the durations will not be 0 seconds; things will be slower, and then you can easily see how long each of your tables takes to load.

What is also very interesting is to understand how long it takes to load the whole bronze layer, so now your task is to also print, at the end, information about the whole batch: how long it took to load the bronze layer. Okay, I hope you are done. I did it like this: we define two new variables, the start time of the batch and the end time of the batch.
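Per-table timing, as described above, boils down to this pattern (batch-level timing works the same way, with a second pair of variables wrapped around the whole procedure body):

```sql
DECLARE @start_time DATETIME, @end_time DATETIME;

SET @start_time = GETDATE();        -- capture the moment the load begins
-- ...truncate + bulk insert for one table...
SET @end_time = GETDATE();          -- capture the moment the load finishes

PRINT '>> Load Duration: '
    + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR)
    + ' seconds';
```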
The first step in the stored procedure is to capture the date and time in the batch start variable, and the very last thing we do in the stored procedure is capture the date and time for the batch end: we say SET with GETDATE() again for the batch end time. Then all you have to do is print a message: 'Loading Bronze Layer is Completed', followed by the total load duration, again using DATEDIFF between the batch start time and end time, calculated in seconds, and so on. Now execute the whole thing: refresh the definition of the stored procedure, then run it. In the output, go to the last message, and we can see that loading the bronze layer is completed and the total load duration is also 0 seconds, because the execution time is less than one second.

With that, you are getting a feeling for how to build an ETL process. As you can see, data engineering is not only about how to load the data; it is about engineering the whole pipeline: measuring the speed of loading, handling what happens if there is an error, printing each step of your ETL process, and making everything organized and clear in the output, and maybe in logging, just to make debugging and performance optimization way easier. There is a lot more we could add, like quality measures and other things, to make our data warehouse professional.

All right my friends, with that we have developed and tested the code that loads the bronze layer, and in the next step we go back to the drawing tool, because we want to draw a diagram of the data flow. So let's go. Now, what is a data flow diagram? We draw a simple visual that maps the flow of our data: where it comes from and where it ends up. We just want to
make clear how the data flows through the different layers of the project, and that helps us create something called data lineage. This is really valuable, especially when you are analyzing an issue: if you have multiple layers and no real data lineage or flow diagram, it is going to be really hard to dig through the scripts to understand the origin of the data, and having this diagram makes finding issues much easier. So let's create one.

Okay, back in the drawing tool, we build the flow diagram. We start with the source systems: I build the layer, remove the fill, make it dotted, then add a box saying 'Sources', put it over here, increase the font size to 24, again without any lines. What do we have inside the sources? Folders and files, so let's search for a folder icon; I take this one and label it CRM, increase the size, and we have another source, the ERP. That is the first layer. Now the bronze layer: grab another box, set the coloring like this, and instead of auto maybe take the hatch style, something like this, whatever you like, rounded, and put a title on top, 'Bronze Layer', with a bigger font. Next we add a box for each table we have in the bronze layer: for example the sales details, a little smaller, maybe 16 and not bold, and the other two tables from the CRM, the customer info and the product info. Those are the three tables coming from the CRM. Now we connect the CRM source with all three tables: we go to
the folder and draw arrows from the folder to the bronze layer, like this, and then we do the same for the ERP source. As you can see, the data flow diagram shows us the data lineage between the two layers in one picture: we can easily see that these three tables come from the CRM and those three tables in the bronze layer come from the ERP. I understand that with a lot of tables this would become a huge mess, but for a small or medium data warehouse, building these diagrams makes it really easy to understand how everything flows from the sources into the different layers of your warehouse. All right, with that we have the first version of the data flow, so this step is done, and the final step is to commit our code to the Git repo.

Okay, let's commit our work. Since these are scripts, we go to the scripts folder; we will have scripts for bronze, silver, and gold, so it makes sense to create a folder for each layer. Let's start with the bronze folder: I create a new file and type bronze/ followed by the name of the DDL script for the bronze layer, with the .sql extension. Now I paste the DDL code that we created, those six tables, and as usual, at the start we have a comment explaining the purpose of the script: it says that this script creates tables in the bronze schema, and that running it redefines the DDL structure of the bronze tables. Let's leave it like that, and I commit the changes. All right, as you can see, inside scripts we now have a folder called bronze, and inside it the DDL script for the bronze layer. In the bronze folder we also put our stored procedure, so we create a new file called proc_load_bronze.
sql, then paste in our script. As usual, I've put an explanation of the stored procedure at the top: it loads the data from the CSV files into the bronze schema, truncating the tables first and then doing a bulk insert; it does not accept any parameters or return any values, and there is a quick example of how to execute it. I'm happy with that, so let's commit it. All right my friends, with that we have committed our code to GitHub, and we are done building the bronze layer; the whole layer is done. Now we move to the next one, which is going to be more advanced than the bronze layer, because there will be a lot of work cleaning the data. We start with the first task, where we analyze and explore the data in the source systems. So let's go. Okay, now we start with the big question: how do we build the silver layer, what is the process? As usual, first things first, we analyze. Before building anything in the silver layer, we have to explore the data in order to understand the content of our sources. Once we have that, we start coding, and the transformation we do here is data cleansing, a process that usually takes a really long time. I usually do it in three steps: first, check the data quality issues in the bronze layer, because before writing any transformations we have to understand what the issues are; only then do I start writing the data transformations that fix those quality issues; and finally, once I have clean results, I insert them into the silver layer. Those are the three phases we will repeat as we write the code for the silver layer.
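The truncate-then-bulk-insert pattern described above can be sketched as a small T-SQL procedure. This is a minimal illustration, not the video's exact script: the schema, table, and file names are assumptions chosen for the example.

```sql
-- Sketch of a bronze load procedure (T-SQL).
-- Table names and file paths are illustrative.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    -- Full load: empty the table first, then bulk insert from the CSV
    TRUNCATE TABLE bronze.crm_cust_info;

    BULK INSERT bronze.crm_cust_info
    FROM 'C:\datasets\source_crm\cust_info.csv'
    WITH (
        FIRSTROW = 2,           -- skip the CSV header row
        FIELDTERMINATOR = ',',  -- comma-separated values
        TABLOCK                 -- lock the table for a faster load
    );

    -- ...repeat TRUNCATE + BULK INSERT for the remaining five tables
END;
```

As the transcript notes, the procedure takes no parameters and returns nothing; you run it with `EXEC bronze.load_bronze;`.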
The next step, once all the data is in the silver layer, is to make sure the data is now correct and that we don't have any quality issues anymore. If you do find issues, you go back to coding, do the data cleansing, and check again; it's a cycle between validating and coding. Once the quality of the silver layer is good, we cannot skip the last phase, where we document and commit our work to Git. Here we will produce two new documents: the data flow diagram and the data integration diagram, built after we've understood the relationships between the sources in the first step. That is the process, and that is how we will build the silver layer. All right, now on to exploring the data in the bronze layer. Why is this so important? Because understanding the data is the key to making smart decisions in the silver layer. In the bronze layer, understanding the content of the data was not the focus at all; we focused only on getting the data into the data warehouse. That's why we now take a moment to explore and understand the tables, how to connect them, and what the relationships between them are. It is also very important, as you learn about a new source system, to create some kind of documentation. So let's go and explore the sources one by one. We start with the first one from the CRM, the customer info: right-click on it and choose "Select Top 1000 Rows". This is important if you have a lot of data: don't explore millions of rows, always limit your queries, for example with TOP 1000, to make sure you are not impacting the system. Now let's look at the content of this table. We can see customer information: an ID, a key for the customer, a first
name, last name, marital status, gender, and the creation date of the customer. So this is simply a table of customer information with a lot of detail about each customer. We have two identifiers here: one is a technical ID and the other is the customer number, so we could use either the ID or the key to join it with other tables. What I usually do now is draw a data model, or let's say an integration model, to document and visualize what I am learning, because if you don't, you will forget it after a while. So in draw.io, search for a "table" shape and pick this one. You can change the style, for example making it rounded or sketch-like, and change the color; I'll make it blue. Then go to the text, select the whole thing, and make it bigger, 26, and for the rows, select them, go to Arrange, and set the height to maybe 40, something like this. Now we put in the table name, since this is the one we are learning about, and I'll add the primary key, the ID, and remove all the other rows; I won't list all the columns. As you can see, the table name is not really friendly, so I'll bring in a text label and put it on top saying "Customer Information", just to keep it friendly and not forget what it is, increasing the size to maybe 20. With that we have our first table, and we keep exploring. Let's move to the second one, the product information: right-click on it, select the top 1000 rows, and put the query below the previous one. Looking at this table, we can see we have
product information: a primary key for the product, then a key, or let's say the product number, then the full name of the product, the product cost, the product line, and then a start and end date. Now this is interesting; we should understand why we have start and end dates. Look at these three rows, for example: all three have the same key but different IDs, so it is the same product but with different costs. For 2011 the cost is 12, for 2012 it is 14, and for the last year, 2013, it is 13. So we have a history of the changes: this table holds not only the current information about a product but also its historical information, and that's why we have those two dates, start and end. Let's go back and capture this in the diagram. I'll duplicate the table shape, name this table prd_info, and give it a short description, "current and historical product information", something like that, just so we don't forget that this table contains history. Here we also have the prd_id, but notice there is nothing we can use to join these two tables: there is no customer ID here, and no product ID in the other table. Okay, that's it for this table; let's jump to the third and last table in the CRM. I've shortened the other queries as well, so let's execute. What do we have here? A lot of information about the orders and sales, and a lot of measures: the order number, the product key (this is something we can use to join with the product table), and the customer ID, not the customer key, so here we have an ID and there we have a key, two different ways of joining tables. Then we have dates, the order date, the shipping date, and the due date, and then the sales amount,
the quantity, and the price. So this is an event table, a transactional table about the orders and sales, and it is a great table for connecting the customers with the products and the orders. Let's document this new information. The table name is sales details, and we can describe it as "transactional records about sales and orders". Now we describe how to connect this table to the other two. We are not using the product ID, we are using the product key, and we need a new column in the shape, so hold Ctrl+Enter, or add a new row here, for the customer ID. For the customer ID it is easy: we grab an arrow and connect the two tables. For the product key we are not joining on the ID, so I'll remove that row and write "product key" instead; let's double-check: this is a product key, not a product ID, and if you look at the products info table, you can see we join on the key, not on the primary key. So we just link them like this, and maybe swap the two tables so the customer sits below; perfect, it looks nice. Okay, let's keep moving to the other source system, the ERP. The first table is the customer table with its cryptic name, so let's select the data. It is a small table with only three pieces of information: something called CID, then what I think is the birthday, and the gender information, with male, female, and so on. So it again looks like customer information, but here we have extra data about the birthday. If you compare it to the customer table from the other source system (let's query it), you can see the new table from the ERP doesn't have IDs; it actually has the customer number, the key, so
we can join those two tables using the customer key. Let's document this: I'll copy-paste a table shape and put it on the right side, and change the color, since we are now talking about a different source system. The table name goes here, and the key is called CID. Now, to join this table with the customer info, we cannot use the customer ID; we need the customer key, so we add a new row (Ctrl+Enter) for the customer key, and then draw a nice arrow between those two keys. We give it the description "customer information", and note that it also contains the birth date. Okay, let's keep going to the next one, the ERP location table, and query it. What do we have here? The CID again, and country information; the CID is of course again the customer number, and the country is the only other column. Let's document this as well: this is the customer location, the table name goes like this, and we still have the same identifier, so we can join it using the customer key. We give it the description "location of customers", and note the country column. Now let's explore the last table, the ERP PX catalog. Querying it, we have an ID, a category, a subcategory, and a maintenance flag that is either yes or no. So looking at this table, it holds all the categories and subcategories of the products, with a special identifier for them. The question now is how to join it; I would like to join it with the product information, so let's check those two tables together. In the products table we don't have any ID for the categories, but that information is actually hidden elsewhere.
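The joins across the two source systems described above can be sketched in one query. The column and table names here are assumptions for illustration; the transcript only describes them loosely (a customer key on the CRM side and a CID on the ERP side).

```sql
-- Sketch: enriching CRM customers with ERP birthday and country data.
-- Table and column names are illustrative, not the video's exact ones.
SELECT
    c.cst_key,        -- customer number shared by both systems
    c.cst_firstname,
    e.bdate,          -- birthday from the ERP customer table
    l.cntry           -- country from the ERP location table
FROM silver.crm_cust_info AS c
LEFT JOIN silver.erp_cust AS e
    ON c.cst_key = e.cid      -- join on the customer key, not the ID
LEFT JOIN silver.erp_loc AS l
    ON c.cst_key = l.cid;
```

The key point from the exploration is that the technical ID in the CRM table is useless for these joins; only the shared customer number connects the two systems.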
The first five characters of the product key are actually the category ID, so we can use that information to join with the categories. We describe this in the diagram, give the table a name, and note that its ID can be joined via the product key. That means for the product information we don't need the product ID, the primary key, at all; all we need is the product key, the product number. What I'd like to do now is group these tables into boxes. Grab a box on the left side, make it bigger, make the corner radius a little smaller, remove the fill, and make the line dotted. Then grab another box, label it "CRM", increase the size, maybe 35, bold, change the color to blue, and place it on top of the box. With that we can see that all those tables belong to the CRM source system, and we do the same for the right side. Of course, we also have to add the description here: product categories. All right, with that we now have a clear understanding of how the tables connect to each other and what each table contains, and this will help us clean up the data in the silver layer in order to prepare it. As you can see, it is very important to take the time to understand the structure of the tables and the relationships between them before writing any code. So with that we have a clear understanding of the sources, and we have also created a data integration diagram in draw.io, giving us a better picture of how to connect the sources. In the next two tasks we go back to SQL, where we start checking the quality and doing a lot of data transformations. Let's go. Okay, now let's have a
quick look at the specifications of the silver layer. The main objective is to have clean and standardized data; we have to prepare the data before it goes to the gold layer. We will be building tables inside the silver layer, and the way we load data from bronze to silver is a full load, which means we truncate and then insert. Here we will do a lot of data transformations: we will clean the data, apply normalizations and standardizations, derive new columns, and do data enrichment. So a lot happens in the data transformation, but we will not be building any new data model. Those are the specifications, and we have to commit ourselves to this scope. Okay, now, building the DDL script for this layer is going to be much easier than for the bronze, because the definition and structure of each table in the silver layer is identical to the bronze layer; we are not doing anything new. All you have to do is take the DDL script from the bronze layer and search-and-replace the schema. I'm using Notepad++ for the scripts, so I go to Replace, replace "bronze." with "silver.", and replace all. With that, the whole DDL now targets the silver schema, which is exactly what we need. All right, but before we execute our new DDL script for the silver layer, we have to talk about something called metadata columns. These are additional columns or fields that data engineers add to each table; they don't come directly from the source systems, but data engineers use them to provide extra information about each record. For example, we can add a create date column recording when the record was loaded, or an update date recording when the record was updated, or the source system, to understand the origin of the data, or sometimes the file location, to understand the lineage, i.e. from
which file the data came. These are a great tool when you have a data issue in your data warehouse: if there is corrupt data and so on, they help you track exactly where and when the issue happened. They are also great for understanding whether you have gaps in your data, especially if you are doing incremental loads. It's like putting labels on everything, and you will thank yourself later when you need them in hard times, when you have an issue in your data warehouse. So, back to our DDL scripts. For the first table, I'll add one extra column at the end. It starts with the prefix "dwh", as we defined in the naming convention, then an underscore, so let's have dwh_create_date, and the data type is going to be DATETIME2. Now we add a default value for it, because I want the database to generate this information automatically; we don't have to specify it in any ETL script. Which value? GETDATE(), so each record inserted into this table automatically gets the current date and time. As you can see, the naming convention is very important: all the other columns come from the source system, and only this one column comes from the data engineer of the data warehouse. Okay, that's it; let's repeat the same thing for all the other tables, adding this piece to each DDL. All right, that's done, so now we execute the whole DDL script for the silver layer. Perfect, no errors. Refresh the tables in Object Explorer, and as you can see, we have six tables in the silver layer, identical to the bronze layer but with one extra metadata column. All right, now, in the silver layer, before we start writing any data transformations and cleansing, we first have to detect the quality issues.
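A silver DDL with the metadata column described above can look like this. The table and column names are illustrative; only the `dwh_` prefix, the DATETIME2 type, and the GETDATE() default are stated in the transcript.

```sql
-- Sketch of a silver-layer table DDL with a metadata column.
-- Source column names are illustrative.
CREATE TABLE silver.crm_cust_info (
    cst_id             INT,
    cst_key            NVARCHAR(50),
    cst_firstname      NVARCHAR(50),
    cst_lastname       NVARCHAR(50),
    cst_marital_status NVARCHAR(50),
    cst_gndr           NVARCHAR(50),
    cst_create_date    DATE,
    -- Metadata column: added by the data engineer, not the source
    -- system; the database fills it automatically on insert.
    dwh_create_date    DATETIME2 DEFAULT GETDATE()
);
```

Because of the DEFAULT, the load scripts never mention `dwh_create_date`; every inserted row gets the load timestamp for free.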
Without knowing the issues in the bronze layer, we cannot find solutions, right? We explore the quality issues first, and only then do we start writing the transformation scripts. So let's go. Okay, what we're going to do is walk through all the tables in the bronze layer, clean up the data, and then insert it into the silver layer. Let's start with the first bronze table from the CRM source, the bronze CRM customer info, and query the data. Of course, before writing any transformations, we have to detect and identify the quality issues in this table. I usually start with the first check, on the primary key: we check whether there are nulls in the primary key and whether there are duplicates. To detect duplicates in the primary key, we aggregate on it: if any value in the primary key exists more than once, it is not unique and we have duplicates in the table. So let's write a query for that: we select the customer ID and a count, then group the data by the primary key, and since we don't need all the results, only the problems, we add HAVING COUNT(*) > 1, because we are interested only in the values where the count is higher than one. Execute it, and as you can see, we have an issue in this table: we have duplicates, because all those IDs exist more than once, which is completely wrong; the primary key should be unique. You can also see three records where the primary key is empty, which is also bad. Now, there is a subtlety here: if we had only a single null, it would not appear in this result, so I'll add "OR the primary key IS NULL",
just in case there is only one null; I am still interested in seeing it in the results. Running it again gives the same results. So this is a quality check you can run on the table, and as you can see, it does not meet expectations, which means we have to do something about it. Let's create a new query; here we will start writing the query that does the data transformation and cleansing. We start again by selecting the data and executing it. Now, what I usually do is focus on the issue: let's take one of those duplicate values and zoom in on it before writing the transformation, so WHERE the customer ID equals that value. All right, now you can see the issue: the ID exists three times, but we are actually only interested in one of them. So how do we pick one? Usually we look for a timestamp or date value to help us. If you check the creation date, you can see that this record is the newest and the previous two are older, which means that if I have to pick one, I want the latest, because it holds the freshest information. So we have to rank those rows by the create date and pick only the highest one. That means we need a ranking function, and for that, SQL has the amazing window functions. We'll use ROW_NUMBER() OVER, then PARTITION BY to divide the table by the customer ID, and to rank the rows we have to sort the data by something, so ORDER BY the creation date, descending, so the newest comes first, and then we give the result a name.
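The primary-key check just described, including the null-handling subtlety, looks like this as a query (the table and column names are illustrative):

```sql
-- Quality check: duplicates or nulls in the primary key.
-- Expectation: no rows returned.
SELECT
    cst_id,
    COUNT(*) AS cnt
FROM bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1      -- value appears more than once: duplicate
    OR cst_id IS NULL;   -- catch a lone NULL, which HAVING > 1 would miss
```

The `OR cst_id IS NULL` matters because `GROUP BY` folds all nulls into one group; a single null record would have a count of 1 and slip past the `HAVING COUNT(*) > 1` filter alone.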
We call it flag_last, and execute. Now the data is sorted by the creation date, and you can see that this record is number one, the older one is two, and the oldest is three; of course, we are interested in rank number one. Let's remove the filter and check everything. Looking at the table, the flag is one everywhere, and that's because those primary keys exist only once; but where there are duplicates, we will get two, three, and so on. We can double-check, of course: wrap it in SELECT * FROM this query WHERE flag_last is not equal to one. Query it, and now we see all the rows we don't need, because they cause duplicates in the primary key and hold an old status. So we change it to "equal to one", and with that we guarantee that our primary key is unique and each value exists only once. If I query it like this, we won't find any duplicates in the table, and we can verify that: check that one primary key by adding "AND customer ID equals that value", and you can see it now exists only once, and we are getting the freshest data for this key. So with that we have defined a transformation to remove any duplicates. Okay, moving on to the next check: as you can see, our table has a lot of string values, and for string values we have to check for unwanted spaces. Let's write a query to detect them: select the first name from bronze customer info and query it. Just by looking at the data, it is really hard to spot unwanted spaces, especially if they are at the end of the word, but there is a very easy way to detect them.
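The deduplication pattern above, picking only the freshest record per key, can be sketched like this (names are illustrative):

```sql
-- Keep only the latest record per customer ID, ranked by create date.
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY cst_id              -- one ranking per customer
            ORDER BY cst_create_date DESC    -- newest record gets rank 1
        ) AS flag_last
    FROM bronze.crm_cust_info
) AS ranked
WHERE flag_last = 1;   -- discard the older duplicate rows
```

Flipping the filter to `WHERE flag_last != 1` gives the double-check described in the transcript: it lists exactly the stale rows that would have caused duplicates.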
We apply a filter: the first name is not equal to the first name after trimming. The TRIM function removes all leading and trailing spaces, so if a value is not equal to itself after trimming, we have an issue. It's very simple; execute it, and the result is the list of all first names with spaces at the start or end. Again, the expectation here is no results. We can check other columns the same way, for example the last name: run it, and we see in the results that some customers have spaces in their last name, which is not good. You can keep checking all the string columns in the table; for the gender, for example, there are no results, meaning its quality is better and it has no unwanted spaces. So now we have to write a transformation to clean up those two columns. I'll list all the columns in the query instead of using the star, and then for those two columns we remove the unwanted spaces: just use TRIM, it's very simple, and give each the same name as an alias; we trim the first name and the last name. Query it, and with that we have cleaned those two columns of any unwanted spaces. Okay, moving on: we have the marital status and the gender. If you check the values in these columns, you can see low cardinality, meaning a limited number of possible values used in these two columns.
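The unwanted-spaces check reads as follows in SQL; the expectation, as the transcript says, is an empty result (column names are illustrative):

```sql
-- Quality check: leading/trailing spaces in string columns.
-- Expectation: no rows returned.
SELECT cst_firstname
FROM bronze.crm_cust_info
WHERE cst_firstname != TRIM(cst_firstname);
-- Repeat with cst_lastname, cst_gndr, and any other string column.
```

The trick is self-comparison: TRIM only changes a value that actually has leading or trailing spaces, so the inequality isolates exactly the dirty rows.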
What we usually do is check the data consistency of such columns, and it's very simple: SELECT DISTINCT on the values. Doing that, we see only three possible values: null, F, or M. We could leave it like this, of course, but we can make a rule in our project: we will not work with abbreviations, we will use only friendly, full names. So instead of an F we will have the full word "Female", and instead of M we will have "Male", and we make that a rule for the whole project: every time we find gender information, we spell out the full name. Let's map those two values to friendly ones: on the gender column, CASE WHEN the gender equals 'F' THEN 'Female', WHEN it equals 'M' THEN 'Male'. Now we have to make a decision about the nulls, because as you can see, we have nulls: do we leave them as null, or do we always use a default value? Replacing missing values with a standard default is one option; you could also leave them as null, but let's say in our project we replace all missing values with a default. So we add ELSE 'n/a', for "not available" (you could use "unknown", of course). That's the gender handled, and we remove the old column. One more thing I usually do in cases like this: right now we get a capital F and a capital M, but over time something might change and you might get a lowercase m or f. To make sure we can still map those values correctly, we wrap the column in the UPPER function,
so that if we get any lowercase values, we still catch them; same thing here for the other mapping. And one more thing you can add: if you don't trust the data (and we did see unwanted spaces in the first and last names), you might not trust that future values will be clean either, so you can TRIM everything inside the CASE as well, to make sure you catch all those cases. That's it for now; let's execute. As you can see, we no longer have an M and an F, we have the full words "Male" and "Female", and where there is no value, instead of a null we get "n/a". Now we do the same for the marital status. It also has only three possibilities: S, null, and M. So I'll copy everything from here and apply it to the marital status, removing the gender part: the possible values are S, which becomes "Single", M, which becomes "Married", and null, which becomes "n/a". With that we are applying data standardization to this column too. Execute it, and as you can see, we no longer have the short codes; we have full, friendly values for the status as well as the gender, and at the same time we are handling the nulls in both columns. So we are done with those two columns, and we can move to the last one, the create date. For this kind of information we make sure the column is a real date, not a string or varchar, and as we defined in the data type, it is a DATE, which is completely correct, so nothing to do for this column. The next step is to write the insert statement: go to the top of the query and say INSERT INTO silver-dot-CRM customer info, and then specify all the columns that should be inserted.
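The standardization logic for both low-cardinality columns, with the UPPER and TRIM hardening described above, can be sketched as (column names are illustrative):

```sql
-- Data normalization: map coded values to friendly full names,
-- and replace missing values with a default.
SELECT
    CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
         WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
         ELSE 'n/a'                       -- default for NULL/unknown codes
    END AS cst_gndr,
    CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
         WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
         ELSE 'n/a'
    END AS cst_marital_status
FROM bronze.crm_cust_info;
```

UPPER and TRIM act as defensive normalization: even if a future load delivers a lowercase or padded code, the mapping still fires instead of silently falling through to 'n/a'.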
So let's type them out, something like this, with the query below, and execute it. With that we have inserted clean data into the silver table. Now we take all the queries we used to check the quality of the bronze layer, copy them into another query, and replace "bronze" with "silver". First the primary key check: execute it, perfect, no results, so no duplicates. The same for the next one, the first name check against silver: run it, no results, perfect, no issues. You can of course check the last name too: run it again, no results. And we can check the low-cardinality columns, like the gender: execute it, and we see "n/a", "Male", and "Female", perfect. Finally, have a last look at the silver customer info table: all the columns look perfect, and you can see the metadata column we added to the table definition is working; it now tells us when we inserted all those records, which is amazing information to have for tracking and auditing. Okay, now, looking back at the script, we have done several types of data transformations. The first, on the first name and last name, is trimming: removing unwanted spaces. This is one type of data cleansing, where we remove unnecessary spaces or unwanted characters to ensure data consistency. Moving on to the next transformation, the CASE WHEN: what we have done here is data normalization, sometimes called data standardization. This transformation is also a type of data cleansing.
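Putting the whole cleansing pipeline together, the insert described above can be sketched as one statement combining the deduplication, trimming, and standardization steps (all names illustrative):

```sql
-- Load cleansed customer data from bronze into silver.
INSERT INTO silver.crm_cust_info (
    cst_id, cst_key, cst_firstname, cst_lastname,
    cst_marital_status, cst_gndr, cst_create_date
)
SELECT
    cst_id,
    cst_key,
    TRIM(cst_firstname),                  -- remove unwanted spaces
    TRIM(cst_lastname),
    CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
         WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
         ELSE 'n/a' END,                  -- standardize + handle nulls
    CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
         WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
         ELSE 'n/a' END,
    cst_create_date
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cst_id
                              ORDER BY cst_create_date DESC) AS flag_last
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL              -- drop null primary keys
) AS ranked
WHERE flag_last = 1;                      -- keep only the latest record
```

Note the metadata column `dwh_create_date` is deliberately absent from the column list; its DEFAULT fills it in automatically.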
It maps coded values to meaningful, user-friendly descriptions, and we did the same transformation for the gender. Another transformation in the same CASE WHEN is handling missing values: instead of nulls we have "n/a". Handling missing data is also a type of data cleansing, where we fill in the blanks, for example with a default value, so instead of an empty string or a null we get a default like "n/a" or "unknown". Another type of transformation in this script is removing duplicates, also a type of data cleansing: we ensure only one record per primary key by identifying and retaining only the most relevant row, so there are no duplicates in our data. And as we remove the duplicates, we are of course also doing data filtering. Those are the different types of data transformations in this script. All right, moving on to the second table in the bronze layer from the CRM, the product info. As usual, before writing any transformations, we search for data quality issues, starting with the primary key: we check whether it has duplicates or nulls, so we group the data by the primary key and also check for nulls. Execute it, and as you can see, everything is fine: no duplicates or nulls in the primary key. Moving on to the next column, the product key: this column packs a lot of information into one string, so we have to split it into two pieces; we are deriving two new columns. Let's start with the first one, the category ID: the first five characters are actually the category ID, and we can use the SUBSTRING function to extract part of a
string. It needs three arguments: the column we want to extract from, the position where extraction starts, and the length, that is, how many characters to take. Since the first part is on the left side, we start from position one, and we need five characters, so the length is five. That's it for the category ID; let's execute it. As you can see, we have a new column called category ID, and it contains the first part of the string.

In our database, the other source system (ERP) also has a category ID, so let's double-check that we will be able to join the data together. We query the ID from the ERP category table in bronze: those are the IDs of the categories, and in the gold layer we will have to join those two tables. But here we still have an issue: the ERP table has an underscore between the category and the subcategory, while our derived column actually has a minus. We have to replace the minus with an underscore so the information matches between the two tables; otherwise we will not be able to join them. So we use the REPLACE function, replacing the minus with an underscore, something like this. If we execute it now, we get an underscore exactly like the other table. Of course, we can check that everything matches with a very simple query: we say this new column NOT IN, followed by a subquery, so we are trying to find any category ID that is not available in the second table. Executing it, as you can see, we have only one category that is not matching; we cannot find it in that table, which may well be correct. If you look over here (let me make it a little bigger), we are not
finding this one category in the other table, which is fine, so our check is okay.

With that, we have the first part; now we have to extract the second part, and we will do the same thing with SUBSTRING and its three arguments on the product key. This time we will not start cutting from the first position; we have to start in the middle, at position seven. Then we have to define the length, how many characters to extract. But if you look at the data, the product keys have different lengths; it is not fixed like the category ID, so we cannot use a hard-coded number. We have to make it dynamic, and there is a trick for that: we use the length of the whole column, LEN, as the third argument. That way we always take enough characters and never lose any information; the extraction is dynamic rather than a fixed length. With that, we have the product key, so let's execute it; as you can see, we are now extracting the second part of the string.

Now, why do we need the product key? We need it to join with another table called sales details, so let's check that table. The column name is sls_prd_key, so we query it from the bronze CRM sales table, and the data looks fine, so we should be able to join this information together. Of course, let's verify that: we add a WHERE clause on our new column, NOT IN the subquery, just to make sure we are not missing anything. Executing it, it looks like there are a lot of products that don't have any orders. I don't have a good feeling about that, so let's try something: WHERE sls_prd_key LIKE this value over here. I'll cut off the last few characters to search inside this table, and we really don't
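The two derived columns described above can be sketched as follows. The object and column names (bronze.crm_prd_info, prd_id, prd_key, cat_id) are assumptions for illustration, since the transcript does not spell them out. Note that in SQL Server, a SUBSTRING length that runs past the end of the string simply returns the remainder, which is why LEN(prd_key) works as a dynamic length:

```sql
SELECT
    prd_id,
    -- first five characters are the category id;
    -- align the CRM's '-' with the ERP side's '_' so the tables can be joined
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    -- everything from position 7 onward; LEN() keeps the length dynamic
    SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key
FROM bronze.crm_prd_info;
```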
have such keys. Let me cut another character and search again; we don't have that either. Anything starting with those characters has no orders at all, so let's remove the filter. But we are still able to join the tables, right? If I flip the NOT IN to IN, we can match all those products, which means everything is actually fine: these are simply products that don't have any orders. With that, I'm happy with this transformation.

Now moving on to the next one, we have the name of the product. We can check whether there are unwanted spaces: go to our quality checks, make sure to use the same table, use the product name, and check whether anything differs after trimming. Running it, it looks really fine; we don't have to trim anything, this column is safe. Now moving on to the next one, we have the cost. These are numbers, and we have to check the quality of the numbers: whether we have NULLs or negative values. Negative costs or negative prices are not realistic, depending on the business of course; let's say in our business we don't allow negative costs. So we check whether anything is less than zero, or whether any cost is NULL. Executing it: as you can see, we don't have any negative values, but we do have NULLs. We can handle that by replacing the NULL with a zero, if the business allows it. In SQL Server there is a very handy function for this called ISNULL: we are saying, if the value is NULL, replace it with a zero. It is as simple as that; we give it a name, of course, and execute. As you can see, we no longer have any NULLs; we have zeros, which is better for calculations if you later use aggregate functions like the
average. Now moving on to the next one, we have the product line. This is again an abbreviation of something, and the cardinality is low, so let's check all possible values in this column using DISTINCT on prd_line. Executing it, the possible values are NULL, M, R, S, and T. Again, these are abbreviations, but in our data warehouse we have decided to use full, friendly names, so we have to replace those codes with friendly values. Of course, to find out what they stand for, I usually go and ask an expert on the source system or the business process.

So let's start building our CASE WHEN, using UPPER as well as TRIM just to make sure we cover all the cases. If prd_line equals 'M', the friendly value is 'Mountain'. Then the next one (I'll just copy and paste): if it is 'R', it is 'Road'. Another one: the 'S' stands for 'Other Sales', and the 'T' stands for 'Touring'. At the end we add an ELSE for the unknown, 'n/a', because we don't want any NULLs. That's it; we name it product line as before, remove the old column, and execute. As you can see, we no longer have those codes and abbreviations; we now have full, friendly values. I'll capitalize the 'O' in 'Other Sales' so it looks nicer.

Now, looking at this CASE WHEN, you can see that we are always mapping one value to another value, and we keep repeating UPPER and TRIM every time. There is a quick form of the CASE WHEN for exactly this situation, a simple value mapping. The syntax is very simple: we say CASE, then the column, so we are evaluating this expression, and then we just say WHEN without the equals sign. So if
it is an 'M', make it 'Mountain', and the same for the rest. With that, the functions appear only once, and we don't have to keep repeating them over and over. This form works only if you are mapping values; if you have complex conditions, you use the full form. For now I'm going to stay with the quick form of the CASE WHEN; it looks nicer and shorter. Executing it, we get the same results.

Okay, so now back to our table, the last two columns: the start and end date. Together they define an interval, so let's check the quality of the start and end dates. We SELECT * from our bronze table and search for records where the end date is smaller than the start date (prd_start_dt). Querying this, you can see the start is sometimes after the end, which makes no sense at all, so we have a data issue with these two dates.

Now, for this kind of data transformation, what I usually do is grab a few examples, put them in Excel, and think about how to fix it. Here I took two products, this one and this one, with three rows each, and we have this situation. So the question is, how are we going to fix it? I'll make a copy to try one solution, a very simple one: switch the start date with the end date. If I take the end dates and put them at the start, things look nicer, right? The start is now always earlier than the end. But my friends, the data now makes no sense: we say the price was 12 from 2007 to 2011, yet an overlapping interval says it was 14. Take the year 2010, for example: in 2010 the price was 12 and at the same time 14, so it is really bad to have an overlap between those
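The two CASE forms discussed above, side by side. The column names and the M/R/S/T mapping follow the transcript; the surrounding SELECT and table name are illustrative assumptions:

```sql
-- Searched form: UPPER/TRIM repeated in every branch.
SELECT CASE WHEN UPPER(TRIM(prd_line)) = 'M' THEN 'Mountain'
            WHEN UPPER(TRIM(prd_line)) = 'R' THEN 'Road'
            WHEN UPPER(TRIM(prd_line)) = 'S' THEN 'Other Sales'
            WHEN UPPER(TRIM(prd_line)) = 'T' THEN 'Touring'
            ELSE 'n/a'
       END AS prd_line
FROM bronze.crm_prd_info;

-- Quick (simple) form: the expression is written once,
-- and each WHEN is an implicit equality comparison against it.
SELECT CASE UPPER(TRIM(prd_line))
            WHEN 'M' THEN 'Mountain'
            WHEN 'R' THEN 'Road'
            WHEN 'S' THEN 'Other Sales'
            WHEN 'T' THEN 'Touring'
            ELSE 'n/a'
       END AS prd_line
FROM bronze.crm_prd_info;
```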
two dates. It should start in 2007 and end in 2011, and then the next record should start from 2012 and end with something else; there should be no overlap between the intervals. So it is not enough to say the start must always be smaller than the end; additionally, the end of the current history record must be earlier than the start of the next record. That is the rule that guarantees no overlapping. Also, this record here has no start but already has an end, which is not okay, because every new record in a historization has to have a start; so this record is wrong as well. On the other hand, it is fine to have a start without an end: in that scenario it simply indicates the current information about the cost. So again, this switch-the-dates solution does not work at all.

Now for the real solution: let's completely ignore the source end date and take only the start dates. I'll paste them over here, and now we rebuild the end date entirely from the start date, following the rules we defined. The rule says: the end date of the current record comes from the start date of the next record. So this end date here comes from this value, the next record's start; we take the next start date and put it as the end date of the previous record. With that, as you can see, it works: the end date is after the start date, and we are making sure this date does not overlap with the next record. To make it even nicer, we can subtract one day, taking the previous day, so the end date is strictly smaller than the next start. For the next record, this one over here, the end date again comes from the following start date; we take it, put it as the end, and subtract one to get the previous day. Comparing those two, it is still after the start, and if you compare it
with the next record, this one over here, it is still earlier than the next start, so there is no overlap. And for the last record, since there is no following record, it will be NULL, which is totally fine. As you can see, I'm really happy with this scenario. Of course, you should validate this with an expert from the source system; let's say I've done that and they approved it, so now I can clean up the data using this new logic. This is how I usually brainstorm about fixing an issue: if it is something complex, I use Excel and then discuss it with the expert using concrete examples. That is way better than showing them database queries; it just makes things easier to explain and to discuss. My usual approach is to focus only on the columns I need, and only one or two scenarios, while building the logic; once everything is ready, I integrate it into the query.

So now, focusing only on these columns and only these products, let's build our logic. In SQL, if you are at a specific record and want to access information from another record, there are two amazing window functions: LEAD and LAG. In this scenario we want to access the next record, which is why we go with LEAD. Let's build it: LEAD of the start date, because we want the start date of the next record, then OVER, and we have to partition the data. The window should focus on only one product, and that is the product key, not the product ID, so we divide the data by product key. Of course we also have to sort the data: ORDER BY the start date, ascending, from lowest to highest. Let's give it a name, say "test", just to inspect the data. Executing it... I think I missed something here; the error mentions
PARTITION BY, so let's fix that and execute again. Now let's check the results for the first partition: the start is 2011 and the end is 2012, and that end came from the next record; the next record's start moved to the previous record, and the same for the following record. So our logic is working. The last record is NULL because we are at the end of the window and there is no next row, which is exactly what we want. It looks really good, but what is missing is the previous day: we subtract one day, very simply with minus one, so there is no overlap between consecutive intervals. As you can see, we have just built a proper end date, which is way better than the original data from the source system. Now let's take this expression and put it inside our query: we don't need the source end date, we use our new end date; remove the "test" alias and execute. It looks perfect.

All right, we are not quite done with these two dates. They are currently datetimes, but there is no time information; the time is always zero, so it makes no sense to keep it in our data. We can do a very simple CAST and make these columns DATE instead of DATETIME, first for the start and then for the end. Trying it out, it is nicer: no time information. Of course, we could tell the source system about all these issues, but since they don't provide the time, there is no reason to store date and time.

Okay, it was a long run, but we now have clean product information, far nicer than the original product information from the source CRM. If you grab the DDL of the silver table, you can see that it doesn't have a category ID; we have the product ID and the product
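The end-date derivation described above, including the one-day subtraction and the cast to DATE, can be sketched like this. DATEADD(DAY, -1, ...) is used here instead of the transcript's bare minus one (both work on DATETIME in SQL Server, but DATEADD is explicit), and all object names are illustrative assumptions:

```sql
SELECT
    prd_key,
    CAST(prd_start_dt AS DATE) AS prd_start_dt,
    -- end date = one day before the NEXT record's start date for the same product;
    -- the last record per product gets NULL (the current, open-ended version)
    CAST(DATEADD(DAY, -1,
                 LEAD(prd_start_dt) OVER (PARTITION BY prd_key
                                          ORDER BY prd_start_dt)) AS DATE) AS prd_end_dt
FROM bronze.crm_prd_info;
```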
key, and for those two date columns we changed the data type: it is DATETIME in the DDL, but we changed it to DATE. That means we have to make a few modifications to the DDL. So we go over here and add the category ID, using the same data type, and for the start and end dates it is now DATE, not DATETIME. That's it for now; let's execute it to repair the DDL. This is what happens in the silver layer: sometimes we have to adjust the metadata, when the data types are not good or we are building new derived information in order to integrate the data later. It stays very close to the bronze layer, but with a few modifications, so make sure to update your DDL scripts.

The next step is to insert the data into the table: we insert the result of this query, which cleans up the bronze table, into the silver table. As we've done before: INSERT INTO the silver product info, then list all the columns (I've prepared them already). With that, we can run our query to insert the data, and as you can see, SQL did insert it. The very important step now is to check the quality of the silver table, so we go back to our data quality checks and switch them to silver. Check the primary key: no issues. Check the trims: no issues either. Check the costs: nothing negative or NULL, which is perfect. Check the data standardization: the values are friendly and there are no NULLs. And now the interesting one, the order of the dates: checking it, there are no issues. Finally, as always, I have a final look at the silver table, and everything is inserted correctly into the correct columns. All these columns come from the source system, and the last one is generated automatically from the DDL, indicating when we loaded the table.

Now let's sit back and look at our script: what are the different types of data transformations we have done here? First, with the category ID and the product key, we derived new columns. That is when we create a new column based on calculations or transformations of an existing one. Sometimes we need columns only for analytics, and we cannot go to the source system each time and ask them to create them; instead, we derive the columns we need ourselves. Another transformation is the ISNULL: we are handling missing information, replacing NULL with zero. One more, for the product line, is data normalization: instead of a coded value we have a friendly value, and we also handled missing data there, replacing NULL with 'n/a'. Moving on, we have done data type casting, converting one data type to another, which also counts as a data transformation. And finally, alongside more casting, we have done data enrichment: this type of transformation is all about adding value to your data, adding new, relevant data to the data set. Those are the different types of data transformations for this table.

Okay, let's keep going. Next is the sales details, the last table from the CRM. What do we have here? First the order number, which is a string; of course, we can check whether it has unwanted spaces, searching with TRIM, something like this, and let's go and execute it.
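The unwanted-spaces check just mentioned can be sketched as a query that should return no rows. The names (bronze.crm_sales_details, sls_ord_num) are assumptions for illustration:

```sql
-- Unwanted spaces in the order number: any row returned needs trimming.
SELECT sls_ord_num
FROM bronze.crm_sales_details
WHERE sls_ord_num != TRIM(sls_ord_num);
```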
we can see that we don't have any unwanted spaces, which means we don't have to transform this column; we can leave it as it is. Now, the next two columns are keys and IDs used to connect this table with the others: as we learned before, the product key connects to the product info, and the customer ID connects to the customer ID in the customer info. That means we have to check that everything will join correctly. So we check the integrity of these columns: WHERE the product key NOT IN a subquery, and this time we can work against the silver layer, selecting the product key from silver product info. Querying this, as you can see, we get no issues, meaning all the product keys from the sales details can be connected to the product info. The same check for the customer ID, this time against the customer info (the column was cst_id): querying it, again no issues. That means we can connect the sales with the customers using the customer ID, and no transformations are needed. Things look really good for those three columns.

Now we come to the challenging part: the dates. These dates are not actual dates; they are integers, and we don't want them like this. We would like to clean them up and change the data type from integer to date. Now, when converting an integer to a date, we have to be careful about the values inside each of those columns. So let's check the quality of the order date first: WHERE the order date is less than zero, for example, something negative. Well, we don't have any negative values, which is good. Next, let's check whether we have any zeros. Well, this is bad: we have a lot of zeros.
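The two referential-integrity checks and the zero check described above might look like the sketch below (names assumed; each of the first two queries should return no rows):

```sql
-- Orphaned product keys: sales rows with no matching product.
SELECT sls_prd_key
FROM bronze.crm_sales_details
WHERE sls_prd_key NOT IN (SELECT prd_key FROM silver.crm_prd_info);

-- Orphaned customer IDs: sales rows with no matching customer.
SELECT sls_cust_id
FROM bronze.crm_sales_details
WHERE sls_cust_id NOT IN (SELECT cst_id FROM silver.crm_cust_info);

-- Integer "date" columns: look for zeros and negatives.
SELECT sls_order_dt
FROM bronze.crm_sales_details
WHERE sls_order_dt <= 0;
```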
what we can do is replace those values with NULL, using the NULLIF function: NULLIF says, if it is zero, make it NULL. Executing it, as you can see, all those values are now NULL. Now let's examine the data again: this integer has the year at the start, then the month, then the day, so the length of each number should be eight. If the length is less than or greater than eight, we have an issue. Let's check that: OR LEN of the order date is not equal to eight, which covers both shorter and longer. Executing it and checking the results, these two values don't look like dates at all; we cannot turn them into real dates, they are just bad data. Of course, you can also check the boundaries of a date: for example, it should not be higher than, say, 2050, plus a month and a day, and you can also say it should not be lower than whenever your business started. We are getting these values here because they already fail the length check, but if you had values around those boundary dates, the query would catch them as well, so we can add those conditions too. All these checks validate a column that holds date information but has the integer data type.

So again, what are the issues here? We have zeros, and sometimes strange numbers that cannot be converted to dates. Let's fix that in our query: CASE WHEN the order date equals zero OR the length of the order date is not equal to eight, THEN NULL. We don't want to deal with those values; they are just wrong, and
they are not real dates. Otherwise, ELSE, we keep the order date. Now we convert it to a date, because we don't want this as an integer. How? In SQL Server you cannot cast directly from integer to date; you first have to cast it to VARCHAR, and then from VARCHAR you go to a DATE. So we cast it first to VARCHAR, then cast that to DATE, add the END, and use the same column name. That is how we transform an integer into a date. Querying this, the order date is now a real date, not a number, so we can get rid of the old column.

Now we do the same for the shipping date: replace everything with the shipping date and query. Well, the shipping date is perfect; there are no issues with this column. Still, I don't like that we found so many issues in the order date, so just in case the same thing happens to the shipping date in the future, I'll apply the same rules to it. If you prefer not to apply them now, you should at least build quality checks that run every day to detect such issues; once you detect them, you can do the transformations. But for now I'm applying the rules right away. That's the shipping date; now we go to the due date and do the same test. Executing it, it is fine as well, but still I'm going to apply the same rules: put the due date everywhere in the expression, making sure you don't miss anything, and execute. Perfect: we now have the order date, shipping date, and due date, all typed as DATE, with no bad data inside those columns. Now, there is still one more check we can do, and it is
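The integer-to-date conversion described above, with the invalid-value guard, can be sketched like this for the order date; the same pattern is then repeated for the shipping and due dates (names assumed):

```sql
SELECT
    CASE WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL  -- zeros / wrong length
         -- SQL Server cannot cast INT directly to DATE: go via VARCHAR
         ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
    END AS sls_order_dt
FROM bronze.crm_sales_details;
```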
that the order date should always be earlier than the shipping date and the due date, because it makes no sense to deliver an item before it is ordered: first the order happens, then we ship the items. So there is an ordering among these dates, and we can check it. We search for invalid date orders: WHERE the order date is greater than the shipping date, OR the order date is greater than the due date. Checking it: that's really good, we don't have such mistakes in the data, and the quality looks fine. The order date is always before the shipping date and the due date, so no transformations or cleanup are needed here.

Okay friends, now moving on to the last three columns: the sales, the quantity, and the price. All of this information is connected, and we have a business rule: the sales must equal the quantity multiplied by the price, and sales, quantity, and price must all be positive numbers; negative, zero, and NULL are not allowed. Those are the business rules, and we have to check whether the data in our table is consistent with them. We start with the main rule: WHERE the sales is not equal to the quantity multiplied by the price, searching for rows where the result doesn't match our expectation. We can also check other things, like the NULLs: OR sales IS NULL, OR quantity IS NULL, and the same for the price. And we can check for negative numbers or zero: less than or equal to zero, applied to all three columns. With that, we are checking the calculation as well as NULL, zero, and negative values. Let's go and check our data; I'm
going to add a DISTINCT here, then query it. Of course, we have bad data, but let's sort the results by the sales, quantity, and price. Now, looking at the data: in the sales we have NULLs, negative numbers, and zeros, so all the bad combinations, and we also have bad calculations. For example, the price here is 50 and the quantity is one, but the sales is two, which is not correct. Here as well the calculations are wrong: we should have 10 here and nine there, or maybe the price is wrong. Looking at the quantity, there are no NULLs, zeros, or negatives, so the quantity looks better than the sales. And looking at the prices, we have NULLs and negatives, though no zeros. So the quality of the sales and the price is bad: the calculation doesn't hold, and we have these scenarios.

Of course, in such cases I don't try to transform everything on my own. I usually go and talk to an expert, maybe someone from the business or from the source system, show them these scenarios, and discuss. Usually there are two possible answers. Either they tell me "you know what, I will fix it in my source", in which case I have to live with it: there is incoming bad data, and the bad data stays visible in the warehouse until the source system cleans up those issues. Or the answer is "we don't have the budget, that data is really old, and we are not going to do anything". Then you have to decide: either leave it as it is, or improve the quality of the data yourself. But here you have to ask the experts to support you in solving these issues, because it really depends on their rules; different rules lead to different transformations.

So now, let's say we have the following rules. If the sales value is NULL, negative, or zero, then use the calculation, the formula, by multiplying
the quantity by the price. If the price is wrong, for example NULL or zero, then calculate it from the sales and the quantity. And if the price is negative, like minus 21, convert it to positive 21, without any calculation. Those are the rules, and now we are going to build the transformations based on them, step by step.

I'll start with the new sales. CASE WHEN, as usual: if the sales IS NULL, or the sales is a negative number or equal to zero, or, another scenario, we have a sales value but it doesn't follow the calculation, so the sales is not equal to the quantity multiplied by the price. Of course, we will not use the price as-is: we wrap it in ABS, the absolute value, which converts everything from negative to positive. THEN we use the calculation, the quantity multiplied by the price; in that case we are not using the value that comes from the source system, we are recalculating it. Now, if the sales is correct and none of those scenarios apply, we say ELSE and go with the sales as it comes from the source, because it is correct. Then END, and we give it the same name; I'll rename the old one as the old value, and the same for the price. The quantity we will not touch, because it is correct.

Now let's transform the price, again with a CASE WHEN. The scenarios: if the price IS NULL, or the price is less than or equal to zero, then we do the calculation: the sales divided by the quantity. But here we have to make sure we are not dividing by zero. Currently we don't
have any zeros in the quantity, but you never know: in the future you might get a zero, and the whole code would break. So what you have to do is say: if you ever get a zero, replace it with NULL. That's NULLIF: if it is zero, make it NULL. Now, if the price is not NULL and not negative or zero, everything is fine, and that's why the ELSE keeps the price as it comes from the source system. That's it: END AS price. I'm totally happy with that; let's execute it and check, of course.

These are the old values, and these are the new, transformed, cleaned-up values. Here we previously had a NULL, but now we have two: two multiplied by one gives two, so the sales is correct. Moving on to the next one: the old sales is 40, but the price is two, and two multiplied by one should give two, so the new sales is correctly two, not 40. Next: the old sales is zero, but multiplying the price of four by the quantity gives four, so the old sales was wrong, and the new sales correctly shows four. Now let's take a negative: here we have a minus, which is not correct; taking the price multiplied by one, we should get nine, and the new sales is correct. Next, a scenario where the price is NULL: we have no price, but we calculated it from the sales and the quantity, dividing 10 by two to get five, so the new price is better. And the same for the negatives: we had minus 21, and in the output we have 21, which is correct. For now I don't see any scenario where the result is wrong; everything looks better than before. With that, we have applied the business rules from the experts and cleaned up the data in the data warehouse, and this is way better than before because we are presenting better data for
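Putting the agreed business rules into SQL, the sales and price derivations might be sketched as follows. The rules are the ones stated above; the object and column names are assumptions for illustration:

```sql
SELECT
    CASE WHEN sls_sales IS NULL OR sls_sales <= 0
           OR sls_sales != sls_quantity * ABS(sls_price)
         THEN sls_quantity * ABS(sls_price)         -- recalculate bad sales values
         ELSE sls_sales                             -- source value is already correct
    END AS sls_sales,
    sls_quantity,                                   -- quantity is already clean
    CASE WHEN sls_price IS NULL OR sls_price <= 0
         THEN sls_sales / NULLIF(sls_quantity, 0)   -- derive; NULLIF guards divide-by-zero
         ELSE sls_price
    END AS sls_price
FROM bronze.crm_sales_details;
```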
But it is challenging, and you have to understand the business exactly. Now we copy those expressions and integrate them into our query: instead of the sales we take our new calculation, and instead of the price our corrected calculation (I was missing an END here). Run the whole thing again, and with that we now have cleaned sales, quantity, and price following our business rules. With that we are done cleaning the sales details. The next step is to insert them into the sales details table, but first we have to check the DDL again. All you have to do is compare those results with the DDL. The first one, the order number, is fine, then the product key, the customer ID, but here we have an issue: those date columns are now DATE and not INTEGER, so we have to change the data types, and with that we have better data types than before. The sales, quantity, and price are correct. Let's drop the table and create it from scratch, and don't forget to update your DDL script. So that's it; we insert the results into our silver sales details table, listing all the columns. I have already prepared the list, so make sure you have the correct column order. Insert the data, and we can see that SQL inserted the data into our sales details. Now it is very important to check the health of the silver table, so instead of bronze we switch the schema to silver. Checking here, the order date is always smaller than the shipping and due dates, which is really nice. But I'm most interested in the calculations, so I switch those checks from bronze to silver as well and get rid of the helper calculations, because we no longer need them. Now let's see
whether we have any issue. Perfect, our data follows the business rules: we don't have any NULLs, negative values, or zeros. Now, as usual, the last step is a final look at the table: we have the order number, the product key, the customer ID, the three dates, the sales, quantity, and price, and of course our metadata column. Everything is perfect. So, looking at our code, what are the different types of data transformations we are doing? In those three date columns we are handling invalid data, which is also a type of transformation, and at the same time we are doing data type casting, changing to a more correct data type. If you look at the sales, we are handling missing data and invalid data by deriving the column from already existing ones, and it is very similar for the price: we are handling invalid data by deriving it from a specific calculation. Those are the different types of data transformations in this script. All right, let's keep moving to the next source system table, the ERP customer table (AZ12). Here we have only three columns; let's start with the ID. Again we have customer information, and if we check our integration model, we can see that we can connect this table with the CRM table customer info using the customer key. That means we have to make sure we can join those two tables. Let's check the other table, in the silver layer of course, and query both tables. Now we can see there are extra characters that are not included in the customer key from the CRM. Let's search for one of these customers, WHERE cid LIKE that pattern, so we are searching for a customer with a similar ID. Now, as you can
see, we find this customer, but the issue is those three extra characters, 'NAS', at the start. There is no specification or explanation of why they are there, so what we actually have to do is remove them; we don't need them. Checking the data again, it looks like the old records have 'NAS' at the start, while the newer records don't have those three characters, so we have to clean up these IDs in order to join them with the other tables. We do it like this, starting with CASE WHEN, since we have two scenarios in our data: if the cid starts with the three characters 'NAS', we apply a transformation function; otherwise it stays as it is. Now we build the transformation using SUBSTRING: first we define the string, the cid; then the position where it starts extracting, and counting 1, 2, 3, the prefix ends at position 3, so we define position number 4; and then how many characters should be extracted. I will make that dynamic with LEN instead of counting manually: LEN(cid). So if the ID starts with 'NAS', extract from the cid at position 4 the rest of the characters. Let's execute it (I was missing a comma): where there is no 'NAS' at the start, and as you scroll down, those records are not affected. With that we have a nice ID to join with the other table. Of course we can test it: WHERE the whole transformation (removing the alias, with the simple SUBSTRING) is NOT IN a SELECT DISTINCT of the cst_key, the customer key, from the silver CRM customer info table. Let's go and check.
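The prefix cleanup and the join test just described can be sketched like this (the table and column names, `bronze.erp_cust_az12`, `cid`, and `cst_key`, are assumed from context):

```sql
-- clean the id: drop the 'NAS' prefix where it exists
SELECT
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))  -- start at position 4, take the rest
        ELSE cid
    END AS cid
FROM bronze.erp_cust_az12
-- validation: after cleaning, every id should match a CRM customer key,
-- so this query is expected to return zero rows
WHERE CASE WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid)) ELSE cid END
      NOT IN (SELECT DISTINCT cst_key FROM silver.crm_cust_info);
```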
As you can see, it works fine: after the transformation we are not able to find any unmatching data between the customer info from the ERP and the CRM. Of course, if I just remove the transformation, we find a lot of unmatching data, which means our transformation is working perfectly, and we can remove the original value. That's it for the first column. Now moving on to the next field, the birthdate of the customers. The first thing to do is check the data type: it is a DATE, so it is not an integer or a string, and we don't have to convert anything. But still there is something to check with the birthdates: whether we have values out of range. For example, we can check for really old dates, say before '1924-01-01', the first day of the month. Let's check that. Well, it looks like we have customers that are older than 100 years. I don't know, maybe this is correct, but it does sound strange, so it is something to confirm with the business. Then we can check the other boundary, where it is almost impossible: a customer whose birthdate is in the future, so the birthdate is greater than the current date. Querying this will not work at first, because we have to put an OR between the two conditions. Now, checking the list, we have dates that are invalid for birthdates: they are all in the future, which is totally unacceptable and an indicator of bad data quality. Of course you can report it to the source system in order to correct it. Here it's up to you what to do with those dates: either leave them as they are, as bad data, or clean them up by replacing all of them with NULL, or maybe
replace only the extreme ones that are 100% incorrect. Let's write the transformation for that. As usual we start with CASE WHEN: the birthdate is larger than the current date and time, THEN NULL; otherwise we have an ELSE that keeps the birthdate as it is; and then END AS the birthdate. Let's execute it, and with that we should not get any customer with a birthday in the future. That's it for the birthdates. Now let's move to the next one, the gender. Again, the gender information is low cardinality, so we have to check all the possible values inside this column, and for that we use SELECT DISTINCT gen FROM our table. Execute it, and the data doesn't look really good: we have a NULL, an 'F', an empty string, 'Male', 'Female', and again an 'M'. This is not really good, so we are going to clean up all those values in order to have only three: male, female, and not available. We do it like this: CASE WHEN, and we TRIM the values just to make sure there are no empty spaces, and I'm also going to use the UPPER function, just to make sure that if we get any lowercase values in the future, we are covering all the different scenarios. So: when the trimmed, uppercased value is an 'F' or 'FEMALE', then make it 'Female', and we do the same thing for the male: if it is an 'M' or 'MALE' (capital letters, because we are using UPPER), then it is 'Male'; otherwise, in all other scenarios, whether it is an empty string or a NULL and so on, it should be not available. Then of course we have an END AS gen. Now let's test it and check whether we have covered everything: the 'M' is now 'Male', the NULL is not available, the 'F' is 'Female', and the empty string (or maybe spaces) is not available.
'Female' stays as it is, and the same for 'Male'. With that we are covering all the scenarios and following the standards of our project. I'm going to cut this and put it into our original query over here, execute the whole thing, and with that we have cleaned up all three columns. Now the question: did we change anything in the DDL? Well, we didn't change anything; we didn't introduce any new column or change any data type. That means the next step is to insert it into the silver layer. As usual, we say INSERT INTO the silver ERP customer table and list all the column names: the cid, the birthdate, and the gender. All right, let's execute it, and we can see it inserted all the data. Of course, the very important next step is to check the data quality, so let's go back to our query and change the schema from bronze to silver. Checking the silver layer, we are of course still getting those very old customers, but we didn't change that; we only changed the birthdates in the future, and we don't see them in the results anymore, so everything is clean. For the next one, let's check the distinct genders, and as you can see, we have only those three values. And of course we can take a final look at the table: you can see the cid, the birthdate, the gender, and then our metadata column, and everything looks great. So that's it. What are the different types of data transformations we have done? First, for the ID, we handled invalid values by removing the part that is not needed. The same goes for the birthdates: we handled invalid values there as well. And for the last one, the gender, we did data normalization, mapping the codes to friendlier values, and at the same time we handled the missing values. Those are the types of transformations in this code.
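Put together, the cleanup for this table can be sketched as follows (the column names `bdate` and `gen`, and the 'n/a' label standing in for "not available", are assumptions based on the transcript):

```sql
SELECT
    CASE WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid)) ELSE cid END AS cid,
    CASE
        WHEN bdate > GETDATE() THEN NULL    -- birthdates in the future are invalid
        ELSE bdate
    END AS bdate,
    CASE
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'                          -- empty strings, NULLs, anything unexpected
    END AS gen
FROM bronze.erp_cust_az12;
```

TRIM plus UPPER makes the mapping robust against stray spaces and any lowercase variants that might show up in future loads.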
Okay, moving on to the second table, the location information: ERP location (A101). The task here is easy because we have only two columns. If we check the integration model, we can find our table and see that we can connect it with the customer info from the other system, matching the cid with the customer key. Those two values must match in order to join the tables, which means we have to check the data. Let's select the cst_key from the silver customer info. Now, looking at the result, you can see we have an issue with the cid: there is a minus sign between the characters and the numbers, but in the customer key there is nothing splitting the characters from the numbers. If you try to join those two columns, it will not work, so what we have to do is get rid of the minus, because it is totally unnecessary. The fix is very simple: we take the cid and REPLACE the '-' with nothing. Let's query it again, and with that the values look very similar to each other. We can also verify it: WHERE our transformation is NOT IN the other table, used as a subquery. Execute it, and as you can see, we are not finding any unmatching data now. That means our transformation is working, and we can connect those two tables together. If I take the transformation away, you can see that we find a lot of unmatching data, so the transformation is okay, and we stay with it. Now let's talk about the countries. We have multiple values here, and this is low cardinality, so we have to check all the possible values inside this column. That means we are
checking whether the data is consistent: SELECT DISTINCT the country from our table, and I'll also ORDER BY the country. Let's check the values. We have a NULL, an empty string (which is really bad), then full country names, and also abbreviations of the countries. This mix is not really good: sometimes we have 'DE' and sometimes 'Germany', then we have 'United Kingdom', and for the United States we have three versions of the same information. So the quality here is not good; let's work on the transformation. As usual we start with CASE WHEN: if TRIM(country) is equal to 'DE', we transform it to 'Germany'. The next one is about the USA: if TRIM(country) is IN those two values, 'US' and 'USA', then it becomes 'United States', and with that we have covered those three cases. Now we have to talk about the NULL and the empty string: when TRIM(country) is equal to the empty string OR the country IS NULL, then it is not available. Otherwise I would like to get the country as it is, but as TRIM(country), just to make sure we don't have any leading or trailing spaces. That's it: END AS country. It is working, and the country information is transformed. Now I'll take the whole new transformation and compare it to the old one; let me call the old column the old country and query it. The 'DE' is now 'Germany', the empty string is not available, the NULL the same, the 'United Kingdom' stays as before, and we now have one value for all those variations: only 'United States'. It looks perfect, and with that we have cleaned the second column as well, so we now have clean results. The next question: did we change anything in the DDL? Well, we haven't changed anything; both columns are still VARCHAR, so we can go immediately and insert into our table: INSERT INTO the silver customer location, specifying the columns, which is very simple, the id and the country. Let's execute it, and as you can see, all those values were inserted. As the next step we of course double-check the data: I'll remove all the helper logic and switch from bronze to silver. All the country values look good, so let's have a final look at the table: we have the IDs without the separator, the countries, and our metadata information. With that we have cleaned up the data for the location. Okay, so what are the different types of data transformations we have done here? First, we handled invalid values: we replaced the minus with an empty string. For the country we did data normalization, replacing codes with friendly values, and at the same time we handled missing values by replacing the empty string and NULL with not available. And one more thing: of course we removed the unwanted spaces. Those are the different types of transformations for this table. Okay guys, keep the energy up, keep the spirit up: we have to clean up the last table in the bronze layer, and of course we cannot skip anything; we have to check the quality and detect all the errors. Now we have a table about the categories for the products, with four columns. Let's start with the first one, the ID. As you can see in our integration model, we can connect this table with the product info from the CRM using the product key, and as you remember, in the silver
layer we created an extra column for that in the product info. If you select that data, you can see a column called category ID, and it exactly matches the ID that we have in this table. We have already done the testing, so this ID is ready to be used together with the other table; there is nothing to do here. Now for the next columns: they are strings, and of course we can check whether there are any unwanted spaces. So let's check: SELECT * FROM the same table, and first we check the category, where the category is not equal to the category after trimming the unwanted spaces. Execute it, and as you can see, we don't have any results, so there are no unwanted spaces. Let's check the other column, the subcategory, with the same kind of query: again nothing, which means no unwanted spaces in the subcategory either. Now the last column: I'll just copy and paste, get the maintenance, and execute. Again no results. Perfect, we don't have any unwanted spaces inside this table. The next step is to check the data standardization, because all these columns have low cardinality. So: SELECT DISTINCT the category from our table (I'll just copy and paste it) and check all the values. As you can see, we have accessories, bikes, clothing, and components; everything looks perfect, and we don't have to change anything in this column. Let's check the subcategory: scrolling down, all the values are friendly and nice, so nothing to change here either. And let's check the last column, the maintenance: perfect, we have only two values, yes and no, and we don't have any NULLs. So, my friends, that means this table has really nice data quality, and we
don't have to clean up anything. But still we have to follow our process and load it from bronze to silver, even if we didn't transform anything, so our job here is really easy: INSERT INTO the silver ERP PX category table, and we define the columns, the ID, the category, the subcategory, and the maintenance. That's it; let's insert the data. Now, as usual, we check the data: query the silver ERP PX table and have a look. All right, we can see the IDs, the categories, the subcategories, the maintenance, and our metadata column, so everything is inserted correctly. All right, so now I have all those queries and insert statements for all six tables, and here is something important: before inserting any data, we have to make sure we are truncating, emptying, the table, because if you run this query twice, you will be inserting duplicates. So first truncate the data, and then do a full load and insert all the data. We have one step before the insert, like in the bronze layer: we say TRUNCATE TABLE, truncating the silver customer info, and only after that do we insert the data. Of course we can also print a nice message at the start: first that we are truncating the table, and then that we are inserting. If I run the whole thing, it works, and if I run it again, we will not have any duplicates. So we have to add this step before each insert; let's do that. All right, I'm done with all the tables, so let's run everything. Execute it, and we can see in the messages that everything works perfectly: with that we emptied all the tables and then inserted the data. Perfect, we now have a nice script that loads the silver layer, but of course, like the bronze layer, we are going to put everything into one stored procedure. Let's do that. We go to
the beginning over here and say CREATE OR ALTER PROCEDURE, put it in the schema silver, and use the naming convention load_silver. Then we say BEGIN, take the whole code (it is a long one), give it one push with a tab, and at the end we say END. Perfect, we have our stored procedure, but we forgot the AS here; with that added, we will not get any error. Let's execute it, and the stored procedure is created: if you go to Programmability, you will find two procedures, load_bronze and load_silver. So now let's try it out. All you have to do is execute silver.load_silver. Execute the stored procedure, and we get the same results; this stored procedure is now responsible for loading the whole silver layer. Now, of course, the messaging here is not really good yet, because as we learned in the bronze layer, we can add many things: handling errors, nicer messaging, catching the duration time. So now your task is to pause the video, take this stored procedure, and transform it to be very similar to the bronze one, with the same messaging and all the add-ons we added there. Pause the video now; I will do it as well offline, and I will see you soon. Okay, I hope you are done, and I can show you the results. Like in the bronze layer, we have defined a few variables at the start in order to catch the duration: the start time, the end time, the batch start time, and the batch end time. Then we are printing a lot of messages in order to have nice output: at the start we say loading the silver layer, and then we start splitting by source system, loading the CRM tables. I'm going to show you only one table for now. We set the timer, saying start time, and get the current date and time into it. Then we do the usual: we truncate the table and insert the new information after cleaning
it up. And we have this nice message, the load duration, where we find the difference between the start time and the end time using the function DATEDIFF, and we show the result in seconds, so we are just printing how long it took to load this table. We repeat this process for all the tables, and of course we put everything inside TRY and CATCH: SQL will try to execute the TRY part, and if there are any issues, it executes the CATCH, where we print a few pieces of information like the error message, the error number, and the error state. We are following exactly the same standard as the bronze layer. Let's execute the whole thing, and with that we have updated the definition of the stored procedure. Now let's execute it: EXECUTE silver.load_silver. It runs very fast, in less than a second, again because we are working on a local machine: loading the silver layer, loading the CRM tables, and we can see this nice messaging. It starts with truncating the table, then inserting the data, and we get the load duration for each table. You will see that everything is below one second here; in a real project it will of course take longer. At the end we have the load duration of the whole silver layer. Now I have one more thing for you. Let's say you are changing the design of this stored procedure for the silver layer: you are adding different types of messaging, or maybe creating logs and so on. All those new ideas and redesigns that you do for the silver layer, you always have to think about bringing the same changes into the other stored procedure for the bronze layer. Always try to keep your code following the same standards; don't have one idea in one stored procedure and an old idea in another one. Always try to maintain those scripts and keep them all up to date following the same standards; otherwise it
can be really hard for other developers to understand the code. I know that needs a lot of work and commitment, but this is your job: make everything follow the best practices and the naming conventions and standards that you put in place for your project. So, guys, we now have two very nice ETL scripts: one that loads the bronze layer and another one for the silver layer. Running our data warehouse is now very simple. First you run the bronze layer, and with that we take all the data from the CSV files at the source and put it inside our data warehouse in the bronze layer, refreshing the whole bronze layer. Once it's done, the next step is to run the stored procedure of the silver layer: once you execute it, you take all the data from the bronze layer, transform it, clean it up, and load it into the silver layer. As you can see, the concept is very simple: we are just moving the data from one layer to another, with different tasks. All right, guys, so in the silver layer we have done a lot of data transformations and covered all the types we have in data cleansing: removing duplicates, data filtering, handling missing data, invalid data, unwanted spaces, casting data types, and so on. We have also derived new columns, done data enrichment, and normalized a lot of data. What we have not done yet: business rules and logic, data aggregations, and data integration; that is for the next layer. All right, my friends, we are finally done cleaning up the data and checking its quality, so we can close those two steps. Now, for the next step, we have to extend the data flow diagram, so let's go. Okay, let's extend our data flow for the silver layer. What I'm going to do is copy the whole thing, put it side by side with the bronze layer, and call it the silver layer. The table
names stay the same as before, because it is one-to-one like the bronze layer, but we change the coloring: I'll mark everything and make it gray, like silver. And what is very important is to draw the lineage, so I take an arrow from each bronze table to the corresponding silver table. With that we have lineage across three layers: if you are looking at the customer info table, you can understand that it comes from the bronze layer's customer info, which in turn comes from the source system CRM. So now you can see the lineage between the different layers, and without looking at any scripts, in one picture you can understand the whole project. I don't have to explain a lot: just by looking at this picture you can understand how the data flows between the sources, the bronze layer, the silver layer, and of course, later, the gold layer. It looks really nice and clean. All right, with that we have updated the data flow; next we commit our work in the git repo. Let's go. We go to the folder scripts, where we have a silver folder (if you don't have it, you can of course create it). First we put in the DDL scripts for the silver layer: I'll paste the code, and as usual we have a comment at the header explaining the purpose of the script. Let's commit our work, and we do the same thing for the stored procedure that loads the silver layer. I already have a file for that, so let's paste it. Here is our stored procedure, and as usual, at the start there is a header: this script does the ETL process, loading the data from bronze into silver; the action is to truncate each table first and then insert transformed, cleansed data from bronze to silver.
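The procedure described above can be sketched as a skeleton like this. Only the name `silver.load_silver` comes from the transcript; the example table, its columns, and the message texts are stand-ins, and the full cleansing logic for each table is abbreviated.

```sql
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME,
            @batch_start_time DATETIME, @batch_end_time DATETIME;
    BEGIN TRY
        SET @batch_start_time = GETDATE();
        PRINT 'Loading Silver Layer';

        -- one timed truncate-then-insert block per table; shown for one table only
        SET @start_time = GETDATE();
        PRINT '>> Truncating Table: silver.erp_loc_a101';
        TRUNCATE TABLE silver.erp_loc_a101;          -- empty first so re-runs never duplicate
        PRINT '>> Inserting Data Into: silver.erp_loc_a101';
        INSERT INTO silver.erp_loc_a101 (cid, cntry)
        SELECT REPLACE(cid, '-', ''), TRIM(cntry)    -- full cleansing CASE logic goes here
        FROM bronze.erp_loc_a101;
        SET @end_time = GETDATE();
        PRINT '>> Load Duration: '
            + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';

        SET @batch_end_time = GETDATE();
        PRINT 'Whole Silver Layer Load Duration: '
            + CAST(DATEDIFF(SECOND, @batch_start_time, @batch_end_time) AS NVARCHAR) + ' seconds';
    END TRY
    BEGIN CATCH
        PRINT 'ERROR OCCURRED DURING LOADING SILVER LAYER';
        PRINT 'Error Message: ' + ERROR_MESSAGE();
        PRINT 'Error Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error State  : ' + CAST(ERROR_STATE() AS NVARCHAR);
    END CATCH
END;
```

Once created, the whole layer is refreshed with a single call: `EXECUTE silver.load_silver;`.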
There are no parameters at all, and this is how you can use the stored procedure. Okay, we commit our work, and there is one more thing to commit in our project: all those queries you built to check the quality of the silver layer. This time we will not put them in scripts; we go to the tests folder and make a new file called quality checks silver, and inside it we paste all the queries we built; I just reorganized them by table. Here we can see all the checks we did during the course, and at the header there are nice comments: this script checks the quality of the silver layer, looking for NULLs, duplicates, unwanted spaces, invalid date ranges, and so on. Each time you come up with a new quality check, I recommend you share it with the project and with the other teams, to make it part of the set of checks you run after the ETLs. So that's it: I'll put those checks into our repo, and in case I come up with a new check, I'll update the file. Perfect, now our code is in the repository, so it is safe, and we are done with the whole epic: we have built the silver layer. Let's minimize it, and now we come to my favorite layer, the gold layer. We are going to build it, and the first step, as usual, is to analyze; this time we explore the business objects. So let's go. All right, now we come to the big question: how are we going to build the gold layer? As usual we start with analyzing. What we do here is explore and understand the main business objects that are hidden inside our source systems. As you can see, we have two sources and six files, and we have to identify the business objects. Once we have that understanding, we can start coding, and here the main
transformation we are doing is data integration, and I usually split it into three steps. First, we build the business objects we have identified. Once we have a business object, we look at it and decide what type of table it is: is it a dimension, is it a fact, or is it maybe a flat table? And the last step: we rename all the columns into something friendly and easy to understand, so that our consumers don't struggle with technical names. Once we have all those steps done, it's time to validate what we have created: the new data model should be connectable, and we have to check that the data integration was done correctly. And once everything is fine, we cannot skip the last step: we document and commit our work in git. Here we will introduce new types of documentation: a diagram of the data model, a data dictionary where we describe the data model, and of course we can extend the data flow diagram. This is our process; those are the main steps we will do in order to build the gold layer. Okay, so what exactly is data modeling? Usually the source system delivers raw data: unorganized, messy, not very useful in its current state. Data modeling is the process of taking this raw data and organizing and structuring it in a meaningful way. What we do is put the data into new, friendly, easy-to-understand objects like customers, orders, and products, each of them focused on specific information, and, very importantly, we describe the relationships between those objects by connecting them with lines. What we have built on the right side is called a logical data model, and if you compare it to the left side, you can see the data model makes it
really easy to understand our data and the relationship the processes behind them now in data modeling we have three different stages or let’s say three different ways on how to draw a data model the first stage is the conceptual data model here the focus is only on the entity so we have customers orders products and we don’t go in details at all so we don’t specify any columns or attributes inside those boxes we just want to focus what are the entities that we have and as well the relationship between them so the conceptual data model don’t focus at all on the details it just gives the big picture so the second data model that we can build is The Logical data model and here we start specifying what are the different columns that we can find in each entity like we have the customer ID the first name last name and so on and we still draw the relationship between those entities and as well we make it clear which columns are the primary key and so on so as you can see we have here more details but one thing we don’t describe a lot of details for each column and we are not worry how exactly we going to store those tables in the database the third and last stage we have the physical data model this is where everything gets ready before creating it in the database so here you have to add all the technical details like adding for each column the data types and the length of each data type and many other database techniques and details so again if if you look to the conceptual data model it gives us the big picture and in The Logical data model we dive into details of what data we need and the physical layer model prepares everything for the implementation in the database and to be honest in my projects I only draw the conceptual and The Logical data model because drawing and building the physical data model needs a lot of efforts and time and there are many tools like in data bricks they automatically generate those models so in this project what we’re going to do we’re 
going to draw The Logical data model for the gold layer all right so now for analytics and specially for data warehousing and business intelligence we need a special data model that is optimized for reporting and analytics and it should be flexible scalable and as well easy to understand and for that we have two special data models the first type of data model we have the star schema it has a central fact table in the middle and surrounded by Dimensions the fact table contains transactions events and the dimensions contains descriptive informations and the relationship between the fact table in the middle and the dimensions around it forms like a star shape and that’s why we call it star schema and we have another data model called snowflake schema it looks very similar to the star schema so we have again the fact in the middle and surrounded by Dimensions but the big difference is that we break the dimensions into smaller subdimensions and the shape of this data model as you are extending the dimensions it’s going to look like a snowflake so now if you compare them side by side you can see that the star schema looks easier right so it is usually easy to understand easy to query it is really perfect for analyzes but it has one issue with that the dimension might contain duplicates and your Dimensions get bigger with the time now if you compare to the snowflake you can see the schema is more complex you so you need a lot of knowledge and efforts in order to query something from the snowflake but the main advantage here comes with the normalization as you are breaking those redundancies in small tables you can optimize the storage but to be honest who care about the storage so for this project I have chose to use the star schema because it is very commonly used perfect for reporting like for example if you’re using power pii and we don’t have to worry about the storage so that’s why we going to adapt this model to build our gold layer okay so now one more thing about 
those data models is that they contain two types of tables fact and dimensions so when I I say this is a fact table or a dimension table well the dimension contains descriptive informations or like categories that gives some context to your data for example a product info you have product name category subcategories and so on this is like a table that is describing the product and this we call it Dimension but in the other hand we have facts they are events like transactions they contain three important informations first you have multiple IDs from multiple dimensions then we have like the informations like when the transaction or the event did happen and the third type of information you’re going to have like measures and numbers so if you see those three types of data in one table then this is a fact so if you have a table that answers how much or how many then this is a fact but if you have a table that answers who what where then this is a dimension table so this is what dimension and fact tables all right my friends so so far in the bronze layer and in the silver layer we didn’t discuss anything about the business so the bronze and silver were very technical we are focusing on data Eng gestion we are focusing on cleaning up the data quality of the data but still the tables are very oriented to the source system now comes the fun part in the god layer where we’re going to go and break the whole data model of the sources so we’re going to create something completely new to our business that is easy to consume for business reporting and analyzes and here it is very very important to have a clear understanding of the business and the processes and if you don’t know it already at this phase you have really to invest time by meeting maybe process experts the domain experts in order to have clear understanding what we are talking about in the data so now what we’re going to do we’re going to try to detect what are the business objects that are hidden in the source 
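The fact/dimension distinction above can be sketched as two tiny DDL examples. The table and column names here are illustrative placeholders, not the ones from our source systems:

```sql
-- Dimension: descriptive attributes that answer "who / what / where"
CREATE TABLE dim_products (
    product_key   INT,           -- surrogate key, generated in the warehouse
    product_name  NVARCHAR(50),
    category      NVARCHAR(50),
    subcategory   NVARCHAR(50)
);

-- Fact: events with dimension keys, dates, and measures,
-- answering "how much / how many"
CREATE TABLE fact_sales (
    order_number  NVARCHAR(50),
    product_key   INT,           -- ID pointing to dim_products
    customer_key  INT,           -- ID pointing to dim_customers
    order_date    DATE,          -- when the event happened
    sales_amount  INT,           -- measure
    quantity      INT            -- measure
);
```

Notice the three ingredients of a fact all appear: dimension IDs, a date, and measures.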
So now, let's go and explore that. All right, in order to build a new data model, I first have to understand the original data model: what are the main business objects, and how are things related to each other? This is a very important step in building a new model. What I usually do is start giving labels to all those tables. So in the shapes panel over here, let's search for "label"; under more icons I'm going to take this label, drag and drop it, and then increase the font size, let's go with 20 and bold, to make it a little bit bigger. Now, looking at this data model, we can see that we have product information in the CRM as well as in the ERP, and then we have customer information and a transactional table. Let's focus on the products first. The product information is over here: we have the current and the historical product information, and here we have the categories that belong to the products. So in our data model we have something called products; let's create this label, call it "products", and give it a color, for example red. Now let's move this label and put it beneath this table, so I have a label saying this table belongs to the object called products. I'll do the same thing for the other table over here, tagging it as products as well, so I can easily see which tables from the sources hold information about the product business object.

Moving on, we have here a table called customer information, with a lot of information about the customer, and we also have the ERP customer information, where we have the birthdate and the country. Those three tables have to do with the object customer, so let's label them accordingly: call it "customer" and pick a different color, let's go with green. I'll tag this table, and the same for the other tables: copy, tag the second table and the third one. Now it is very easy for me to see which table belongs to which business object. And we have one final table about the sales and orders; in the ERP we don't have any information about that, so this one is going to be easy. Let's call it "sales", move it over here, and change its color, for example to this one. This step is very important when building any data model in the gold layer: it gives you a big picture of the things you are going to model. The next step is to build those objects step by step. Let's start with the first object, our customers: here we have three tables, and we're going to start with the CRM, so let's start with this table over here. All right, with that we know what our business objects are and this task is done. In the next step we'll go back to SQL and start doing the data integration and building a completely new data model. So let's go and do that.

First, let's have a quick look at the gold layer specifications. This is the final stage: we are going to provide data to be consumed by reporting and analytics. This time we will not be building tables, we will be using views, which means we won't have stored procedures or any load process for the gold layer. All we are doing is data transformation, and the focus of the transformations will be data integration, aggregation, business logic, and so on. This time we will also introduce a new data model: we will build a star schema. Those are the specifications for the gold layer, and this is our scope. This time we make sure that we are selecting data from the silver layer, not from the bronze, because the bronze has bad data quality; in the silver everything is prepared and cleaned up, so to build the gold layer we will be targeting the silver layer.

Let's start with SELECT * FROM the silver CRM customer info table and hit execute. Now we select the columns that we want to present in the gold layer: we have the ID, the key, the first name, and so on. I will not take the metadata columns; those belong only to the silver layer. Perfect. Next, I'm going to give this table an alias, let's call it ci, and make sure we are selecting from this alias, because later we're going to join this table with other tables. So we'll go with those columns. Now let's move to the second table and get the birthday information. We jump to the other system, and we have to join the data by matching its customer ID with our customer key. Here I try to avoid using an INNER JOIN, because if the other table doesn't have information for all the customers, I might lose customers. Always start with the master table, and when you join it with any other table to pick up information, avoid the INNER JOIN, because the other source might not have all the customers, and with an INNER JOIN you would lose them. So I tend to start from the master table, and everything else is a LEFT JOIN. I'm going to say LEFT JOIN silver ERP customer az12, give it the alias ca, and join the tables on the customer key from the first table being equal to ca's customer ID. Of course we are going to get matching data, because we checked this in the silver layer; if we hadn't prepared the data there, we would have to add a preparation step here in order to join the tables, but we don't have to, because that was a prep step in the silver layer. Now you can see the systematics that we have in this bronze-silver-gold architecture.

After joining the tables, we pick the information we need from the second table, which is the birthdate, bdate, and there is another nice piece of information in this table: the gender. That's all we need from the second table. Let's check the third table: it holds the location information, the countries, and again we connect the tables by matching the customer ID with the key. So let's do that: LEFT JOIN silver ERP location, with the alias la, and join on the keys the same way, ci's customer key equal to la's customer ID. Again, we prepared those IDs and keys in the silver layer, so the join should work. Now we pick the data from this table: we have the ID, the country, and the metadata columns, so let's just take the country. Perfect. With that we have joined all three tables and picked all the columns we want in this object: we have collected all the customer information we have from the two source systems.

Okay, now let's query it to make sure everything is correct. In order to see whether your joins are correct, keep an eye on the three joined columns: if you are getting data, the joins are correct, but if you see a lot of NULLs, or no data at all, your joins are incorrect. For me it looks like it is working. Another check I do: even if your first table has no duplicates, after doing multiple joins you might start getting duplicates, because the relationship between those tables is not a clear one-to-one relationship.
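Putting the steps so far together, the customer join plus the duplicate check look roughly like this. This is a sketch with assumed silver table and column names (silver.crm_cust_info and friends); adjust them to your own schema:

```sql
-- Master table first, then LEFT JOINs so no customer is lost
SELECT
    ci.cst_id,
    ci.cst_key,
    ci.cst_firstname,
    ci.cst_lastname,
    ca.bdate,           -- birthdate from the ERP
    la.cntry            -- country from the ERP location table
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca
    ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101 AS la
    ON ci.cst_key = la.cid;

-- Check: the joins must not duplicate the primary key
SELECT cst_id, COUNT(*)
FROM (
    SELECT ci.cst_id
    FROM silver.crm_cust_info AS ci
    LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
    LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid
) AS t
GROUP BY cst_id
HAVING COUNT(*) > 1;    -- we expect no rows back
```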
You might get a one-to-many or even a many-to-many relationship instead. So the check I usually do at this stage is to make sure I don't have duplicates in the result, that is, multiple rows for the same customer. To do that, we do a quick GROUP BY: we wrap the whole query as a subquery, group the data by the customer ID, and say HAVING COUNT(*) greater than one. This query tries to find out whether we have any duplicates in the primary key. Let's execute it: we don't have any duplicates, which means that joining all those tables with the customer info didn't cause any issues and didn't duplicate my data. This is a very important check to make sure you are on the right track.

All right, so everything is fine regarding duplicates; we don't have to worry about them. But we do have an integration issue here. Let's execute the query again: if you look at the data, we have two sources for the gender information, one coming from the CRM and another from the ERP. So the question is: what are we going to do with this? Well, we have to do data integration. Let me show you how I do it. First I open a new query, remove everything else, and leave only those two columns with a DISTINCT, just to focus on the integration; let's execute, and maybe also add an ORDER BY 1, 2, and execute again. Now we have all the scenarios. Sometimes there is a match: the first table says female and the other table also says female. But sometimes we have an issue: the two tables give different information, and the same thing over here, different information again. In another scenario the first table says female but the other table says "not available"; this is not a problem, we can take the value from the first table. And we have the exact opposite scenario, where the value is not available in the first table but is available in the second. Now you might wonder why I'm getting a NULL over here: we handled all the missing data in the silver layer and replaced everything with "not available", so why are we still getting a NULL? This NULL doesn't come directly from the tables; it comes from joining the tables. There are customers in the CRM table that are not available in the ERP table, and if there is no match, SQL returns a NULL. So this NULL means there was no match; it is not coming from the content of the tables, and of course it is an issue too.

But the big issue is the scenario where both systems have data and it differs. Here again we have to ask the experts: what is the master here, the CRM system or the ERP? Let's say their answer is that the master for the customer information is the CRM, which means the CRM information is more accurate than the ERP information, for the customers at least. So for the scenario where we have female and male, the correct information is the female from the first source system; the same goes over here, and where we have male and female, the correct one is the male, because that source system is the master.

Now let's build this business rule. We start as usual with a CASE WHEN. The first, very important rule: if we have a value in the gender information from the CRM system, from the master, then use it. So we check that the gender from the CRM table is not equal to "not available", which means we have a value, male or female, and in that case we use the value from the master: the CRM is the master for the gender info. Otherwise, that means the value is not available in the CRM table, so we grab the information from the second table: ca's gender. But we have to be careful: the NULL over here has to be converted to "not available" as well, so we use COALESCE; if the value is NULL, use "not available" instead. Then END, and let's call the result new_gen for now. Let's execute it and check the different scenarios. For all those values we have data from the CRM system, and that is what is represented in the new column. For the second part we don't have data from the first system, so we try to get it from the second one: in the first case it is "not available", so the ELSE branch is activated; the value is NULL, the COALESCE kicks in, and we replace the NULL with "not available". In the second scenario the first system doesn't have the gender information either, so we grab it from the second and get a female. The third one is the same: no information in the first, but we get the male from the second source system. And the last one is not available in both source systems, which is why we get "not available". So as you can see, we have a perfect new column where we have integrated two different source systems into one, and this is exactly what we call data integration. This piece of information is better than the source CRM and the source ERP alone: it is richer and carries more information, and this is exactly why we pull data from different source systems, to get richer information in the data warehouse.
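As a sketch of this integration rule, with cst_gndr and gen as my assumed names for the CRM and ERP gender columns:

```sql
-- One row per distinct gender combination, plus the integrated value
SELECT DISTINCT
    ci.cst_gndr,
    ca.gen,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr  -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')                -- NULL here comes from the join, not the table
    END AS new_gen
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca
    ON ci.cst_key = ca.cid
ORDER BY 1, 2;
```

Building the rule over a DISTINCT of just these two columns first makes every scenario visible before the logic is carried back into the main query.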
So now we have a nice logic, and as you can see, it is much easier to build the logic in a separate query first and then carry it over to the original query. I'm just going to copy everything from here, go back to our query, delete the two gender columns, and put our new logic over here with a comma. Let's execute: with that we have our nice new column. Now we have a very nice object: we don't have duplicates, and we have integrated the data together; we took three tables and put them into one object.

The next step is to give nice, friendly names. The rule in the gold layer is to use friendly names and not to follow the names we get from the source system, and we have to make sure we follow our naming conventions, so snake_case. Let's do it step by step. The first one we call customer_id; the next one, since I'm getting rid of "keys" and so on, I'm going to call customer_number, because those are customer numbers. The next one will be first_name, without any prefixes, and then last_name. We have the marital status here, so I'll use the same name but without the prefix: marital_status; this one we'll just call gender, this one create_date, this one birthdate, and the last one country. Let's execute: as you can see, the names are really friendly; we have customer_id, customer_number, first_name, last_name, marital_status, gender, really nice and easy to understand. Next, I'm going to think about the order of those columns. The first two, first_name and last_name, make sense together; then I think the country is a very important piece of information, so I'm going to take it and put it right after the last name, it's just nicer. Let's execute again: first name, last name, country; it's always nice to group relevant columns together, right? Then we have the marital status, gender, and so on, and then the create date and the birthdate. I'm going to switch the birthdate with the create date, since it is more important, like this, and not forget a comma. Execute again: it looks wonderful.

Now comes a very important decision about this object: is it a fact table or a dimension? Well, as we learned, dimensions hold descriptive information about an object, and as you can see, we have here descriptions of the customers. All those columns describe the customer; we don't have transactions and events, and we don't have measures. So we cannot say this object is a fact; it is clearly a dimension, and that's why we're going to call this object the customer dimension.

There is one more thing: if you are creating a new dimension, you always need a primary key for it. Of course, we could rely on the primary key that we get from the source system, but sometimes you have dimensions without a primary key you can count on. In that case we have to generate a new primary key in the data warehouse, and those primary keys are called surrogate keys. A surrogate key is a system-generated unique identifier assigned to each record to make it unique. It is not a business key; it has no meaning, and no one in the business knows about it. We only use it to connect our data model, and this way we have more control over how to connect the model and we don't have to depend on the source system. There are different ways to generate surrogate keys, for example defining them in the DDL, or using the window function ROW_NUMBER; in this data warehouse I'm going with the simple solution, the window function. So, to generate a surrogate key for this dimension, it is very simple: we say ROW_NUMBER() OVER, and since we have to order by something, you could order by the create date, the customer ID, or the customer number, whatever you want; in this example I'm going to order by the customer ID. We also follow the naming convention that all surrogate keys get "key" as a suffix, so customer_key. Now let's query: at the start we have a customer key, and it is a sequence, of course without any duplicates. This surrogate key is generated in the data warehouse, and we're going to use it to connect the data model. With that, our query is ready, and the last step is to create the object. As we decided, all the objects in the gold layer are going to be virtual ones, which means we're going to create a view: CREATE VIEW gold.dim_customers.
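Assembled into one view, it might look like this; the silver column names are still my assumptions, and the gold object name follows the naming convention from the transcript:

```sql
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,  -- surrogate key
    ci.cst_id             AS customer_id,
    ci.cst_key            AS customer_number,
    ci.cst_firstname      AS first_name,
    ci.cst_lastname       AS last_name,
    la.cntry              AS country,
    ci.cst_marital_status AS marital_status,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr  -- CRM is the master for gender
        ELSE COALESCE(ca.gen, 'n/a')
    END                   AS gender,
    ca.bdate              AS birthdate,
    ci.cst_create_date    AS create_date
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid;
```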
The "dim" prefix follows our naming convention and stands for dimension, then comes the object name, customers, and after that AS. With that everything is ready; let's execute it. It was successful: if you go to the views now, you can see our first object, the dimension customers in the gold layer. And as you know me, next we're going to check the quality of this new object. Let's open a new query: SELECT * FROM our view, dim customers, and make sure everything is in the right position. Now we can do different checks, like uniqueness and so on, but I'm mostly worried about the gender information, so let's get the DISTINCT of all its values. As you can see, it is working perfectly: we have only female, male, and not available. So that's it: with that we have our first new dimension.

Okay friends, now let's build the second object: the products. As you can see, product information is available in both source systems. As usual, we start with the CRM information and then join it with the other table in order to get the category information. Those are the columns that we want from this table. Now we come to a big decision about this object: it contains historical information as well as the current information. Whether you need the history depends on the requirements; if you don't have a requirement to analyze historical information, you can keep only the current product information, and we don't have to include all the history in the object. Also, as we learned from the model over here, we are not using the primary key, we are using the product key. So what we have to do now is filter out the historical data and keep only the current data. For that we'll add a WHERE condition, and to select the current data we're going to target the end dates.
If the end date is NULL, the record is current. Take this example: we have three records for the same product key. The first two records have a value in the end date, because they are historical, but the last record has a NULL, because it is the current information; it is open and not closed yet. So to select only the current information, it is very simple: we say WHERE the product end date IS NULL. If you execute it now, you get only the current products, with no history at all. Of course we can add a comment to it: filter out all historical data. And this means we don't need the end date in our selection anymore, because it is always NULL. With that we have only the current data.

The next step is to join it with the product categories from the ERP, and we're going to use the ID. As usual, the master information is the CRM and everything else is secondary; that's why I use a LEFT JOIN, just to make sure I'm not losing or filtering any data, because with an INNER JOIN, if there is no match, we lose data. So: LEFT JOIN silver ERP category, let's call it pc, and join it using the key, the category ID from the CRM equal to pc's ID. Now we pick columns from the second table: from pc we take the category, very important, the subcategory, and we can also take the maintenance column, something like this. Let's query: those columns come from the first table and these three come from the second, so with that we have collected all the product information from the two source systems. The next step is to check the quality of these results, and what is very important here is to check the uniqueness.
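The current-products query with the category join might look like this; as before, the silver table and column names are assumptions standing in for whatever your sources use:

```sql
SELECT
    pn.prd_id,
    pn.prd_key,
    pn.prd_nm,
    pn.cat_id,
    pc.cat,            -- category from the ERP table
    pc.subcat,
    pc.maintenance,
    pn.prd_cost,
    pn.prd_line,
    pn.prd_start_dt
FROM silver.crm_prd_info AS pn
LEFT JOIN silver.erp_px_cat_g1v2 AS pc
    ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL;   -- filter out all historical data
```

The end-date column itself is deliberately left out of the SELECT, since after the filter it would always be NULL.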
to go and have the following query I want to make sure that the product key is unique because we’re going to use it later in order to join the table with the sales so from and then we have to have group by product key and we’re going to say having counts higher than one so let’s go and check perfect we don’t have any duplicates the second table didn’t cause any duplicates for our join and as well this means we don’t have historical data and each product is only one records and we don’t have any duplicates so I’m really happy about that so let’s go in query again now of course the next step do we have anything to integrate together do we have the same information twice well we don’t have that the next step is that we’re going to go and group up the relevant informations together so I’m going to say the product ID then the product key and the product name are together so all those three informations are together and after that we can put all the category informations together so we can have the category ID the category itself the subcategory let me just query and see the results so we have the product ID key name and then we have the category ID name and the subcategory and then maybe as well to put the maintenance after the subcategory like this and I think the product cost and the line can start could stay at the end so let me just check so those three four informations about the category and then we have the cost line and the start date I’m really happy with that the next step we’re going to go and give n names friendly names for those columns so let’s start with the first one this is the product ID the next one going to be the product number we need the key for the surrogate key later and then we have the product name and after that we have the category ID and the category and this is the subcategory and then the next one going to stay as it is I don’t have to rename it the next one going to be the cost and the line and the last one will be the start dates so 
let’s go and execute it now we can see very nicely in the output all those friendly names for the columns and it looks way nicer than before I don’t have even to describe those informations the name describe it so perfect now the next big decision is what do we have here do we have a effect or Dimension what do you think well as you can see here again we have a lot of descriptions about the products so all those informations are describing the business object products we don’t have like here transactions events a lot of different keys and ideas so we don’t have really here a facts we have a dimension each row is exactly describing one object describing one products that’s why this is a dimension okay so now since this is a dimension we have to go and create a primary key for it well actually the surrogate key and as we have done it for the customers we’re going to go and use the window function row number in order to generate it over and then we have to S the data I will go with the start dates so let’s go with the start dates and as well the product key and we’re going to gra it a name products key like this so let’s go and execute it with that we have now generated a primary key for each product and we’re going to be using it in order to connect our data model all right now the next step we does we’re going to go and build the view so we’re going to say create view we’re going to say go and dimension products and then ask so let’s go and create our objects and now if you go and refresh the views you will see our second object the second dimension so we have here in the gold layer the dimension products and as usual we’re going to go and have a look to this view just to make sure that everything is fine so them products so let’s execute it and by looking to the data everything looks nice so with that we have now two dimensions all right friends so with that we have covered a lot of stuff so we have covered the customers and the products and we are left with only 
one table, where we have the transactions, the sales — and for the sales information we only have data from the CRM; we don't have anything from the ERP. So let's go and build it. Okay, so now I have all this information, and of course we have only one table, so we don't have to do any integration and so on. Now we have to answer the big question: do we have here a dimension or a fact? Well, by looking at these details we can see transactions, we can see events, we have a lot of date information, we have a lot of measures and metrics, and we also have a lot of IDs, so it is connecting multiple dimensions — and this is exactly the perfect setup for a fact. So we're going to treat this information as a fact. And of course, as we learned, a fact connects multiple dimensions, so we have to present in this fact the surrogate keys that come from the dimensions. These two columns — the product key and the customer ID — come from the source system, and as we learned, we want to connect our data model using the surrogate keys. So what we're going to do is replace those two columns with the surrogate keys that we have generated, and in order to do that we have to join the two dimensions to get the surrogate keys. We call this process a data lookup: we are joining the tables only in order to get one piece of information. So let's do that. We will go with a LEFT JOIN, of course, so as not to lose any transactions. First, we're going to join on the product key. Now, of course, in the silver layer we don't have any surrogate keys — we have them in the gold layer — so that means for the fact table we're going to be joining the silver layer together with the gold layer: gold dot dim_products, which I'm going to alias as pr, and we're going to join sd using the product key together with the product number from the dimension. And now the only information that we need from the dimension is the key
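The surrogate-key lookup just described can be sketched as follows — the column names mirror the walkthrough's conventions but are illustrative:

```sql
-- Fact query: replace the source-system keys with the surrogate keys
-- generated in the dimensions. LEFT JOINs are used as lookups so that
-- no transaction is lost even if a dimension row is missing.
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                  -- surrogate key from gold.dim_products
    cu.customer_key,                 -- surrogate key from gold.dim_customers
    sd.sls_order_dt AS order_date,
    sd.sls_ship_dt  AS shipping_date,
    sd.sls_due_dt   AS due_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM silver.crm_sales_details sd
LEFT JOIN gold.dim_products  pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold.dim_customers cu ON sd.sls_cust_id = cu.customer_id;
```

Note the column ordering mentioned later in the walkthrough: surrogate keys first, then dates, then measures.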
— the surrogate key. So we're going to go over here and say product key, and I'm going to remove this column from here because we don't need it: we don't need the original product key from the source system, we need the surrogate key that we have generated in our own data warehouse. The same thing is going to happen for the customer: gold dot dim_customers — again, we are doing a lookup here in order to get the information into sd, so we are joining using this ID over here equal to the customer ID, because this is a customer ID — and we do the same thing: we need the surrogate key, the customer key, and we're going to delete the ID because we don't need it now that we have the surrogate key. So let's execute it, and with that we have in our fact table the two keys from the dimensions, and this helps us connect the data model — connect the facts with the dimensions. This is a very necessary step when building the fact table: you have to put the surrogate keys from the dimensions into the fact. That was actually the hardest part of building the fact. Now, the next step — all you have to do is give friendly names. So we're going to go over here and say order number; the surrogate keys are already friendly; then this is the order date, the next one is going to be the shipping date, then the due date; the sales is going to be, I'm going to say, sales amount, then the quantity, and the final one is the price. Now let's execute it and look at the results. As you can see, the columns look very friendly. Now, about the order of the columns, we use the following schema: first in the fact table we have all the surrogate keys from the dimensions, second we have all the dates, and at the end you group all the measures and metrics. So that's it for the query for the fact; now we can go and build it. We're going to say create
a view in the gold layer, and this time we're going to use the fact_ prefix and call it fact_sales — and don't forget the AS. That's it, let's create it. Perfect, now we can see the fact. With that we have three objects in the gold layer: two dimensions and one fact. Now, of course, the next step: we're going to check the quality of the view. Let's have a simple SELECT from fact_sales. Let's execute it, and by checking the result you can see it is exactly like the result from the query, and everything looks nice. Okay, now one more trick that I usually do after building a fact: try to connect the whole data model in order to find any issues. So let's do that. We will do a simple LEFT JOIN with the dimensions: gold dot dim_customers c, and we will use the keys, and then we're going to say WHERE customer key IS NULL — so, where there is no match. Let's execute this, and as you can see in the results, we are not getting anything back; that means everything is matching perfectly. We can do the same thing with the products: LEFT JOIN gold dot dim_products p ON the product key, connecting it with the fact's product key, and then we check the product key from the dimension, like this. So we are checking whether we can connect the fact together with the dimension products. Let's check — and as you can see, again we are not getting anything back, and this is all right. With that, we now have SQL code that is tested and that creates the gold layer. Now, in the next step, as you know, in our requirements we have to make clear documentation for the end users of our data model, so let's go and draw a data model of the star schema. Let's search for a table shape, and I'm going to take this one where I can mark what the primary key is and what the foreign key is, and I'm going to go and
change the design a little bit: it's going to be rounded, I'll change to this color, go to the font size and make it 16, then select all the columns and make them 16 as well just to increase the size, and then go to Arrange and increase the width. Now let's zoom in a little. The first table — let's call it gold.dim_customers and make it a little bit bigger, like this. Now we're going to define the primary key here: it is the customer key. And what else are we going to do? We're going to list all the columns in the dimension — it is a little bit annoying, but the result is going to be awesome. So what do we have: the customer ID, the customer number, and then the first name. In case you want a new row, you can hold Control and press Enter, and then you can add the other columns. So now pause the video, and go and create the two dimensions — the customers and the products — and add all the columns that you have built in the views. Welcome back. So now I have those two dimensions; the third one is going to be the fact table. For the fact table I'm going to go with a different color, for example blue, and I'm going to put it in the middle, something like this. We're going to say gold.fact_sales, and here we don't have a primary key, so we're going to delete it, and I have to add all the columns of the fact: order number, product key, customer key — okay, all right, perfect. Now what we can do is add the foreign key information: the product key is a foreign key for the products, so we're going to say FK1, and the customer key is going to be the foreign key for the customers, so FK2 — and of course you can increase the spacing for that. Okay, so now after we have the tables, the next step in data modeling is to describe the relationships between these tables. This is of course very important for reporting and analytics
in order to understand how to use the data model. We have different types of relationships — one-to-one, one-to-many — and in a star schema data model, the relationship between the dimension and the fact is one-to-many. That's because in the customers table we have, for a specific customer, only one record describing that customer, but in the fact table the customer might exist in multiple records, because customers can order multiple times. That's why on the fact side it is "many" and on the dimension side it is "one". Now, in order to see all those relationships, we're going to go to the menu on the left side, and as you can see we have entity relations here, with different types of arrows: for example zero-to-many, one-to-many, one-to-one, and many other types of relations. So which one are we going to take? We're going to pick this one: it says "one, mandatory" — that means the customer must exist in the dimension table — to "many, optional". Here we have three scenarios: the customer didn't order anything, the customer ordered only once, or the customer ordered many things; that's why on the fact table side it is optional. So we're going to take this one and place it over here: we're going to connect the "one" part to the customer dimension and the "many" part to the fact — well, actually we have to do it on the customers side. With that, we are describing the relationship between the dimensions and the fact as one-to-many: "one" is mandatory on the customer dimension side and "many" is optional on the fact side. We have the same story for the products: the "many" part goes to the fact and the "one" goes to the products, so it's going to look like this. Each time you connect a new dimension to the fact table, it is usually a one-to-many relationship. You can also add anything you want to this model, for example a text note explaining something — for example, if you have some complicated calculations and
so on, you can write that information here. For example, we can say over here "sales calculation", make it a little bit smaller — let's go with 18 — and write the formula: sales equals quantity multiplied by price, and make this a little bit bigger. It is really nice info that we can add to the data model, and we can even link it to the column: we can take this arrow, like this, and link it to the column, and with that you also have a nice explanation of the business rule or calculation. So you can add any descriptions you want to the data model, just to make it clear for anyone using it. With that, you don't only have three tables in the database — you also have some documentation and explanation: at a glance, we can see how the data model is built and how you can connect the tables together. It is really amazing for all users of your data model. All right, so now we have a really nice data model, and in the next step we're going to quickly create a data catalog. All right, great — so we have a data model, and we can say we have something called a data product, and we will be sharing this data product with different types of users. And there's something that every data product absolutely needs, and that is the data catalog: a document that describes everything about your data model — the columns, the tables, and maybe the relationships between the tables as well. With that, you make your data product clear for everyone, and it's going to be much easier for them to derive more insights and reports from it. And what is most important: it is time-saving, because if you don't do that, every consumer, every user of your data product will keep asking you the same questions — what do you mean with this column, what is this table, how do I connect table A with the
table B — and you will keep repeating yourself and explaining stuff. So instead of that, you prepare a data catalog and a data model, and you deliver everything together to the users, and with that you save a lot of time and stress. I know it is annoying to create a data catalog, but it is an investment and a best practice. So now let's go and create one. Okay, in order to do that I have created a new file called data catalog in the documents folder, and what we're going to do here is very straightforward: we're going to make a section for each table in the gold layer. For example, we have here the table dim_customers. What you have to do first is describe the table — we are saying it stores customer details with demographic and geographic data — so you give a short description of the table, and then after that you list all the columns inside the table, and maybe the data types as well. But what is far more important is the description of each column: you give a very short description, for example here "the gender of the customer". Now, one of the best practices when describing a column is to give examples, because you can quickly understand the purpose of a column just by seeing an example. So here we can see that inside it we can find "Male", "Female", and "n/a", and with that the consumer of your table can immediately understand: aha, it will not be an "M" or an "F", it's going to be a full, friendly value — without having to query the content of the table, they can quickly understand the purpose of the column. With that, we have a full description of all the columns of our dimension. We do the same thing for the products — again a description for the table as well as a description for each column — and the same for the fact. So that's it: with that you have a data catalog for your data product at the gold layer, and the business user or the data analyst has a better and clearer
understanding of the content of your gold layer. All right my friends, that's all for the data catalog. In the next step, we're going back to Draw.io, where we're going to finalize the data flow diagram — so let's go. Okay, we're going to extend our data flow diagram, but this time for the gold layer. Let's copy the whole thing from the silver layer and put it over here, side by side, and of course we're going to change the coloring to gold, and then rename things — so this is the gold layer. But of course we cannot leave those tables like this; we have a completely new data model. What do we have here? We have fact_sales, dim_customers, and dim_products. So I'm going to remove all the old stuff — we have only three tables — and put those three tables somewhere here in the center. Now what you have to do is start connecting things. I'm going to go with this arrow over here, a direct connection, and start connecting: the sales details go to the fact table — maybe put the fact table over here — and then we have the dimension customers, which comes from the CRM customer info, and we have two tables from the ERP: it comes from this table as well, and the location from the ERP. The same goes for the products: it comes from the product info and from the categories from the ERP. Now, as you can see, we have crossing arrows, so we can select everything and say "line jumps" with a gap, and this makes the arrows a bit easier to follow visually. So now, for example, if someone asks you where the data for the dimension products comes from, you can open this diagram and tell them: okay, this comes from the silver layer — we have two tables, the product info from the CRM as well as the categories from the ERP — and those silver tables come from the bronze layer, and you can see the product
info comes from the CRM and the categories come from the ERP. It is very simple: we have just created a full data lineage for our data warehouse, from the sources into the different layers of the data warehouse — and data lineage is really amazing documentation that is going to help not only your users but the developers as well. All right, so with that we have a very nice data flow diagram and data lineage. We have completed the data flow — it really feels like progress, like achievement, as we tick off all those tasks — and now we come to the last task in building the data warehouse, where we're going to commit our work to the Git repo. Okay, let's put our scripts in the project. We go to the scripts — we have here bronze and silver, but we don't have a gold — so let's create a new file: gold/ and then ddl_gold.sql. Now we're going to paste our views — we have here our three views — and as usual, at the start we describe the purpose of the views. We are saying: create gold views — this script creates views for the gold layer; the gold layer represents the final dimension and fact tables (star schema); each view performs transformations and combines data from the silver layer to produce business-ready datasets, and those views can be used for analytics and reporting. That's it; let's commit it. Okay, with that, as you can see, we have the bronze and the silver, so we have all our ETLs and scripts in the repository. Now, for the gold layer as well, we're going to add all those quality checks that we used in order to validate the dimensions and facts. We go to the tests folder over here and create a new file — it's going to be quality_checks_gold, and the file type is SQL. Now let's paste our quality checks: we have the check for the fact and the two dimensions, as well as an explanation of the script — so we are
validating the integrity and accuracy of the gold layer: here we are checking the uniqueness of the surrogate keys and whether we are able to connect the data model. Let's put that in our Git as well and commit the changes, and in case we come up with new quality checks, we're going to add them to this script. Those checks are really important: if you are modifying the ETLs, or you want to make sure everything is fine after each ETL run, those scripts should run — it is like a quality gate for the gold layer. Perfect, so now we have our code in our repository. Okay friends, now what you have to do is finalize the Git repo. For example, all the documentation that we created during the project we can upload into the docs folder — you can see here the data architecture, the data flow, data integration, data model, and so on — and with that, each time you edit those pages you can commit your work and you have a versioned history of it. Another thing you can do is go to the README — for example, over here I have added the project overview, some important links, the data architecture, and a little description of the architecture, of course — and don't forget to add a few words about yourself and your profiles on the different social media platforms. All right my friends, with that we have completed our work and closed the last epic, building the gold layer, and with that we have completed all the phases of building a data warehouse. Everything is 100%, and this feels really nice. All right my friends, if you're still here and you have built the data warehouse with me, then I can say I'm really proud of you. You have built something really complex and amazing, because building a data warehouse is usually a very complex data project, and with that you have not only learned SQL but also how we run complex data projects in the real world. So with that you have
real knowledge as well as an amazing portfolio that you can share with others if you are applying for a job or want to showcase that you have learned something new. You have also experienced the different roles in the project — what data architects and data engineers do in complex data projects. That was really an amazing journey, even for me as I was creating this project. And with that, you have done the first type of data analytics project using SQL: data warehousing. In the next step we're going to do another type of project, exploratory data analysis (EDA), where we're going to understand and explore our datasets. If you like this video and you want me to create more content like this, I would really appreciate it if you support the channel by subscribing, liking, sharing, and commenting — all of that helps the channel with the YouTube algorithm, and my content will reach others as well. Thank you so much for watching, and I will see you in the next tutorial. Bye!
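As a recap, the quality checks committed at the end of the walkthrough — surrogate-key uniqueness and fact-to-dimension connectivity — can be sketched roughly like this (object names follow the walkthrough's naming, but details may differ from the actual scripts):

```sql
-- 1. Surrogate keys in a dimension must be unique. Expect zero rows.
SELECT customer_key, COUNT(*) AS cnt
FROM gold.dim_customers
GROUP BY customer_key
HAVING COUNT(*) > 1;

-- 2. Every fact row must connect to both dimensions. Expect zero rows.
SELECT f.*
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c ON f.customer_key = c.customer_key
LEFT JOIN gold.dim_products  p ON f.product_key  = p.product_key
WHERE c.customer_key IS NULL
   OR p.product_key  IS NULL;
```

Run as a quality gate after each ETL run, both checks should return empty results.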

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • SQL Fundamentals: Querying, Filtering, and Aggregating Data

    SQL Fundamentals: Querying, Filtering, and Aggregating Data

    The text is a tutorial on SQL, a language for managing and querying data. It highlights the fundamental differences between SQL and spreadsheets, emphasizing the organized structure of data in tables with defined schemas and relationships. The tutorial introduces core SQL concepts like statements, clauses (SELECT, FROM, WHERE), and the logical order of operations. It explains how to retrieve and filter data, perform calculations, aggregate results (SUM, COUNT, AVG), and use window functions for more complex data manipulation without altering the data’s structure. The material also covers advanced techniques such as subqueries, Common Table Expressions (CTEs), and joins to combine data from multiple tables. The tutorial emphasizes the importance of Boolean algebra and provides practical exercises to reinforce learning.

    SQL Study Guide

    Review of Core Concepts

    This study guide focuses on the following key areas:

    • BigQuery Data Organization: How data is structured within BigQuery (Projects, Datasets, Tables).
    • SQL Fundamentals: Basic SQL syntax, clauses (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT).
    • Data Types and Schemas: Understanding data types and how they influence operations.
    • Logical Order of Operations: The sequence in which SQL operations are executed.
    • Boolean Algebra: Using logical operators (AND, OR, NOT) and truth tables.
    • Set Operations: Combining data using UNION, INTERSECT, EXCEPT.
    • CASE Statements: Conditional logic for data transformation.
    • Subqueries: Nested queries and their correlation.
    • JOIN Operations: Combining tables (INNER, LEFT, RIGHT, FULL OUTER).
    • GROUP BY and Aggregations: Summarizing data using aggregate functions (SUM, AVG, COUNT, MIN, MAX).
    • HAVING Clause: Filtering aggregated data.
    • Window Functions: Performing calculations across rows without changing the table’s structure (OVER, PARTITION BY, ORDER BY, ROWS BETWEEN).
    • Numbering Functions: Ranking and numbering rows (ROW_NUMBER, RANK, DENSE_RANK, NTILE).
    • Date and Time Functions: Extracting and manipulating date and time components.
    • Common Table Expressions (CTEs): Defining temporary result sets for complex queries.

    Quiz

    Answer each question in 2-3 sentences.

    1. Explain the relationship between projects, datasets, and tables in BigQuery.
    2. What is a SQL clause and can you provide three examples?
    3. Why is it important to understand data types when working with SQL?
    4. Describe the logical order of operations in SQL.
    5. Explain the purpose of Boolean algebra in SQL.
    6. Describe the difference between UNION, INTERSECT, and EXCEPT set operators.
    7. What is a CASE statement, and how is it used in SQL?
    8. Explain the difference between correlated and uncorrelated subqueries.
    9. Compare and contrast INNER JOIN, LEFT JOIN, and FULL OUTER JOIN.
    10. Explain the fundamental difference between GROUP BY aggregations and WINDOW functions.

    Quiz Answer Key

    1. BigQuery organizes data hierarchically, with projects acting as top-level containers, datasets serving as folders for tables within a project, and tables storing the actual data in rows and columns. Datasets organize tables, while projects organize datasets, offering a structured way to manage and access data.
    2. A SQL clause is a building block that makes up a complete SQL statement, defining specific actions or conditions. Examples include the SELECT clause to choose columns, the FROM clause to specify the table, and the WHERE clause to filter rows.
    3. Understanding data types is crucial because it dictates the operations that can be performed on a column, determines how data is stored and manipulated, and helps avoid errors and ensure accurate results.
    4. The logical order of operations determines the sequence in which SQL clauses are executed, starting with FROM, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT, impacting the query’s outcome.
    5. Boolean algebra allows for complex filtering and conditional logic within WHERE clauses using AND, OR, and NOT operators to specify precise conditions for row selection based on truth values.
    6. UNION combines the results of two or more queries into a single result set, INTERSECT returns only the rows that are common to all input queries, and EXCEPT returns the rows from the first query that are not present in the second query.
    7. A CASE statement allows for conditional logic within a SQL query, enabling you to define different outputs based on specified conditions, similar to an “if-then-else” structure.
    8. A correlated subquery depends on the outer query, executing once for each row processed, while an uncorrelated subquery is independent and executes only once, providing a constant value to the outer query.
    9. INNER JOIN returns only matching rows from both tables, LEFT JOIN returns all rows from the left table and matching rows from the right, filling in NULL for non-matches, while FULL OUTER JOIN returns all rows from both tables, filling in NULL where there are no matches.
    10. GROUP BY aggregations collapse multiple rows into a single row based on grouped values, while window functions perform calculations across a set of table rows that are related to the current row without collapsing or grouping rows.
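    The contrast in answer 10 can be made concrete with a short example against a hypothetical `orders` table:

    ```sql
    -- GROUP BY collapses rows: one output row per customer.
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id;

    -- A window function keeps every row and adds the aggregate alongside it.
    SELECT customer_id, order_id, amount,
           SUM(amount) OVER (PARTITION BY customer_id) AS total_amount
    FROM orders;
    ```

    The first query returns one row per customer; the second returns one row per order, each carrying its customer's total.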

    Essay Questions

    1. Discuss the importance of understanding the logical order of operations in SQL when writing complex queries. Provide examples of how misunderstanding this order can lead to unexpected results.
    2. Explain the different types of JOIN operations available in SQL, providing scenarios in which each type would be most appropriate. Illustrate with specific examples related to the course material.
    3. Describe the use of window functions in SQL. Include the purpose of PARTITION BY and ORDER BY. Explain some practical applications of these functions, emphasizing their ability to perform complex calculations without altering the structure of the table.
    4. Discuss the use of Common Table Expressions (CTEs) in SQL. How do they improve the readability and maintainability of complex queries? Provide an example of a query that benefits from the use of CTEs.
    5. Develop a SQL query using different levels of aggregation. Explain the query and its purpose.

    Glossary of Key Terms

    • Project (BigQuery): A top-level container for datasets and resources in BigQuery.
    • Dataset (BigQuery): A collection of tables within a BigQuery project, similar to a folder.
    • Table (SQL): A structured collection of data organized in rows and columns.
    • Schema (SQL): The structure of a table, including column names and data types.
    • Clause (SQL): A component of a SQL statement that performs a specific action (e.g., SELECT, FROM, WHERE).
    • Data Type (SQL): The type of data that a column can hold (e.g., INTEGER, VARCHAR, DATE).
    • Logical Order of Operations (SQL): The sequence in which SQL clauses are executed (FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT).
    • Boolean Algebra: A system of logic dealing with true and false values, used in SQL for conditional filtering.
    • Set Operations (SQL): Operations that combine or compare result sets from multiple queries (UNION, INTERSECT, EXCEPT).
    • CASE Statement (SQL): A conditional expression that allows for different outputs based on specified conditions.
    • Subquery (SQL): A query nested inside another query.
    • Correlated Subquery (SQL): A subquery that depends on the outer query for its values.
    • Uncorrelated Subquery (SQL): A subquery that does not depend on the outer query.
    • JOIN (SQL): An operation that combines rows from two or more tables based on a related column.
    • INNER JOIN (SQL): Returns only matching rows from both tables.
    • LEFT JOIN (SQL): Returns all rows from the left table and matching rows from the right table.
    • RIGHT JOIN (SQL): Returns all rows from the right table and matching rows from the left table.
    • FULL OUTER JOIN (SQL): Returns all rows from both tables, matching or not.
    • GROUP BY (SQL): A clause that groups rows with the same values in specified columns.
    • Aggregation (SQL): A function that summarizes data (e.g., SUM, AVG, COUNT, MIN, MAX).
    • HAVING (SQL): A clause that filters aggregated data.
    • Window Function (SQL): A function that performs a calculation across a set of table rows that are related to the current row.
    • OVER (SQL): A clause that specifies the window for a window function.
    • PARTITION BY (SQL): A clause that divides the rows into partitions for window functions.
    • ORDER BY (SQL): A clause that specifies the order of rows within a window function.
    • ROWS BETWEEN (SQL): A clause that defines the boundaries of a window.
    • Numbering Functions (SQL): Window functions that assign numbers to rows based on specified criteria (ROW_NUMBER, RANK, DENSE_RANK, NTILE).
    • ROW_NUMBER() (SQL): Assigns a unique sequential integer to each row within a partition.
    • RANK() (SQL): Assigns a rank to each row within a partition based on the order of the rows. Rows with equal values receive the same rank, and the next rank is skipped.
    • DENSE_RANK() (SQL): Similar to RANK(), but assigns consecutive ranks without skipping.
    • NTILE(n) (SQL): Divides the rows within a partition into ‘n’ approximately equal groups, assigning a bucket number to each row.
    • Common Table Expression (CTE): A named temporary result set defined within a SELECT, INSERT, UPDATE, or DELETE statement.
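    The last glossary entry, the CTE, is easiest to see in an example (hypothetical `orders` table):

    ```sql
    -- A CTE names an intermediate result so the outer query stays readable:
    -- compute per-customer totals once, then filter and sort on them.
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total_amount
    FROM customer_totals
    WHERE total_amount > 1000
    ORDER BY total_amount DESC;
    ```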

    SQL and BigQuery: A Comprehensive Guide


    Briefing Document: SQL and BigQuery Fundamentals

    Overview:

    This document summarizes key concepts and functionalities of SQL, specifically within the context of BigQuery. The material covers data organization, query structure, data manipulation, and advanced techniques like window functions and common table expressions. The focus is on understanding the logical order of operations within SQL queries and using this understanding to write efficient and effective code.

    1. Data Organization in BigQuery:

    • Tables: Data is stored in tables, which consist of rows and columns, similar to spreadsheets.
    • “Data in BigQuery and in SQL in general exists in the form of tables and a table looks just like this… it is a collection of rows and columns and it is quite similar to a spreadsheet…”
    • Datasets: Tables are organized into datasets, analogous to folders in a file system.
    • “In order to organize our tables we use data sets… a data set is just that it’s a collection of tables and it’s similar to how a folder works in a file system.”
    • Projects: Datasets belong to projects. BigQuery allows querying data from other projects, including public datasets.
    • “In BigQuery each data set belongs to a project… in BigQuery I’m not limited to working with data that lives in my project; I could also, from within my project, query data that lives in another project. For example, bigquery-public-data is a project that is not mine…”

    2. Basic SQL Query Structure:

    • Statements: A complete SQL instruction, defining data retrieval and processing.
    • “This is a SQL statement it is like a complete sentence in the SQL language. The statement defines where we want to get our data from and how we want to receive these data including any processing that we want to apply to it…”
    • Clauses: Building blocks of SQL statements (e.g., SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT).
    • “The statement is made up of building blocks which we call clauses, and in this statement we have a clause for every line… the clauses that we see here are SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT…”
    • Importance of Data Types: Columns have defined data types, which dictate the operations that can be performed on them. Unlike spreadsheets, SQL tables can be explicitly connected with each other.
    • “You create a table and when creating that table you define the schema the schema is the list of columns and their names and their data types you then insert data into this table and finally you have a way to define how the tables are connected with each other…”

    3. Key SQL Concepts:

    • Cost Consideration: BigQuery charges based on the amount of data scanned by a query. Monitoring query size is crucial.
    • “This query will process 1 kilobyte when run… this is very important because here BigQuery is telling you how much data will be scanned in order to give you the results of this query… the amount of data scanned by the query is the primary determinant of BigQuery costs.”
    • Arithmetic Operations: SQL supports combining columns and constants using arithmetic operators and functions.
    • “We are able to combine columns and constants with any sort of arithmetic operations. Another very powerful thing that SQL can do is to apply functions and a function is a prepackaged piece of logic that you can apply to our data…”
    • Aliases: Using aliases (AS) to rename columns or tables for clarity and brevity.
    • Boolean Algebra in WHERE Clause: The WHERE clause uses Boolean logic (AND, OR, NOT) to filter rows based on conditions. Truth tables help understand operator behavior.
    • “The way that these logical statements work is through something called Boolean algebra which is an essential theory for working with SQL… though the name may sound a bit scary it is really easy to understand the fundamentals of Boolean algebra now…”
    • Set Operators (UNION, INTERSECT, EXCEPT): Combining the results of multiple queries using set operations. UNION combines rows, INTERSECT returns common rows, and EXCEPT returns rows present in the first table but not the second. UNION DISTINCT removes duplicate rows, while UNION ALL keeps them.
    • “The reason this command is called UNION, and not ‘stack’ or something else, is that this is set terminology… this comes from the mathematical theory of sets… and unioning means combining the values of two sets…”
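
A small sketch of the three set operators, run through Python's built-in sqlite3 module for convenience (the table names are invented for illustration). Note that SQLite's bare UNION behaves like BigQuery's UNION DISTINCT, whereas BigQuery requires you to spell out DISTINCT or ALL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    CREATE TABLE t2 (x INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (2), (3), (4);
""")
# UNION deduplicates; UNION ALL keeps every row from both inputs.
union_distinct = [r[0] for r in conn.execute(
    "SELECT x FROM t1 UNION SELECT x FROM t2 ORDER BY x")]
union_all = [r[0] for r in conn.execute(
    "SELECT x FROM t1 UNION ALL SELECT x FROM t2 ORDER BY x")]
# INTERSECT keeps common rows; EXCEPT keeps rows only in the first table.
intersect = [r[0] for r in conn.execute(
    "SELECT x FROM t1 INTERSECT SELECT x FROM t2 ORDER BY x")]
except_ = [r[0] for r in conn.execute(
    "SELECT x FROM t1 EXCEPT SELECT x FROM t2 ORDER BY x")]
print(union_distinct, union_all, intersect, except_)
```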

    4. Advanced SQL Techniques:

    • CASE WHEN Statements: Creating conditional logic to assign values based on specified conditions.
    • “When this condition is true we want to return the value low which is a string a piece of text that says low… all of this that you see here this is the case Clause right or the case statement and all of this is basically defining a new column in my table…”
    • Subqueries: Embedding queries within other queries to perform complex filtering or calculations. Correlated subqueries are slower as they need to be recomputed for each row.
    • “SQL solves this query first, gets the result, and then plugs that result back into the original query to get the data we need… on the right we have something that’s called a correlated subquery, and on the left we define this as an uncorrelated subquery…”
    • Common Table Expressions (CTEs): Defining temporary named result sets (tables) within a query for modularity and readability.
    • JOIN Operations: Combining data from multiple tables based on related columns. Types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
    • “A full outer join is like an inner join plus a left join plus a right join…”.
    • GROUP BY and Aggregation: Summarizing data by grouping rows based on one or more columns and applying aggregate functions (e.g., SUM, AVG, COUNT, MIN, MAX). The HAVING clause filters aggregated results.
    • “Having you are free to write filters on aggregated values regardless of the columns that you are selecting…”.
    • Window Functions: Performing calculations across a set of rows that are related to the current row without altering the table structure. They use the OVER() clause to define the window.
    • “Window functions allow us to do computations and aggregations on multiple rows; in that sense they are similar to what we have seen with aggregations and GROUP BY. The fundamental difference between grouping and window functions is that grouping fundamentally alters the structure of the table…”
    • Numbering Functions (ROW_NUMBER, DENSE_RANK, RANK): Assigning sequential numbers or ranks to rows based on specified criteria.
    • “Numbering functions are functions that we use in order to number the rows in our data according to our needs. There are several numbering functions, but the three most important ones are without any doubt ROW_NUMBER, DENSE_RANK and RANK…”
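
These pieces combine naturally: a window function ranks rows inside a CTE, and the outer query filters on the rank. Below is a sketch of the "fastest runner per race" pattern, using sqlite3 with invented data (the SQL itself would run unchanged in BigQuery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (race TEXT, runner TEXT, seconds REAL);
    INSERT INTO results VALUES
        ('100m', 'ann', 11.2), ('100m', 'bob', 10.9), ('100m', 'cat', 11.5),
        ('200m', 'ann', 22.4), ('200m', 'bob', 23.1);
""")
# Rank within each race (fastest first) inside a CTE, then keep rank 1.
winners = conn.execute("""
    WITH ranked AS (
        SELECT race, runner,
               RANK() OVER (PARTITION BY race ORDER BY seconds) AS pos
        FROM results
    )
    SELECT race, runner FROM ranked WHERE pos = 1 ORDER BY race
""").fetchall()
print(winners)
```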

    5. Logical Order of SQL Operations:

    The excerpts emphasize the importance of understanding the order in which SQL operations are performed. This order dictates which operations can “see” the results of previous operations. The general order is:

    1. FROM (Source data)
    2. WHERE (Filter rows)
    3. GROUP BY (Aggregate into groups)
    4. Aggregate Functions (Calculate aggregations within groups)
    5. HAVING (Filter aggregated groups)
    6. Window Functions (Calculate windowed aggregates)
    7. SELECT (Choose columns and apply aliases)
    8. DISTINCT (Remove duplicate rows)
    9. UNION/INTERSECT/EXCEPT (Combine result sets)
    10. ORDER BY (Sort results)
    11. LIMIT (Restrict number of rows)
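
The order above can be observed directly in a query that uses several of these steps. A sketch with sqlite3 and an invented orders table: WHERE prunes rows before aggregation, HAVING filters the aggregated groups, and ORDER BY runs near the end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('a', 50), ('a', 70), ('b', 20), ('b', 10), ('c', 200);
""")
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 15          -- step 2: filter rows first
    GROUP BY customer          -- step 3: then aggregate
    HAVING SUM(amount) > 60    -- step 5: then filter groups
    ORDER BY customer          -- step 10: sort last
""").fetchall()
print(rows)
```

Customer b survives the WHERE step with one row (20) but is dropped by HAVING, because its aggregated total never exceeds 60.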

    6. PostgreSQL Quirk

    Integer Division: When dividing two integers, Postgres assumes you are doing integer division and returns an integer as well. To avoid this, at least one operand needs to be a floating-point number.
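
SQLite (used here through Python's sqlite3 for a quick check) shows the same behavior, so this sketch stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
int_div = conn.execute("SELECT 7 / 2").fetchone()[0]             # both operands integer
float_div = conn.execute("SELECT 7 / 2.0").fetchone()[0]         # one float literal
cast_div = conn.execute("SELECT CAST(7 AS REAL) / 2").fetchone()[0]  # explicit cast
print(int_div, float_div, cast_div)
```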

    Conclusion:

    The provided text excerpts offer a comprehensive overview of SQL fundamentals and advanced techniques within BigQuery. A strong understanding of data organization, query structure, the logical order of operations, and the various functions and clauses available is crucial for writing efficient and effective SQL code. Mastering these concepts will enable users to extract valuable insights from their data and solve complex analytical problems.

    BigQuery and SQL: Data Management, Queries, and Functions

    FAQ on SQL and Data Management with BigQuery

    1. How is data organized in BigQuery and SQL in general?

    Data in BigQuery is organized in a hierarchical structure. At the lowest level, data resides in tables. Tables are collections of rows and columns, similar to spreadsheets. To organize tables, datasets are used, which are collections of tables, analogous to folders in a file system. Finally, datasets belong to projects, providing a top-level organizational unit. BigQuery also allows querying data from public projects, expanding access beyond a single project.

    2. How does BigQuery handle costs and data limits?

    BigQuery’s costs are primarily determined by the amount of data scanned by a query. Within the sandbox program, users can scan up to one terabyte of data each month for free. It’s important to check the amount of data that a query will process before running it, especially with large tables, to avoid unexpected charges. The query interface displays this information before execution.

    3. What are the fundamental differences between SQL tables and spreadsheets?

    While both spreadsheets and SQL tables store data in rows and columns, key differences exist. Spreadsheets are typically disconnected, whereas SQL provides mechanisms to define connections between tables. This allows relating data across multiple tables through defined schemas, specifying column names and data types. SQL also enforces a logical order of operations, which dictates the order in which the various parts of a query are executed.

    4. How are calculations and functions used in SQL queries?

    SQL allows performing calculations using columns and constants. Common arithmetic operations are supported, and functions, pre-packaged logic, can be applied to data. The order of operations in SQL follows standard arithmetic rules: brackets first, then functions, multiplication and division, and finally addition and subtraction.
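
A quick illustration of that precedence, run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Brackets first, then functions, then * and /, then + and -.
row = conn.execute("SELECT UPPER('abc'), (2 + 3) * 4, 2 + 3 * 4").fetchone()
print(row)
```

The brackets force the addition first (giving 20), while without them the multiplication wins (giving 14).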

    5. What are Clauses in SQL, and how are they used?

    SQL statements are constructed from building blocks known as Clauses. Key clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT. Clauses define where the data comes from, how it should be processed, and how the results should be presented. The clauses are assembled to form a complete SQL statement. The order in which you write the clauses is less important than the logical order in which they are executed, which is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY and LIMIT.

    6. How do the WHERE clause and Boolean algebra work together to filter data in SQL?

    The WHERE clause is used to filter rows based on logical conditions. These conditions rely on Boolean algebra, which uses operators like NOT, AND, and OR to create complex expressions. Understanding the order of operations within Boolean algebra is crucial for writing effective WHERE clauses. NOT is evaluated first, then AND, and finally OR.
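
That precedence changes which rows survive the filter. A sketch with sqlite3 and an invented characters table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chars (name TEXT, class TEXT, level INTEGER);
    INSERT INTO chars VALUES
        ('a', 'mage', 5), ('b', 'mage', 1), ('c', 'rogue', 5), ('d', 'rogue', 1);
""")
# AND binds before OR, so this keeps every mage plus only high-level rogues...
loose = conn.execute(
    "SELECT name FROM chars WHERE class = 'mage' OR class = 'rogue' AND level >= 5 ORDER BY name"
).fetchall()
# ...while brackets force the OR first, applying the level filter to both classes.
strict = conn.execute(
    "SELECT name FROM chars WHERE (class = 'mage' OR class = 'rogue') AND level >= 5 ORDER BY name"
).fetchall()
print(loose, strict)
```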

    7. What are set operations in SQL, and how are they used?

    SQL provides set operations like UNION, INTERSECT, and EXCEPT to combine or compare the results of multiple queries. UNION combines rows from two or more tables, with UNION DISTINCT removing duplicate rows and UNION ALL keeping all rows, including duplicates. INTERSECT DISTINCT returns only the rows that are common to both tables. EXCEPT DISTINCT returns rows from the first table that are not present in the second table.

    8. How can window functions be used to perform calculations across rows without altering the structure of the table?

    Window functions perform calculations across a set of table rows related to the current row, without grouping the rows like GROUP BY. They are defined using the OVER() clause, which specifies the window of rows used for the calculation. Window functions can perform aggregations, ordering, and numbering within the defined window, adding insights without collapsing the table’s structure. Numbering functions include ROW_NUMBER, RANK, and DENSE_RANK. They are often used in conjunction with PARTITION BY and ORDER BY, which divide the data into logical partitions in which to number the results. For instance, ranking functions used with PARTITION BY and ORDER BY can assign a rank within each race, ordered fastest to slowest; the ranked rows can then be filtered further with a Common Table Expression (CTE).

    SQL Data Types and Schemas

    In SQL, a data model is defined by the name of columns and the data type that each column will contain.

    • Definition: The schema of a table includes the name of each column in the table and the data type of each column. The data type of a column defines the type of operations that can be done to the column.
    • Examples of data types:
    • Integer: A whole number.
    • Float: A floating point number.
    • String: A piece of text.
    • Boolean: A value that is either true or false.
    • Timestamp: A value that represents a specific point in time.
    • Interval: A data type that specifies a certain span of time.
    • Data types and operations: Knowing the data types of columns is important because it allows you to know which operations can be applied. For example, you can perform mathematical operations such as multiplication or division on integers or floats. For strings, you can change the string to uppercase or lowercase. For timestamps, you can subtract a certain amount of time from that moment.
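
A small sketch tying data types to the operations they allow, using sqlite3 with an invented table (SQLite stores booleans as 0/1, so NOT on a boolean comes back as an integer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (n INTEGER, price REAL, name TEXT, active BOOLEAN)")
conn.execute("INSERT INTO demo VALUES (7, 2.5, 'gandalf', 1)")
row = conn.execute("""
    SELECT n * 2,        -- arithmetic on an integer
           price / 2,    -- arithmetic on a float
           UPPER(name),  -- string operation
           NOT active    -- boolean operation
    FROM demo
""").fetchone()
print(row)
```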

    SQL Tables: Structure, Schema, and Operations

    In SQL, data exists in the form of tables. Here’s what you need to know about SQL tables:

    • Structure: A table is a collection of rows and columns, similar to a spreadsheet.
    • Each row represents an entry, and each column represents an attribute of that entry. For example, in a table of fantasy characters, each row may represent a character, and each column may represent information about them such as their ID, name, class, or level.
    • Schema: Each SQL table has a schema that defines the columns of the table and the data type of each column.
    • The schema is assumed as a given when working in SQL and is assumed not to change over time.
    • Organization: In SQL, tables are organized into data sets.
    • A data set is a collection of tables and is similar to a folder in a file system.
    • In BigQuery, each data set belongs to a project.
    • Table ID: The table ID represents the full address of the table.
    • The address is made up of three components: the ID of the project, the data set that contains the table, and the name of the table.
    • Connections between tables: SQL allows you to define connections between tables.
    • Tables can be connected with each other through arrows. These connections indicate that one of the tables contains a column with the same data as a column in another table, and that the tables can be joined using those columns to combine data.
    • Table operations and clauses:
    • FROM: indicates the table from which to retrieve data.
    • SELECT: specifies the columns to retrieve from the table.
    • WHERE: filters rows based on specified conditions.
    • DISTINCT: removes duplicate rows from the result set.
    • UNION: stacks the results from multiple tables.
    • ORDER BY: sorts the result set based on specified columns.
    • LIMIT: limits the number of rows returned by the query.
    • JOIN: combines rows from two or more tables based on a related column.
    • GROUP BY: groups rows with the same values in specified columns into summary rows.

    SQL Statements: Structure, Clauses, and Operations

    Here’s what the sources say about SQL statements:

    General Information

    • In SQL, a statement is like a complete sentence that defines where to get data and how to receive it, including any processing to apply.
    • A statement is made up of building blocks called clauses.
    • Query statements allow for retrieving, analyzing, and transforming data.
    • In this course, the focus is exclusively on query statements.

    Components and Structure

    • Clauses are assembled to build statements.
    • There is a specific order to writing clauses; writing them in the wrong order will result in an error.
    • Common clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.

    Order of Execution

    • The order in which clauses are written (lexical order) is not the same as the order in which they are executed (logical order).
    • The logical order of execution is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT.
    • The actual order of execution (effective order) may differ from the logical order due to optimizations made by the SQL engine. The course focuses on mastering the lexical order and the logical order.

    Clauses and their Function

    • FROM: Specifies the table from which to retrieve the data. It is always the first component in the logical order of operations because you need to source the data before you can work with it.
    • SELECT: Specifies which columns of the table to retrieve. It allows you to get any columns from the table in any order. You can also use it to rename columns, define constant columns, combine columns in calculations, and apply functions.
    • WHERE: Filters rows based on specified conditions. It follows right after the FROM clause in the logical order. The WHERE clause can reference columns of the tables, operations on columns, and combinations between columns.
    • DISTINCT: removes duplicate rows from the result set.
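
The SELECT capabilities listed above (aliases, constant columns, calculations, functions) in one small sqlite3 sketch with an invented items table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
conn.execute("INSERT INTO items VALUES ('sword', 100.0)")
row = conn.execute("""
    SELECT name AS item_name,            -- rename with an alias
           'EUR' AS currency,            -- constant column
           price * 2 AS gross,           -- calculation on a column
           ROUND(price / 3, 2) AS third  -- function applied to a column
    FROM items
""").fetchone()
print(row)
```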

    Combining statements

    • UNION allows you to stack the results from two or more tables. In BigQuery, you must specify UNION ALL to include duplicate rows or UNION DISTINCT to only include unique rows.
    • INTERSECT returns only the rows that are shared between two tables.
    • EXCEPT returns all of the elements in one table except those that are shared with another table.
    • For UNION, INTERSECT, and EXCEPT, the tables must have the same number of columns, and the columns must have the same data types.

    Subqueries

    • Subqueries are nested queries used to perform complex tasks that cannot be done with a single query.
    • A subquery is a piece of SQL logic that returns a table.
    • Subqueries can be used in the FROM clause instead of a table name.
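
A minimal sketch of a subquery in the FROM clause, via sqlite3 with invented data; the inner query's result is treated exactly like a table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('eu', 10), ('eu', 20), ('us', 50);
""")
rows = conn.execute("""
    SELECT region, total
    FROM (SELECT region, SUM(amount) AS total
          FROM sales
          GROUP BY region) AS per_region   -- the subquery acts as a table
    WHERE total > 25
    ORDER BY region
""").fetchall()
print(rows)
```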

    Common Table Expressions (CTEs)

    • CTEs are virtual tables defined within a query that can be used to simplify complex queries and improve readability.
    • CTEs are defined using the WITH keyword, followed by the name of the table and the query that defines it.
    • CTEs can be used to build data pipelines within SQL code.

    SQL Logical Order of Operations

    Here’s what the sources say about the logical order of operations in SQL:

    Basics

    • The order in which clauses are written (lexical order) is not the order in which they are executed (logical order).
    • Understanding the logical order is crucial for accelerating learning SQL.
    • The logical order helps in building a powerful mental model of SQL that allows tackling complex and tricky problems.

    The Logical Order

    • The logical order of execution is: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT.
    • The JOIN clause is not really separate from the FROM clause; they are the same component in the logical order of operations.

    Rules for Understanding the Schema

    • Operations are executed sequentially from left to right.
    • Each operation can only use data that was produced by operations that came before it.
    • Each operation cannot know anything about data that is produced by operations that follow it.

    Implications of the Logical Order

    • FROM is the very first component in the logical order of operations because the data must be sourced before it can be processed. The FROM clause specifies the table from which to retrieve the data. The JOIN clause is part of this step, as it defines how tables are combined to form the data source.
    • WHERE Clause follows right after the FROM Clause. After sourcing the data, the next logical step is to filter the rows that are not needed. The WHERE clause drops all the rows that are not needed, so the table becomes smaller and easier to deal with.
    • GROUP BY fundamentally alters the structure of the table. The GROUP BY operation compresses down the values; in the grouping field, a single row will appear for each distinct value, and in the aggregate field, the values will be compressed or squished down to a single value as well.
    • SELECT determines which columns to retrieve from the table. The SELECT clause is where new columns are defined.
    • ORDER BY sorts the result of the query. Because the ordering occurs so late in the process, SQL knows the final list of rows that will be included in the results, which is the right moment to order those rows.
    • LIMIT is the very last operation. After all the logic of the query is executed and all data is computed, the LIMIT clause restricts the number of rows that are output.

    Window Functions and the Logical Order

    • Window functions operate on the result of the GROUP BY clause, if present; otherwise, they operate on the data after the WHERE filter is applied.
    • After applying the window function, the SELECT clause is used to choose which columns to show and to label them.

    Common Errors

    • A common error is to try to use LIMIT to make a query cheaper. The LIMIT clause does not reduce the amount of data that is scanned; it only limits the number of rows that are returned.
    • Another common error is to violate the logical order of operations. For example, you cannot use a column alias defined in the SELECT clause in the WHERE clause because the WHERE clause is executed before the SELECT clause.
    • In Postgres, you cannot use the labels that you assign to aggregations in the HAVING clause.
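
The alias error has a standard workaround: define the alias in a CTE (or subquery), which runs logically before the outer WHERE. Sketched with sqlite3 and invented data (note that SQLite itself tolerates aliases in WHERE as a nonstandard extension; BigQuery and Postgres do not):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (n INTEGER)")
conn.executemany("INSERT INTO nums VALUES (?)", [(1,), (2,), (3,)])
# The CTE defines the alias first, so the outer WHERE can see it.
rows = conn.execute("""
    WITH d AS (SELECT n, n * 2 AS doubled FROM nums)
    SELECT n, doubled FROM d WHERE doubled > 4 ORDER BY n
""").fetchall()
print(rows)
```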

    Boolean Algebra: Concepts, Operators, and SQL Application

    Here’s what the sources say about Boolean algebra:

    Basics

    • Boolean algebra is essential for working with SQL and other programming languages.
    • It is fundamental to how computers work.
    • Although the name may sound intimidating, its fundamentals are easy to understand.

    Elements

    • In Boolean algebra, there are only two elements: true and false.
    • A Boolean field in SQL is a column that can only have these two values.

    Operators

    • Boolean algebra has operators that transform elements.
    • The three most important operators are NOT, AND, and OR.

    Operations and Truth Tables

    • In Boolean algebra, operations combine operators and elements and return elements.
    • To understand how a Boolean operator works, you have to look at its truth table.

    NOT Operator

    • The NOT operator works on a single element, such as NOT TRUE or NOT FALSE.
    • The negation of p is the opposite value.
    • NOT TRUE is FALSE
    • NOT FALSE is TRUE

    AND Operator

    • The AND operator connects two elements, such as TRUE AND FALSE.
    • If both elements are true, then the AND operator will return true; otherwise, it returns false.

    OR Operator

    • The OR operator combines two elements.
    • If at least one of the two elements is true, then the OR operator returns true; only if both elements are false does it return false.

    Order of Operations

    • There is an agreed-upon order of operations that helps solve complex expressions.
    • The order of operations is:
    1. Brackets (solve the innermost brackets first)
    2. NOT
    3. AND
    4. OR

    Application in SQL

    • A complex logical statement that is plugged into the WHERE filter isolates only certain rows.
    • SQL converts statements in the WHERE filter to true or false, using values from a row.
    • SQL uses Boolean algebra rules to compute a final result, which is either true or false.
    • If the result computes as true for the row, then the row is kept; otherwise, the row is discarded.

    Example

    To solve a complex expression, such as NOT (TRUE OR FALSE) AND (FALSE OR TRUE), proceed step by step:

    1. Solve the innermost brackets:
    • TRUE OR FALSE is TRUE
    • FALSE OR TRUE is TRUE
    The expression becomes: NOT (TRUE) AND (TRUE)
    2. Solve the NOT:
    • NOT (TRUE) is FALSE
    The expression becomes: FALSE AND TRUE
    3. Solve the AND:
    • FALSE AND TRUE is FALSE

    The final result is FALSE.
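
Python shares this NOT > AND > OR precedence, so the worked example can be checked mechanically:

```python
# Same expression as the worked example: brackets first, then NOT, then AND.
result = not (True or False) and (False or True)
print(result)
```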
    Intuitive SQL For Data Analytics – Tutorial
    Data Analytics FULL Course for Beginners to Pro in 29 HOURS – 2025 Edition

    The Original Text

    learn SQL for analytics Vlad is a data engineer and in this course he covers both the theory and the practice so you can confidently solve hard SQL challenges on your own no previous experience required and you’ll do everything in your browser using big query hi everyone my name is Vlad and I’m a date engineer welcome to intuitive SQL for analytics this here is the main web page for the course you will find it in the video description and this will get updated over time with links and resources so be sure to bookmark it now the goal of this course is to quickly enable you to use SQL to analyze and manipulate data this is arguably the most important use case for SQL and the Practical objective is that by the end of this course you should be able to confidently solve hard SQL problems of the kind that are suggested during data interviews the course assumes no previous knowledge of SQL or programming although it will be helpful if you’ve work with spreadsheets such as Microsoft Excel or Google Sheets because there’s a lot of analogies between manipulating data in spreadsheets and doing it in SQL and I also like to use spreadsheets to explain SQL Concepts now there are two parts to this course theory and practice the theory part is a series of short and sweet explainers about the fundamental concepts in SQL and for this part we will use Google bigquery bigquery which you can see here is a Google service that allows you to upload your own data and run SQL on top of it so in the course I will teach you how to do that and how to do it for free you won’t have to to spend anything and then we will load our data and we will run SQL code and besides this there will be drawings and we will also be working with spreadsheets and anything it takes to make the SQL Concepts as simple and understandable as possible the practice part involves doing SQL exercises and for this purpose I recommend this website postest SQL exercises this is a free and open-source website where you 
will find plenty of exercises and you will be able to run SQL code to solve these exercises check your answer and then see a suggested way to do it so I will encourage you to go here and attempt to solve these exercises on your own however I have also solved 42 of these exercises the most important ones and I have filmed explainers where I solve the exercise break it apart and then connect it to the concepts of the course so after you’ve attempted the exercise you will be able to see me solving it and connect it to the rest of the course so how should you take this course there are actually many ways to do it and you’re free to choose the one that works best if you are a total beginner I recommend doing the following you should watch the theory lectures and try to understand everything and then once you are ready you should attempt to do the exercises on your own on the exercise uh website that I’ve shown you here and if you get stuck or after you’re done you can Watch How I solved the exercise but like I said this is just a suggestion and uh you can combine theory and practice as you wish and for example a more aggressive way of doing this course would be to jump straight into the exercises and try to do them and every time that you are stuck you can actually go to my video and see how I solved the exercise and then if you struggle to understand the solution that means that maybe there’s a theoretical Gap and then you can go to the theory and see how the fundamental concepts work so feel free to experiment and find the way that works best for you now let us take a quick look at the syllabus for the course so one uh getting started this is a super short explainer on what SQL actually is and then I teach you how to set up bigquery the Google service where we will load our data and run SQL for the theory part the second uh chapter writing your first query so here I explained to you how big query works and how you can use it um and how you are able to take your own 
data and load it in big query so you can run SQL on top of it and at the end of it we finally run our first SQL query chapter 3 is about exploring some ESS IAL SQL Concepts so this is a short explainer of how data is organized in SQL how the SQL statement Works meaning how we write code in SQL and here is actually the most important concept of the whole course the order of SQL operations this is something that is not usually taught properly and a lot of beginners Miss and this causes a lot of trouble when you’re you’re trying to work with SQL so once you learn this from the start you will be empowered to progress much faster in your SQL knowledge and then finally we get into the meat of the course this is where we learn all the different components in SQL how they work and how to combine them together so this happens in a few phases in the first phase we look at the basic components of SQL so these are uh there’s a few of them uh there’s select and from uh there’s learning how to transform columns the wear filter the distinct Union order by limit and then finally we see how to do simple aggregations at the end of this part you will be empowered to do the first batch of exercises um don’t worry about the fact that there’s no links yet I will I will add them but this is basically involves going to this post SQL exercises website and going here and doing this uh first batch of exercises and like I said before after you’ve done the exercises you can watch the video of me also solving them and breaking them down next we take a look at complex queries and this involves learning about subqueries and Common Table expressions and then we look at joining tables so here is where we understand how SQL tables are connected uh with each other and how we can use different types of joints to bring them together and then you are ready for the second batch of exercises which are those that involve joints and subqueries and here there are eight exercises the next step is learning 
about aggregations in SQL so this involves the group bu the having and window functions and then finally you are ready for the final batch of exercises which actually bring together all the concepts that we’ve learned in this course and these are 22 exercises and like before for each exercise you have a video for me solving it and breaking it apart and then finally we have the conclusion in the conclusion we see how we can put all of this knowledge together and then we take a look at how to use this knowledge to actually go out there and solve SQL challenges such as the ones that are done in data interviews and then here you’ll find uh all the resources that are connected to the course so you have the files with our data you have the link to the spreadsheet that we will use the exercises and all the drawings that we will do this will definitely evolve over over time as the course evolves so bookmark this page and keep an eye on it that was that was all you needed to know to get started so I will see you in the course if you are working with SQL or you are planning to work with SQL you’re certainly a great company in the 2023 developer survey by stack Overflow there is a ranking of the most popular Technologies out there if we look at professional developers where we have almost 70,000 responses we can see that SQL is ranked as the third most popular technology SQL is certainly one of the most in demand skills out there not just for developers but for anyone who works with data in any capacity and in this course I’m going to help you learn SQL the way I wish I would have learned it when I started out on my journey since this is a practical course we won’t go too deep into the theory all you need to know for our purposes is that SQL is a language for working with data like most languages SQL has several dialects you may have heard of post SQL or my sqil for example you don’t need to worry about these dialects because they’re all very similar so if you learn SQL in 
any one of the dialects, you'll do well in all the others. In this course we will be working with BigQuery, and thus we will write SQL in the GoogleSQL dialect. Here is the documentation for Google BigQuery, the service that we will use to write SQL code in this course. You can see that BigQuery uses GoogleSQL, a dialect of SQL which is ANSI compliant. ANSI compliant means that GoogleSQL respects the generally recognized standard for creating SQL dialects, and so it is highly compatible with all other common SQL dialects. As you can read here, GoogleSQL supports many types of statements, and statements are the building blocks that we use in order to get work done with SQL. There are several types of statements listed here. For example, query statements allow us to retrieve, analyze, and transform data; data definition language statements allow us to create and modify database objects such as tables and views; whereas data manipulation language statements allow us to update, insert, and delete data from our tables. Now, in this course we focus exclusively on query statements, statements that allow us to retrieve and process data, and the reason for this is that if you're going to start working with BigQuery, you will most likely start with this family of statements. Furthermore, query statements are in a sense the foundation for all other families of statements, so if you understand query statements you'll have no trouble learning the others on your own. Why did I pick BigQuery for this course? I believe that the best way to learn is to load your own data, follow questions that interest you, and play around with your own projects, and BigQuery is a great tool to do just that. First of all, it is free, at least for the purposes of learning and for the purposes of this course. It has a great interface that will give you really good insights into your data, and most importantly, it is really easy to get started: you don't have to install anything on your computer, you don't
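For reference, the three statement families just described might be sketched like this in GoogleSQL (the heroes table and its columns are invented purely for illustration):

```sql
-- Query statement: retrieve and process data (the only family used in this course)
SELECT name, level FROM fantasy.heroes WHERE level > 10;

-- Data definition language (DDL): create and modify objects such as tables and views
CREATE TABLE fantasy.heroes (id INT64, name STRING, level INT64);

-- Data manipulation language (DML): insert, update, and delete data
UPDATE fantasy.heroes SET level = level + 1 WHERE id = 1;
```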
have to deal with complex software; you just sign up for Google Cloud and you're ready to go. And finally, as you will see next, BigQuery gives you many ways to load your own data easily and quickly and get started writing SQL right away. I will now show you how you can sign up for Google Cloud and get started with BigQuery. It all starts with this link, which I will share in the resources: the homepage of Google Cloud. If you don't have an account with Google Cloud, you can go here and select Sign in, and here you need to sign in with your Google account, which you probably have; but if you don't, you can go here and select Create account. I have now signed in with my Google account, which you can see here in the upper right corner, and now I get a button that says Start free, so I'm going to click that. I get taken to this page, and on the right you see that the first time you sign up for Google Cloud you get $300 of free credits so that you can try the services, which is pretty neat. Here I have to enter some extra information about myself, so I will keep it as is, agree to the terms of service, and continue. Finally, I need to do the payment information verification. Unfortunately this is something I need to do even though I'm not going to be charged for the services; it is for Google to be able to verify my identity. I will pick Individual as the account type and insert my address, and finally I need to add a payment method. Again, I need to do this even though I'm not going to pay. I will actually not do it here because I don't intend to sign up, but after you are done you can click Start my free trial and then you should be good to go. Now, your interface may look a bit different, but essentially, after you've signed up for Google Cloud you will need to create a project. A project is a tool that organizes all your work in Google Cloud, and essentially every piece of work that you do in Google Cloud has to happen inside a specific project. Now,
as you can see here, there is a limited quota of projects, but that's not an issue because we will only need one project for this course, and of course creating a new project is totally free. So I will go ahead and give it a name; I don't need any organization, and I will simply click Create. Once that's done, I can go back to the homepage for Google Cloud, and here, as you can see, I can select a project. I find the project that I created before, and once I select it, the rest of the page won't change, but you will see the name of the project in the upper bar here. Now, although I've created this project as an example for you, for the rest of the course you will see me working within this other project, which is the one I had originally. I will now show you how you can avoid paying for Google Cloud services if you don't want to. From the homepage you have the search bar over here; you can write 'billing' and click Payments overview to go to the billing service. Here on the left you will see your billing account, which could be called like this or have another name, and clicking here I can go to Manage billing accounts. Here I can go to the My projects tab, and I see a list of all of my projects in Google Cloud. A project might or might not be connected to a billing account, and if a project is not connected to a billing account, then Google won't be able to charge you for that project. Keep in mind, though, that if you link your project to a billing account and then incur some expenses, removing the billing account afterwards means you will still owe Google Cloud for those expenses. So what I can do here is go to My projects and, under Actions, select Disable billing in case I have a billing account connected. While this is probably the surest way to avoid incurring any charges, you will see that you will be severely limited in what you can do in your project if it is not linked to any billing account. However,
you should still be able to do most of what you need to do in BigQuery, at least for this course, and we can get more insight into how that works by going to the BigQuery pricing table. This page gives us an overview of how pricing works for BigQuery. I will not analyze it in depth, but what you need to know is that when you work with BigQuery you can fundamentally be charged for two things. One is compute pricing, which basically means all the data that BigQuery scans in order to return the results you asked for when you write a query. The other is storage pricing, which is what you pay in order to store your data inside BigQuery. If I click on compute pricing, I go to the pricing table, where you can select the region that best reflects where you are located; I have selected Europe here. As you can see, you are charged $6.25, at the time of this video, for scanning a terabyte of data. However, the first terabyte per month is free, so every month you can write queries that scan one terabyte of data and not pay for them, and as you will see in more detail, this is more than enough for what we will be doing in this course, and also for what you'll be doing on your own to experiment with SQL. If I go back to the top of the page and click on storage pricing, you can see that, again, you can select your region and see several pricing units; but here you can see that the first 10 GB of storage per month is free. So you can put up to 10 gigabytes of data in BigQuery without needing a billing account and without paying for storage, and this is more than enough for our needs in order to learn SQL. In short, BigQuery gives us a pretty generous free allowance to load data and play with it, and we should be fine. However, I do urge you to come back to this page and read it again, because things may have changed since I recorded this video. To summarize: go to the billing service, check out your billing account, and you have
the option to decouple your project from the billing account to avoid incurring any charges, and you should still be able to use BigQuery. As a disclaimer, though, I cannot guarantee that things will work just the same at the time you are watching this video, so be sure to check the documentation, or maybe discuss with Google Cloud support, to avoid incurring any unexpected expenses. Please do your research and be careful in your usage of these services. For this course I have created an imaginary dataset with the help of ChatGPT. The dataset is about a group of fantasy characters, as well as their items and inventories. I then proceeded to load this data into BigQuery, which is our SQL system. I also loaded it into Google Sheets, a spreadsheet system similar to Microsoft Excel; this will allow me to manipulate the data visually and help you develop a strong intuition about SQL operations. I'm going to link a separate video which explains how you can also use ChatGPT to generate imaginary data according to your needs and then load it into Google Sheets or BigQuery, and I will also link the files for this data in the description, which you can use to reproduce it on your side. Next, I will show you how we can load the data for this course into BigQuery. I'm on the homepage of Google Cloud, and in the search bar up here I can write 'BigQuery' and select it, and this takes me to the BigQuery page. There is a panel on the left side that appears if I hover, or it could be fixed, showing several tools that you can use within BigQuery. You can see that we are in the SQL workspace, and this is actually the only tool that we will need for this course, so if you're seeing this panel on the left, I recommend clicking the arrow in the upper left corner to disable it and make more room for yourself. Now I want to draw your attention to the Explorer tab, which shows us where our data is and how it
is organized, so I'm going to expand it here. Data in BigQuery, and in SQL in general, exists in the form of tables, and a table looks just like this: as you can see here, the customers table is a collection of rows and columns, quite similar to a spreadsheet, so this will be familiar to you if you've ever worked with Microsoft Excel, Google Sheets, or any spreadsheet program. Your data actually lives in a table, and you can have as many tables as you need; in BigQuery there could be quite a lot of them, so in order to organize our tables we use datasets. For example, in this case 'my data' is a dataset which contains the tables 'customers' and 'employee data', and a dataset is just that: a collection of tables. It's similar to how a folder works in a file system; it is like a folder for tables. Finally, in BigQuery each dataset belongs to a project, so you can see here that we have two datasets, 'SQL course' and 'my data', and they both belong to this project, 'idelic physics' and so on, which is actually the ID of the project I'm working in right now. The reason the Explorer tab shows the project as well is that in BigQuery I'm not limited to working with data that lives in my project: from within my project I could also query data that lives in another project. For example, bigquery-public-data is a project that is not mine; it's a public project by BigQuery, and if I expand it you will see that it contains a collection of several datasets, which are themselves collections of tables, and I would be able to query those tables as well. But you don't need to worry about that now, because in this course we will only focus on our own data that lives in our own project. So this, in short, is how data is organized in BigQuery. Now, for the purposes of this course, I recommend creating a new dataset so that our tables can be neatly organized, and to do that I can click the three dots next to the
project ID over here and select Create dataset. Here I need to pick a name for the dataset, so I will call it 'fantasy', and I suggest you use the same name, because if you do, the code that I share with you will work immediately. As for the location, you can select the multi-region and choose the region that is closest to you, and finally click Create dataset. Now the dataset 'fantasy' has been created, and if I try to expand it here I will see that it is empty, because I haven't loaded any data yet. The next step is to load our tables. I assume that you have downloaded the zip file with the tables and extracted it on your local computer. We can select the action menu here next to the fantasy dataset and select Create table. As the source I will select Upload, click Browse, and access the files that I have downloaded; I will select the first one, which is the characters table. The file format is CSV, which Google has already detected. Scrolling down, I need to choose a name for my table, so I will call it just like the file, which is 'characters', and, very important, under Schema I need to select auto-detect; we will see what this means in a bit. That is basically all we need, so now I will select Create table, and you will see that the characters table has appeared under the fantasy dataset. If I click on the table and then go to Preview, I should be able to see my data. I will now do the same for the other two tables: again Create table, the source is Upload, the file is inventory, repeat the name, select auto-detect, and I have done the same with the third table. At the end of this exercise the fantasy dataset should have three tables, and you can select them and go to Preview to make sure the data looks as expected. Now our data is fully loaded and we are ready to start querying it within BigQuery. Let's take a look at how the BigQuery interface works. On the left here you can
see the Explorer, which shows all the data that I have access to. To get to a table in BigQuery, you first open the name of the project, then you look at the datasets available within that project, you open a dataset, and finally you see a table such as characters. If I now click on characters, I open the table view, where I find a lot of important information about my table in these tabs over here. Let's look at the first tab, Schema. The Schema tab shows me the structure of my table, which, as we shall see, is very important. The schema is defined essentially by two things: the name of each column in my table, and the data type of each column. Here we see that the characters table contains a few columns, such as id, name, guild, class, and so on, and these columns have different data types. For example, id is an integer, which means it contains whole numbers, whereas name is a string, which means it contains text. As we shall see, the schema is very important because it defines what you can do with the table. Next we have the Details tab, which contains a few things. First of all is the table ID, which represents the full address of the table, and this address is made up of three components. First you have the ID of the project, which, as you can see, is the project in which I'm working; it's the same one you see on the left in the Explorer tab. The next component is the dataset that contains the table, which again you see in the Explorer tab, and finally you have the name of the table. This address is important because it's what we use to reference the table and to get data from it. Then we see a few more things about the table, such as when it was created and when it was last modified, and here we can see the storage information: this table has 15 rows, and on disk it occupies approximately one kilobyte. If you work extensively with BigQuery, this
information will be important for two reasons: number one, it determines how much you pay every month to store this table, and number two, it determines how much you would pay for a query that scans all the data in this table. As we saw in the lecture on BigQuery pricing, these are the two determinants of BigQuery costs. However, for the purposes of this course you don't need to worry about this, because the tables we are working with are so small that they won't put a dent in your free monthly allowance for using BigQuery. Next we have the Preview tab, which is really handy for getting a sense of the data: it shows you a graphical representation of your table, and as you will notice, it looks very similar to a spreadsheet. You can see our columns, the same ones we saw in the Schema tab: id, name, guild, and so on. As you remember, id is an integer column, so it can only contain numbers, and name is a text column. You can also see that this table has 15 rows, and because it's such a small table, all of it fits into this graphical representation; but in the real world you may have tables with millions of rows, and in that case the preview will show you only a small portion of the table, still enough to get a good sense of the data. There are a few more tabs in the table view, such as Lineage, Data profile, and Data quality, but I'm not going to look at them now because they are advanced BigQuery features and you won't need them in this course. Instead, I will run a very basic query on this table, not for the purpose of understanding the query (that will come soon) but to show you what the interface looks like after you run one. I have a very basic query here that will run on my table, and you can see that the interface is telling me how much data this query will process. This is important because it is the main determinant of cost in BigQuery: every query scans a certain amount of data, and you have to
pay for that. But as we saw in the lecture on BigQuery pricing, this table is so small that you could run a million or more of these queries and not exhaust your monthly allowance, so if you see 1 kilobyte, you don't have to worry about it. Now I will click Run, my query will execute, and here I get the query results view, which is the view that appears after you have successfully run a query. We have a few tabs here, and the first one you see is Results, which graphically shows you the table that was returned by your query. As we shall see, every query in SQL runs on a table and returns a table, and just like the Preview tab showed you a graphical view of your table, the Results tab shows you a graphical view of the table that your query has returned. This is really the only tab in the query results view that you will need in this course; the others expose different or more advanced features that we won't look at, but feel free to explore them on your own if you are curious. What's also important in this view is the Save results button over here, which you can use to export the results of your query to several different destinations: Google Drive, local files on your computer in different formats, another BigQuery table, a spreadsheet in Google Sheets, or even your clipboard so you can paste them somewhere else. We shall discuss this in more detail in the lecture on getting data in and out of BigQuery. Finally, if you click on the little keyboard icon up here, you can see a list of shortcuts that you can use in the BigQuery interface; if you end up running a lot of queries and want to be fast, this is a nice way to improve your experience with BigQuery, so be sure to check them out. We are finally ready to write our first query, and in the process we will keep exploring the fantastic BigQuery interface. One way to get started is to click this plus symbol over here so that we can
open a new tab. To write the query, the first thing I will do is tell BigQuery where the data that I want lives, and to do that I will use the FROM clause: I simply write FROM, and my data lives in the fantasy dataset, in the characters table. Next I tell SQL what data I actually want from this table, and the simplest thing to ask for is all of it, which I can do by writing SELECT star. Now my query is ready, and I can either click Run up here or press Command+Enter on my Mac keyboard, and the query will run. I get a new tab which shows me the results, displayed as a table just as we saw in the Preview tab, and I can get an idea of my results; this is actually the whole table, because that is what I asked for in the query. There are also other ways to see the results provided by BigQuery, such as JSON, which shows the same data in a different format, but we're not going to look into that in this course. One cool option the interface provides: if I click on this arrow right here in my tab, I can select Split tab to right, and now I have a bit less room in my interface, but I am seeing the table on the left and the query on the right, so that I can look at the structure of the table while writing my query. For example, if I click on Schema here, I can see which columns I'm able to reference in my query, and that can be pretty handy. I can also click this toggle to close the Explorer tab temporarily if I don't need to look at those tables, to make a bit more room, and reactivate it when needed. I will now close this tab, go back to the characters table, and show you another way to write a query, which is to use this Query command over here. If I click it, I can select whether I want my query in a new tab or in a split tab; let's say in a new tab, and now BigQuery has helpfully written a template for a query that I can
easily modify in order to get my data. To break down this template: as you can see, we have the SELECT clause that we used before, we have the FROM clause, and then we have a new one called LIMIT. The FROM clause is doing the same job as before, telling BigQuery where we want to get our data, but you will notice that the address looks a bit different from the one I used earlier; specifically, I used the address fantasy.characters. What's happening here is that fantasy.characters is a useful shorthand for the actual address of the table, and what BigQuery provided is the actual full address of the table, or in other words the table ID. As you remember, the table ID indicates the project ID, the dataset name, and the table name, and importantly, this ID is usually enclosed in backticks, which are a quite specific character. Long story short: if you want to be 100% sure, you can use the full address of the table, and BigQuery will provide it for you; but if you are working within the same project where the data lives, so you don't need to reference the project, you can also use this shorthand to make writing the address easier. In this course I will use these two ways of referencing a table interchangeably. I will now keep the address that BigQuery provided. The LIMIT statement, as we will see, simply limits the number of rows that will be returned by this query: no more than 1,000 rows will be returned. Next to the SELECT we have to say what data we want to get from this table, and like before I can write star, and now my query is complete. Before we run our query, I want to draw your attention to this message over here: 'This query will process 1 KB when run.' This is very important, because here BigQuery is telling you how much data will be scanned in order to give you the results of this query. In this case we are returning all the data in the table, therefore all of the table will be scanned; and actually LIMIT does not have
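Putting the pieces together, the generated template looks roughly like this (the project ID inside the backticked table ID will be your own; mine is shown as a placeholder):

```sql
SELECT *
FROM `my-project-id.fantasy.characters`  -- full table ID: project.dataset.table, in backticks
LIMIT 1000                               -- caps the rows returned, not the bytes scanned

-- Shorthand that works from inside the same project:
-- SELECT * FROM fantasy.characters
```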
any influence on that: it doesn't reduce how much data is scanned. So this query will scan 1 kilobyte of data, and the amount of data scanned by a query is the primary determinant of BigQuery costs. As you remember, we are able to scan up to one terabyte of data each month within the sandbox program, and if we wanted to scan more, we would have to pay. So the question is: how many of these queries could we run before running out of our free allowance? To answer that, we can check how many kilobytes are in a terabyte, and if you Google this, the conversion says it's 1 multiplied by 10 to the power of 9, which is 1 billion. Therefore we can run 1 billion of these queries each month before running out of our allowance. Now you understand why I've told you that as long as you work with small tables, you won't really run out of your allowance and you don't really have to worry about costs. However, here's an example of a query that will scan a large amount of data: I've taken one of the public tables provided by BigQuery, which I know to be quite large, and I have told BigQuery to get me all the data in it, and as you can see, BigQuery says that 120 GB of data will be processed once this query runs. You would need only about eight of these queries to get over your free allowance, and if you had connected a billing account to BigQuery, you could be charged money for the extra usage. So be very careful about this: if you work with large tables, always check this message over here before running the query, and remember, you won't actually be charged until you actually hit Run. And there you have it: we learned how the BigQuery interface works and wrote our first SQL query. It is important that we understand how data is organized in SQL. We've already seen a preview of the characters table, and we've said that it is quite similar to how you would see data in a spreadsheet: namely, you have a table
which is a collection of rows and columns, and in this case every row holds a character, and for every character you have a number of pieces of information, such as their ID, their name, their class, their level, and so on. The first fundamental difference from a spreadsheet is that if I want to have some data in a spreadsheet, I can just open a new one and insert some data: ID, level, name, and so on. I could say that I have a character with ID 1 who is level 10 and whose name is Gandalf, and this looks like the data I have in SQL; and I can add some more data as well, a new character with ID 2, level 5, and the name Frodo. Now I save this spreadsheet, and some days later someone else comes in, say a colleague, and they want to add some new data: the ID is unknown, the level is 20.3, the name goes here, and then they also want to record the class, so they just add another column and write Mage. Spreadsheets are of course extremely flexible, because you can always add another column and write in more cells; you can basically write wherever you want. But this flexibility comes at a price, because the more additions we make to the data model represented here, the more complex it gets with time, and the more likely it is that we create confusion or mistakes, which is what actually happens in real life when people work with spreadsheets. SQL takes a different approach: in SQL, before we insert any actual data, we have to agree on the data model that we are going to use, and the data model is essentially defined by two elements, the names of our columns and the data type that each column will contain. For example, we can agree that we will use three columns in our table, ID, level, and name, and then we can agree that ID will be an integer, meaning it will contain whole numbers, level will be an integer as well, and name will be a string, meaning it contains text. Now that we've agreed on this
structure, we can start inserting data into the table, and we have a guarantee that the structure will not change with time; so any queries that we write on top of this table, any sort of analysis that we build on it, will also be durable over time, because it has the guarantee that the data model of the table will not change. Then, if someone else comes in and wants to insert that row from before, they will actually not be allowed to: first of all because they are trying to insert text into an integer column, violating the data type of the ID column; in level they are also violating the data type, because this column only accepts whole numbers and they're trying to put a floating-point number in there; and finally they are violating the column definition, because they're trying to add a class column that was not included in the data model we agreed on. So the most important difference between spreadsheets and SQL is that each SQL table has a schema, and as we've seen before, the schema defines exactly which columns our table has and what the data type of each column is. In this case, for the characters table, we have several columns, and here we can see their names, and each column has a specific data type. All the most important data types are actually represented here: by integer we mean a whole number, and by float we mean a floating-point number; a string is a piece of text; a boolean is a value that is either true or false; and a timestamp is a value that represents a specific point in time. All of this information, the number of columns, the name of each column, and the type of each column, constitutes the schema of the table, and like we've said, the schema is assumed as a given when working in SQL; it is assumed that it will not change over time. In special circumstances there are ways to alter the schema of a table, but it is generally
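As a sketch, the data model agreed on in the spreadsheet example could be written down like this in GoogleSQL DDL (we won't create tables ourselves in this course, and the table name here is made up):

```sql
CREATE TABLE fantasy.example_characters (
  id    INT64,   -- whole numbers only: an 'unknown' text value would be rejected
  level INT64,   -- 20.3 would be rejected as a floating-point value
  name  STRING   -- free text such as 'Gandalf' or 'Frodo'
);
```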
assumed as a given when writing queries, and we shall do the same in this course. Why is it important to keep track of the data type? Why is it important to distinguish between integer, string, and boolean? The simple answer is that the data type defines the type of operations that you can do to a column. For example, if you have an integer or a float, you can multiply the value by two, or divide it, and so on; if you have a string, you can turn it to uppercase or lowercase; if you have a timestamp, you can subtract 30 days from that specific moment in time. So by looking at the data type, you can find out what kind of work you can do with a column. The second fundamental difference from spreadsheets is that spreadsheets are usually disconnected, but SQL has a way to define connections between tables. What we see here is a representation of our three tables, and for each table you can see the schema, meaning the list of columns and their types; but the extra information that we have here is the connections between the tables. You can see that the inventory table is connected to the items table and also to the characters table; moreover, the characters table is connected with itself. We're not going to explore this in depth now, because I don't want to add too much theory; we will see it in detail in the chapter on joins. But it is a fundamental difference from spreadsheets that SQL tables can be clearly connected with each other. And that's basically all you need in order to understand how data is organized in SQL for now: you create a table, and when creating that table you define the schema, which is the list of columns with their names and their data types; you then insert data into the table; and finally you have a way to define how the tables are connected with each other. I will now show you how SQL code is structured and give you the most important concept that you need to understand in order to succeed at SQL. This is a SQL statement: it is like a complete
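The type-specific operations just mentioned can be sketched in a single query (the created_at timestamp column is invented for illustration; the real characters table may not have one):

```sql
SELECT
  level * 2                                  AS doubled_level,   -- arithmetic on INT64 / FLOAT64
  UPPER(name)                                AS name_uppercase,  -- string functions on STRING
  TIMESTAMP_SUB(created_at, INTERVAL 30 DAY) AS a_month_before   -- date math on TIMESTAMP
FROM fantasy.characters
```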
sentence in the SQL language. The statement defines where we want to get our data from and how we want to receive that data, including any processing we want to apply to it, and once we have a statement, we can select Run and it will give us our data. The statement is made up of building blocks which we call clauses, and in this statement we have a clause on every line: the clauses we see here are SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT. Clauses are really the building blocks that we assemble in order to build statements, and what this course is about is understanding what each clause is and how it works, and then understanding how we can put these clauses together to write effective statements. The first thing you need to understand is that there is an order to writing these clauses: you have to write them in the correct order, and there is no flexibility there; if you write them in the wrong order, you will simply get an error. For example, if I were to take the WHERE clause and put it below the GROUP BY clause, you can see that I'm already getting an error here, which is a syntax error. You don't have to worry about memorizing this now, because you will pick up this order as we learn each clause in turn. Now, the essential thing that you need to understand, and that slows down so many SQL learners, is that while we are forced by SQL to write clauses in this specific order, this is not actually the order in which the clauses are executed. If you've worked with another programming language, such as Python or JavaScript, you're used to the fact that each line of your program is executed in turn, from top to bottom, generally speaking, and that is pretty transparent to understand; but this is not what is happening in SQL. To give you a sense of the order in which these clauses are run: on a logical level, SQL first reads the FROM, then it does the WHERE, then the GROUP BY, then the HAVING, then it does the SELECT part,
after the SELECT part is done it does the ORDER BY, and finally the LIMIT. All of this just to show that the order in which operations are executed is not the same as the order in which they're written. In fact, we can distinguish three orders that pertain to SQL clauses, and this distinction is so important to help you master SQL. The first level is what we call the lexical order, and this is simply what I've just shown you: it's the order in which you have to write these clauses so that SQL can actually execute the statement and not throw you an error. Then there's the logical order, and this is the order in which the clauses are actually run, logically, in the background, and understanding this logical order is crucial for accelerating your learning of SQL. And finally, for the sake of completeness, I had to include the effective order here, because what happens in practice is that your statement is executed by a SQL engine, and that engine will usually try to take shortcuts, optimize things, and save on processing power and memory, so the actual order might be a bit different, because the clauses might be moved around in the process of optimization. But like I said, I've only included it for the sake of completeness, and we're not going to worry about that level in this course; we are going to focus on mastering the lexical order and the logical order of SQL clauses. To help you master the logical order of SQL clauses, or SQL operations, I have created this schema, and this is the fundamental tool that you will use in this course. This schema, as you learn it progressively, will allow you to build a powerful mental model of SQL that will let you tackle even the most complex and tricky SQL problems. What this schema shows you is all of the clauses that you will work with when writing SQL statements, so these are the building blocks that you will use in order to assemble your logic, and the sequence in which they're shown corresponds to the logical order in which
they are actually executed. There are three simple rules for you to understand this schema. The first rule is that operations are executed sequentially from left to right. The second rule is that each operation can only use data that was produced by operations that came before it. And the third rule is that each operation cannot know anything about data that is produced by operations that follow it. What this means in practice is that if you take any of these components, for example the HAVING component, you already know that HAVING will have access to data that was produced by the operations that are to its left: aggregations, GROUP BY, WHERE, and FROM. However, HAVING will have absolutely no idea of information that is produced by the operations that follow, for example window, or SELECT, or UNION, and so on. Of course, you don't have to worry about understanding and memorizing this now, because we will tackle it gradually throughout the course, and throughout the course we will go back to the schema again and again in order to make sense of the work we're doing and understand the typical errors and pitfalls that happen when working with SQL. Now, you may be wondering why there are these two cases where you actually see two components stacked on top of each other, that being FROM and JOIN, and then SELECT and alias. These are components that are tightly coupled together, and they occur at the same place in the logical ordering, which is why I have stacked them like this. In this section we tackle the basic components that you need to master in order to write simple but powerful SQL queries, and we are back here with our schema of the logical order of SQL operations, which is also our map for everything that we learn in this course. But as you can see, there is now some empty space in the schema, because to help us manage the complexity I have removed all of the components that we will not be tackling in this section. Let us now learn about FROM and SELECT, which are really
the two essential components that you need in order to write the simplest SQL queries. Going back now to our data, let's say that we wanted to retrieve all of the data from the characters table in the fantasy dataset. Now, when you have to write a SQL query, the first question you need to ask yourself is: where is the data that I need? Because the first thing that you have to do is to retrieve the data, which you can then process and display as needed. In this case it's pretty simple: we know that the data we want lives in the characters table. Once you've figured out where your data lives, you can write the FROM clause, so I always suggest starting queries with the FROM clause. To get the table that we need, we can write the name of the dataset, followed by a dot, followed by the name of the table, and you can see that BigQuery has recognized the table here. So I have written the FROM clause and I have specified the address of the table, which is where the data lives, and now I can write the SELECT clause, and in the SELECT clause I can specify which columns of the table I want to see. If I click on the characters table here, it will open in a new tab in my panel, and as you remember, it shows me here the schema of the table, and the schema includes the list of all the columns. Now I can simply decide that I want to see the name and the guild, and so in the SELECT clause here I will write name and guild, and when I run this I get the table with the two columns that I need. And one neat thing about this: I could write the columns in any order; it doesn't have to be the original order of the schema, and the result will show that order. And if I wanted to get all of the columns of the table, I could write them here one by one, or I could write star, which is a shorthand for saying: please give me all of the columns. So this is the corresponding data to our table in Google Sheets, and if you want to visualize SELECT in your mind, you can imagine it as vertically selecting the parts
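To make the FROM-then-SELECT flow concrete outside of BigQuery, here is a minimal sketch using Python's built-in sqlite3 module; the table name and columns echo the course's fantasy dataset, but the rows themselves are invented for illustration:

```python
import sqlite3

# In-memory database standing in for the fantasy dataset
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE characters (
        name TEXT, guild TEXT, class TEXT,
        level INTEGER, is_alive INTEGER
    )
""")
conn.executemany(
    "INSERT INTO characters VALUES (?, ?, ?, ?, ?)",
    [
        ("Frodo",   "Shire",    "Hobbit", 12, 1),  # invented example rows
        ("Legolas", "Mirkwood", "Archer", 35, 1),
        ("Boromir", "Gondor",   "Archer", 28, 0),
    ],
)

# FROM sources the table, SELECT picks the columns -- in any order we like
rows = conn.execute("SELECT name, guild FROM characters").fetchall()
print(rows)
```

Writing `SELECT *` instead would return every column, just as described above.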
of the table that you need. For example, if I were to write SELECT guild and level, this would be equivalent to taking these two columns over here and selecting them. Let us now think of the logical order of these operations: first comes the FROM and then comes the SELECT, and this makes logical sense, right? Because the first thing you need to do is to source the data, and later you can select the parts of the data that you need. In fact, if we look at our schema over here, FROM is the very first component in the logical order of operations, because the first thing that we need to do is to get our data. We have seen that the SELECT clause allows us to get any columns from our table in any order, but the SELECT clause has many other powers, so let's see what else we can do with it. One useful thing to know about SQL is that you can add comments in the code. Comments are parts of text which are not executed as code; they're just there for you to keep track of things or explain what you are doing. So I'm going to write a few comments now, and the way we do comments is by writing dash dash. And now I'm going to show you aliasing. Aliasing is simply renaming a column, so I could take the level column and say AS character_level, providing a new name, and after I run this we can see that the name of the column has changed. Now, one thing that's important to understand, as we now start transforming the data with our queries, is that any sort of change that we apply, such as in this case changing the name of the column, only affects our results; it does not affect the original table that we are querying. So no matter what we do here moving forward, the actual table fantasy.characters will not change; all that will change are the results that we get after running our query. And of course there are ways to go back to fantasy.characters and permanently change it, but that is outside the scope for us. Going back to our schema, you will see that alias has its own component, and it
happens at the same time as the SELECT component. This is important because, as we will see in a bit, it's a common temptation to use these aliases, these labels that we give to columns, in the phases that precede this stage, which typically fails because, as our rules say, every component does not have access to data that is computed after it; something that we will come back to. Now, another power of SELECT that we want to show is constants. Constants are the ability to create new columns which have a constant value. For example, let's say that I wanted to implement a versioning system for my characters, and I would say that right now all the characters I have are version one, but then in the future, every time I change a character, I will increase that version, and so that will allow me to keep track of changes. I can do that by simply writing 1 over here in the column definition, and when I run this you will see that SQL has created a new column and it has put 1 in every row of that column. This is why we call it a constant column: if I scroll down, all of it will be 1. And this column has a weird name, because we haven't provided a name for it yet, but we already know how to do this: we can use the aliasing syntax to call it version, and here we go. So, in short, when you write a column name in the SELECT clause, SQL looks for that column in the table and gives you that column, but when instead you write a value, SQL creates a new column and puts that value in every row. The next thing that SQL allows me to do is calculations, so let me call the experience column here as well and get my data. Now, one thing I could do is to take experience and divide it by 100, and what we see here is a new column which is the result of this calculation. Now, 100 is a constant value, right? So you can imagine that in the background SQL has created a new column and put 100 in every row, and then it has done the calculation between experience and that new column, and we get
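A quick sketch of aliasing, constant columns, and a simple calculation together, again using sqlite3 as a stand-in for BigQuery (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, level INTEGER, experience INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)",
                 [("Frodo", 12, 3400), ("Legolas", 35, 9100)])

# AS renames a column in the results only; a bare literal (1) becomes a
# constant column; experience / 100.0 creates a calculated column
cur = conn.execute("""
    SELECT name,
           level AS character_level,
           1 AS version,
           experience / 100.0
    FROM characters
""")
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(column_names)
print(rows)
```

None of this touches the stored table; only the result set that comes back is renamed and extended.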
this result. And, in short, we can do any sort of calculation we want, combining current columns and constants as well. For example, although this doesn't make any sense, I could take experience, add 100 to it, divide it by character_level, and then multiply it by two, and we see that we got an error. Can you understand why we got this error? Pause the video and think for a second. I am referring to my column as character_level, but what is character_level really? It is a label that I have assigned over here. Now, if we go back to our schema, we can see that SELECT and alias happen at the same time, so this is the phase in which we assign our label, and this is also the phase in which we try to call our label. Now, if you look at our rules, this is not supposed to work, because an operation can only use data produced by operations before it, and alias does not happen before SELECT; it happens at the same time. In other words, this part over here, where we say character_level, is attempting to use information that was produced right here, where we assigned the label, but because these parts happen at the same time, it's not aware of the label. All this to say that the logical order of operations matters, and that what we want here is to actually call it level, because that is the name of the column in the table, and now when I run this I get a resulting number. So, going back to our original point, we are able to combine columns and constants with any sort of arithmetic operations. Another very powerful thing that SQL can do is to apply functions, and a function is a prepackaged piece of logic that you can apply to our data. It works like this: there is a function called SQRT, which stands for square root, which takes a number and computes the square root. So you call the function by name, then you open round brackets, and in the round brackets you provide the argument, and the argument can be a constant, such as 16, or it can be a column, such as level. And when I run this you can see that in this
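Functions can be sketched the same way. One caveat in the sketch below: SQLite builds don't always ship math functions, so we register our own SQRT through sqlite3's `create_function`; BigQuery's SQRT is built in and is called the same way:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, level INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?)",
                 [("Frodo", 16), ("Legolas", 25)])

# Register a Python implementation under the SQL name SQRT
conn.create_function("SQRT", 1, math.sqrt)

# The argument can be a column (level) or a constant (16)
rows = conn.execute("SELECT name, SQRT(level), SQRT(16) FROM characters").fetchall()
print(rows)
```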
case the square root of 16 is calculated as 4, and this creates a constant column, and then here, for each value of level, we have also computed the square root. There are many functions in SQL, and they vary according to the data type which you provide. As you remember, we said that knowing the data types of columns, such as distinguishing between numbers and text, is important because it allows us to know which operations we can apply, and so there are functions that work only on certain data types. For example, here we see square root, which only works on numbers, but we also have text functions, or string functions, which only work on text. One of them is UPPER, so if I take UPPER and provide guild as an argument, what do you expect will happen? We have created a new column where the guild is shown in all uppercase. So how can I remember which functions there are and how to use them? The short answer is: I don't. There are many, many functions in SQL, and here in the documentation you can see a very long list of all the functions that you can use in BigQuery. And as we said, the functions vary according to the data that they can work on, so if you look here on the left, we have array functions, date functions, mathematical functions, numbering functions, time functions, and so on and so on. It is impossible to remember all of these functions, so all you need to know is how to look them up when you need them. For example, if I know I need to work with numbers, I could scroll down here and go to mathematical functions, and here I have a long list of all the mathematical functions, and I can see them all on the right, and I should be able to find the square root function that I have shown you. And here the description tells me what the function does, and it also provides some examples. To summarize, these are some of the most powerful things you can do with a SELECT clause: not only can you retrieve every column you need in any order, you can rename columns according to your needs, you can
define constant columns with a value that you choose, you can combine columns and constant columns in all sorts of calculations, and you can apply functions to do more complex work. I definitely invite you to go ahead and put your own data in BigQuery, as I've shown you, and then start playing around with SELECT and see how you can transform your data with it. One thing worth knowing is that I can also write queries that only include SELECT, without the FROM part, that is, queries that do not reference a table. Let's see how that works. Now, after I write SELECT, I clearly cannot reference any columns, because there is no table, but I can still reference constants. For example, I could say 'hello', 1, and false, and if I run this I get this result. So remember, in SQL we always query tables and we always get back tables; in this case we didn't reference any previous table, we've just created constants, so what we have here are three columns with constant values, and there is only one row in the resulting table. This is useful mainly to test stuff. So let's say I wanted to make sure that the square root function does what I expect it to do: I could call it right here and look at the result. Let's use this capability to look into the order of arithmetic operations in SQL. So if I write an expression like this, would you be able to compute the final result? In order to do that, you should be able to figure out the order in which all these operations are done, and you might remember this from arithmetic in school, because SQL applies the same order that is taught in school. We could define the order as follows: first you execute any specific functions that take a number as their target, then you have multiplication and division, then you have addition and subtraction, and brackets go first, so you first execute things that are within brackets. So pause the video, apply these rules, and see if you can figure out what this result will give us. Now let's do this operation and do
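A table-free SELECT looks like this in the sqlite3 stand-in; with no FROM clause we query no table, yet we still get a one-row table back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three constant columns, one resulting row
# (SQLite has no true Boolean type, so 0 stands in for false here)
row = conn.execute("SELECT 'hello', 1, 0").fetchone()
print(row)
```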
it in stages, like we were doing in school. So, first of all, we want to deal with what's in the brackets, right? I will now consider this bracket over here, and in this bracket we have a multiplication and an addition. Multiplication goes first, so first I will execute this, which will give me 4, and then I will have 3 + 4 + 1, which gives me 8. Next I will copy the rest of the operation, and here I reach another bracket. To execute what is in these brackets, I need to first execute the function: this is the power function, so it takes 2 and exponentiates it to the power of 2, which gives 4, and then 4 minus 2 will give me 2, and this is what we get. Now we can solve this line, and first of all we need to execute multiplication and division in the order in which they occur. So the first operation that occurs here is 4 / 2, which is 2, and I will just copy this for clarity: 8 - 2 * 2 / 2. The next operation that occurs now is 2 * 2, which is 4, so that would be 8 - 4 / 2, and the next operation is 4 / 2, which is 2, so I will have 8 - 2, and all of this gives me a 6. Now, all of these are comments, and we only have one line of code here, and to see whether I was right I just need to execute this code, and indeed I get 6. So that's how you can use the SELECT clause on its own to test your assumptions and your operations, and that was a short refresher on the order of arithmetic operations, which will be important for solving certain SQL problems. Let us now see how the WHERE clause works. Now, looking at the characters table, I see that there is a field which is called is_alive, and this field is of type Boolean; that means that the value can be either true or false. So if I go to the preview here and scroll to the right, I can see that for some characters this is true and for others it is false. Now let's say I only wanted to get those characters which are actually alive, and so to write my query I would first write the address of the table, which is fantasy.characters. Next I could
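The exact expression isn't visible in the transcript, but one expression consistent with the staged walkthrough is sketched below. Python follows the same precedence rules (brackets first, then functions and exponents, then multiplication and division left to right, then addition and subtraction), so we can check the final value directly:

```python
# First bracket: multiplication before addition -> 2 * 2 = 4, then 3 + 4 + 1 = 8
left_bracket = 3 + 2 * 2 + 1

# Second bracket: the power function goes first -> pow(2, 2) = 4, then 4 - 2 = 2
right_bracket = pow(2, 2) - 2

# Remaining line, multiplication/division resolved left to right:
# 8 - 4/2 * 2 / 2  ->  8 - 2 * 2 / 2  ->  8 - 4 / 2  ->  8 - 2  ->  6
result = left_bracket - 4 / 2 * right_bracket / 2
print(result)
```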
use the WHERE clause to get those rows where is_alive is true, and finally I could do a simple SELECT star to get all the columns, and here I see that I only get the rows where is_alive is equal to true. So WHERE is effectively a tool for filtering table rows: it filters them because it only keeps rows where a certain condition is true and discards all of the other rows. So if you want to visualize how the WHERE filter works, you can see it as a horizontal selection of certain slices of the table, like in this case, where I have colored all of the rows in which is_alive is true. Now, the WHERE clause is not limited to Boolean fields; it's not limited to columns that can only be true or false. We can run the WHERE filter on any column by making a logical statement about it. For example, I could ask to keep all the rows where health is bigger than 50. This is a logical statement, health bigger than 50, because it is either true or false for every row, and of course the WHERE filter will only keep those rows where this statement evaluates to true, and if I run this I can see that in all of my results health will be bigger than 50. And I can also combine smaller logical statements with each other to make more complex logical statements. For example, I could say that I want all the rows where health is bigger than 50 and is_alive is equal to true. Now all of this becomes one big logical statement, and again, this will be true or false for every row, and we will only keep the rows where it is true, and if I run this you will see that in the resulting table the health value is always above 50 and is_alive is always true. In the next lecture we will see in detail how these logical statements work and how we can combine them effectively, but now let us focus on the order of operations and how the WHERE clause fits in there. When it comes to the lexical order, the order in which we write things, it is pretty clear from this example: first you have SELECT, then FROM, and after FROM you have the WHERE
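The combined filter just described can be sketched in the sqlite3 stand-in (invented rows; 1/0 play the role of true/false, since SQLite has no native Boolean type):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, health INTEGER, is_alive INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)",
                 [("Frodo", 80, 1), ("Boromir", 70, 0), ("Pippin", 40, 1)])

# WHERE keeps only the rows where the whole logical statement is true
rows = conn.execute(
    "SELECT * FROM characters WHERE health > 50 AND is_alive = 1"
).fetchall()
print(rows)
```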
clause, and you have to respect this order. When it comes to the logical order, you can see that the WHERE clause follows right after the FROM clause, so it is actually second in the logical order. If you think about it, this makes a lot of sense, because the first thing that I need to do is to get the data from where it lives, and then the first thing I want to do after that is drop all the rows that I don't need, so that my table becomes smaller and easier to deal with. There is no reason why I should carry over rows that I don't actually need, data that I don't actually want, and waste memory and processing power on it, so I want to drop those unneeded rows as soon as possible. And now that you know that WHERE happens at this stage in the logical order, you can avoid many of the pitfalls that happen when you're just learning SQL. Let's see an example. Take a look at this query: I'm going to the fantasy characters table, then I'm getting the name and the level, and then I'm defining a new column, which is simply level divided by 10, and I'm calling this level_scaled. Now let's say that I wanted to only keep the rows that have at least 3 as level_scaled, so I would go here and write a WHERE filter: where level_scaled is bigger than 3, and if I run this I get an error: unrecognized name. Can you figure out why we get this error? level_scaled is an alias that we assign in the SELECT stage, but the WHERE clause occurs before the SELECT stage, so the WHERE clause has no way to know about this alias. In other words, the WHERE clause is at this point, and our rules say that an operation can only use data produced by operations before it, so the WHERE clause has no way of knowing about the label, which is assigned at this stage. So how can we solve this problem right here? The solution is to not use the alias and to instead repeat the logic of the transformation, and this actually works, because it turns out that when you write logical statements in the WHERE filter, you can not only
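The portable fix of repeating the expression can be sketched like this. One caveat: SQLite happens to accept result-column aliases in WHERE as an extension, so the BigQuery error itself won't reproduce there, but the repeated-expression form below works on every engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, level INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?)",
                 [("Frodo", 12), ("Legolas", 35), ("Gandalf", 50)])

# WHERE runs before SELECT, so standard SQL cannot see the alias
# level_scaled; repeating the calculation in the filter works everywhere
rows = conn.execute("""
    SELECT name, level / 10.0 AS level_scaled
    FROM characters
    WHERE level / 10.0 > 3
""").fetchall()
print(rows)
```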
reference the columns of the tables, but you can also reference operations on columns, and this way of writing operations on columns, and combinations between columns, works just as we have shown in the SELECT part. So that was all you need to know to get started with the WHERE clause, a powerful clause that allows us to filter out the rows that we don't need and keep the rows that we need, based on logical conditions. Now let's delve a bit deeper into how exactly these logical statements work in SQL, and here is a motivating example for you. This is a selection from the characters table, and we have a WHERE filter, and this WHERE filter is needlessly complicated. I did this intentionally, because by the end of this lecture you should have no trouble at all interpreting this statement and figuring out for which rows it will be true, and likewise you will have no problem writing complex statements yourself, or deciphering them when you encounter them in the wild. The way that these logical statements work is through something called Boolean algebra, which is an essential theory for working with SQL, but also for working with any other programming language, and is indeed fundamental to the way that computers work. And though the name may sound a bit scary, it is really easy to understand the fundamentals of Boolean algebra. Now let's look back at so-called normal algebra, meaning the common form that is taught in schools. In this algebra you have a bunch of elements, of which in this case I'm only showing a few positive numbers, such as 0, 25, 100. You also have operators that act on these elements, for example the square root symbol, the plus sign, the minus sign, the division sign, or the multiplication sign. And finally you have operations, right? So in operations you apply the operators to your elements and then you get some new elements out of them. So here are two different types of operation: in one case we take this operator, the square root, and we apply it to a single element, and out of
this we get another element; in the second kind of operation we use this operator, the plus sign, to actually combine two elements, and again we get another element in return. Boolean algebra is actually very similar, except that it's simpler in a way, because you can only have two elements: either true or false. Those are all the elements that you are working with, and of course this is why, when there's a Boolean field in SQL, it is a column that can only have these two values, true and false. Now, just like normal algebra, Boolean algebra has several operators that we can use to transform the elements, and for now we will only focus on the three most important ones, which are NOT, AND, and OR. And finally, in Boolean algebra we also have operations, and in operations we combine operators and elements and get back elements. Now we need to understand how these operators work, so let us start with the NOT operator. To figure out how a Boolean operator works, we have to look at something called a truth table, so let me look up the truth table for the NOT operator; in this Wikipedia article this is available here, at logical negation. Now, first of all, we see that logical negation is an operation on one logical value. What does this mean? It means that the NOT operator works on a single element, such as NOT true or NOT false, and this is similar to the square root operator in algebra, which works on a single element, a single number. Next we can see how exactly this works: given an element that we call p, and of course p can only be true or false, the negation of p is simply the opposite value, so NOT true is false and NOT false is true. And we can easily test this in our SQL code: if I say SELECT NOT TRUE, what do you expect to get? We get false, and if I do SELECT NOT FALSE, I will of course get true. Next, let's see how the AND operator works. So we've seen that the NOT operator works on a single element; on the other hand, the AND operator connects two elements, such as writing true and
false, and in this sense the AND operator is more similar to the plus sign here, which connects two elements. So what is the result of true AND false? To figure this out, we have to go back to our truth tables, and I can see it here, at the logical conjunction section, which is another name for the AND operator. Now, the AND operator combines two elements, and each element can either be true or false, so this creates the four combinations that we see here in this table, and what we see here is that only if both elements are true does the AND operator return true; in any other case it returns false. So, going back here, if I select true AND false, what do you expect to see? I am going to get false, and it's only in the case when I do true AND true that the result here will be true. And finally we can look at the OR operator, which is also known as logical disjunction. It also combines two elements, and it also has four combinations, but in this case, if at least one of the two elements is true, then you get true, and only if both elements are false do you get false. And so, going back to our SQL: true OR true will of course be true, but even if one of them is false we will still get true, and only if both are false will we get false. So now you know how the three most important operators in Boolean algebra work. Now, the next step is to be able to solve long and complex expressions, such as this one, and you already know how the operators work; the only information you're missing is the order of operations. Just like in arithmetic, we have an agreed-upon order of operations that helps us solve complex expressions, and the order of operations is written here: first you solve for NOT, then you solve for AND, and finally for OR, and as with arithmetic, you first solve for the brackets. So let's see how that works in practice. Let us now simplify this expression. So the first thing I want to do is to deal with the brackets, so if I copy all of this part over here as a comment, so it doesn't run as
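The three truth tables can be checked directly in Python, whose `not`, `and`, and `or` behave the same way on Boolean values:

```python
# NOT flips a single value
assert (not True) is False
assert (not False) is True

# AND is true only when both sides are true
and_table = {(p, q): (p and q) for p in (True, False) for q in (True, False)}
assert and_table[(True, True)] is True
assert and_table[(True, False)] is False
assert and_table[(False, True)] is False
assert and_table[(False, False)] is False

# OR is true when at least one side is true
or_table = {(p, q): (p or q) for p in (True, False) for q in (True, False)}
assert or_table[(False, False)] is False
assert all(or_table[pair] for pair in or_table if pair != (False, False))
print("all truth tables check out")
```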
code, you will see that this is the most nested bracket, the innermost bracket in our expression, and we have to solve for this. So what is true OR true? This is true, right? And now I can copy the rest of my expression up to here, and here I can solve the innermost bracket as well, so I can say true, and what I have here is false AND true, so this is false, right? Because when you have AND, both of them need to be true for it to return true; otherwise it's false. So I will write false. Moving on to the next line, I need to solve what's in the bracket, so I can copy the NOT, and now I have to solve what's in this bracket over here. Now, there are several operators here, but we have seen that NOT has precedence, right? So I will copy true, and here I have NOT false, which becomes true, and then I can copy the last part of the bracket; I'm not going to do any more at this step, to avoid confusion. And then I have OR, and I can solve for this bracket over here: true AND false is actually false. Moving on, I can keep working on my bracket, and so I have a lot of operations here, but I need to give precedence to the ANDs. So the first AND that occurs is this one, and that means I have to start with this expression over here: true AND true results in true. And then, moving on, I will copy the OR over here, and now I have another AND, which means that I have to isolate this expression: false AND true results in false. And finally I can copy the final AND, because I'm not able to compute it yet, since I needed to compute the left side, and I can copy the remaining part as well. Moving on to the next line, I still need to do the AND, because the AND takes precedence, and so this is the expression that I have to compute: I will say true OR, and then this expression, false AND true, computes to false, and then copy the rest. Now let me make some room over here and go to the next line, and I can finally compute this bracket: we have true OR false, which we know is true. Next I need to invert this value, because I have
NOT true, which is false, and then I have OR false, and finally this computes to false. And now, for the moment of truth, pun intended, I can run my code and see if the result actually corresponds to what we got, and the result is false. So, in short, this is how you can solve complex expressions in Boolean algebra: you just need to understand how these three operators work, and you can use truth tables like this one over here to help you with that, and then you need to remember to respect the order of operations, and then if you proceed step by step you will have no problem solving this. But now let's go back to the query with which we started, because what we have here is a complex logical statement that is plugged into the WHERE filter, and it isolates only certain rows, and we want to understand exactly how this statement works. So let us apply what we've just learned about Boolean algebra to decipher this statement. Now, what I've done here is to take the first row of our results, which you see here, and just copy the values into a comment, and then I've taken our logical statement and copied it here as well. So let us see what SQL does when it checks this row. The first thing that we need to do is to take all of these statements in our WHERE filter and convert them to true or false, and to do that we have to look at our data. Let us start with the first component, which is level bigger than 20: for the row that we are considering, level is 12, so this comes out as false. Next I will copy this AND, and here we have is_alive equals true; now, for our row, is_alive equals false, so this statement computes as false. Mentor_id is not null, with null representing absence of data: in our case mentor_id is 1, so it is indeed not null, so here we have true. And finally, what we have in here is class IN ('Mage', 'Archer'). So we have not seen this before, but it should be pretty intuitive: this is a membership test, which looks at class, which in this case is Hobbit, and checks whether it can be found in this
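That per-row substitution can be mimicked in Python. The row values come from the walkthrough, but the exact bracketing of the statement is not fully visible in the transcript, so the grouping below is a plausible reconstruction, not the course's literal query:

```python
# Values from the first result row in the walkthrough
level = 12
is_alive = False
mentor_id = 1            # present, i.e. not null
char_class = "Hobbit"

# Each comparison becomes True or False for this row
c1 = level > 20                          # False
c2 = is_alive                            # False
c3 = mentor_id is not None               # True
c4 = char_class in ("Mage", "Archer")    # False

# Reconstructed statement:
# (level > 20 AND is_alive) OR (mentor_id IS NOT NULL AND NOT class IN (...))
keep_row = (c1 and c2) or (c3 and not c4)
print(keep_row)
```

Since the row survived the WHERE filter in the video, the overall result has to come out true, which this reconstruction reproduces.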
list, and in our case this is false. So now that we've plugged in all the values for our row, what we have here is a classic Boolean algebra expression, and we are able to solve this based on what we've learned. So let us go and solve it. First I need to deal with the brackets, and what I have here is an AND and an OR, and the AND takes precedence, so false AND false is false, and I will copy the rest, and here I have NOT false, which is true. Next we have false OR true, which is true, AND true, and in the end this computes to true. Now, in this case we sort of knew that the result was meant to come out as true, because we started from a row that survived the WHERE filter, and that means that for this particular row this statement had to compute as true, but it's still good to know exactly how SQL has computed this and understand exactly what's going on. And this is how SQL deals with complex logical statements: for each row, it looks at the relevant values in the row, so that it can convert the statement to a Boolean algebra expression, and then it uses the Boolean algebra rules to compute a final result, which is either true or false, and then if this computes as true for the row, the row is kept; otherwise the row is discarded. And this is great to know, because this way of resolving logical statements applies not only to the WHERE component but to all components in SQL which use logical statements, and which we shall see in this course. Let us now look at the DISTINCT clause, which allows me to remove duplicate rows. So let's say that I wanted to examine the class column in my data: I could simply select it and check out the results. So what if I simply wanted to see all the unique types of class that I have in my data? This is where DISTINCT comes in handy: if I write DISTINCT here, I will see that there are only four unique classes in my data. Now, what if I was interested in the combinations between class and guild in my data? So let me remove the DISTINCT for now
And add guild here; for us to better understand the results, I'm going to add an ordering. Here are the combinations of class and guild in my data: there is a character who is an Archer and belongs to Gondor, there are actually two characters who are Archers and belong to Mirkwood, there are many Hobbits from the Shire, and so on. But again, what if I was interested in the unique combinations of class and guild in my data? I could add the DISTINCT keyword here, and as you can see, there are no more repetitions: Archer and Mirkwood occurs only once, Hobbit and the Shire occurs only once, because I'm only looking at unique combinations. And of course I could go on and add more columns, expanding the results to show the unique combinations between those columns; here Hobbit and the Shire has expanded again, because some Hobbits are alive and others unfortunately are not. At the limit, I could have a star here, and what I would get back is actually my whole data, all 15 rows, because what we're doing here is looking for rows that have the same value on all columns, rows that are complete duplicates, and there are no such rows in the data. So when I do SELECT * in this case, DISTINCT has no effect.

So, in short, how DISTINCT works: it looks only at the columns that you have selected; then it looks at all the rows, and two rows are duplicates if they have the exact same values on every column that you have selected; duplicate rows are removed and only unique rows are preserved. Just like the WHERE filter, DISTINCT is a clause that removes certain rows, but it is stricter and less flexible in a sense: it does only one job, and that job is to remove duplicate rows based on your selection. And if we look at our map of SQL operations, we can place DISTINCT right after SELECT, and this makes sense, because we have seen that DISTINCT works only on the columns that you have selected, and so it has to wait for SELECT to choose the columns that we're interested in, and then we can deduplicate based on those.

For the following lecture on unions I wanted to have a very clear example, so I decided to go to the characters table, split it in two, and create two new tables out of it. And then I thought that I should show you how I'm doing this, because it's a pretty neat thing to know, and it will help you when you are working with SQL in BigQuery. So here's a short primer on yet another way to create a table in BigQuery: you can use your newly acquired power of writing SQL queries to turn those queries into permanent tables.

Here's how you can do it. First, I've written a simple query here, and you should have no trouble understanding it by now: go to the fantasy characters table, keep only the rows where is_alive is true, and then get all the columns. Next, we need to choose where the new table will live and what it will be called, so I'm placing it also in the fantasy dataset and calling it characters_alive. And finally I have a simple command, which is CREATE TABLE. Now, what you see here is a single statement in SQL, a single command that will create the table, and you can in fact have multiple statements within the same code and run them all together when you hit run. The trick is to separate them with this semicolon over here; the semicolon tells SQL: hey, this command is over, and next I might add another one. So here we have the second statement that we're going to run, and it looks just like the one above, except that our query has changed, because we're getting the rows where is_alive is false, and we are calling this table characters_dead. So I have my two statements, they're separated by semicolons, and I can just hit run. I will see over here that BigQuery is showing me the two statements on two different rows, and you can see that they are both done now. And if I open my Explorer over here, I will see that I have two new tables.
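Written out, the two-statement script looks roughly like this; the dataset name and the underscored table names are my assumption, since BigQuery identifiers cannot contain spaces:

```sql
-- Statement 1: the living characters.
CREATE TABLE fantasy.characters_alive AS
SELECT *
FROM fantasy.characters
WHERE is_alive = true;  -- the semicolon ends the first statement

-- Statement 2: the dead characters.
CREATE TABLE fantasy.characters_dead AS
SELECT *
FROM fantasy.characters
WHERE is_alive = false;
```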
And if I go to characters_alive, is_alive will of course be true on every row. Now, what do you think would happen if I ran this script again? Let's try it. I get an error; the error says that the table already exists, and this makes sense, because I've told SQL to create a table, but SQL says: that table already exists, I cannot create it again.

There are ways to tell SQL what to do if the table already exists, so that we specify the behavior we want and don't just get an error. One way is to say CREATE OR REPLACE TABLE fantasy.characters_alive, and what this does is that if the table already exists, BigQuery will delete it and then create it again; in other words, it will overwrite the data. So let's write it down and make sure that this query actually works: when I run it, I get no errors even though the table already existed, because BigQuery was able to remove the previous table and create a new one. Alternatively, we may want to create the table only if it doesn't exist yet, and leave it untouched otherwise. In that case we can say CREATE TABLE IF NOT EXISTS: if the table already exists, BigQuery won't touch it and won't throw an error, but if it doesn't exist, it will be created. Let us write this down too and make sure that it runs without errors, and we see that here, too, we get no errors.

And that, in short, is how you can save the results of your queries in BigQuery and make them into full-fledged tables that you can keep and query at will. I think this is a really useful feature if you're analyzing data in BigQuery, because any results of your queries that you would like to keep, you can just save, and then come back and find them later.

Let's learn about unions now. To show you how this works, I have taken our characters table and split it into two parts, and I believe the names are quite self-descriptive: there is a separate table for characters who are alive and a separate table for characters who are dead. You can look at the previous lecture to see how I've done this, how I've used a query to create two new tables; but this is exactly the characters table, with the same schema, the same columns, the same types, just split in two based on the is_alive column.

Now let us imagine that we do not have the fantasy.characters table anymore, the table with all the characters, because it was deleted or we never had it in the first place. Let's pretend that we only have these two tables, characters_alive and characters_dead, and we want to reconstruct the characters table out of them; we want a table with all the characters. How can we do that? What I have here are two simple queries: select star from fantasy.characters_alive, and select star from fantasy.characters_dead. These are two separate queries, but in BigQuery there are ways to run multiple queries at the same time, so I'm going to show you first how to do that.

An easy way is to write your queries and add a semicolon at the end of each, so that what you have is basically a SQL script containing multiple SQL statements, in this case two. If you hit run, all of these will be executed sequentially, and when you look at the results, you're not just getting a table anymore, because it's not just a single query that has been executed; you can see that two commands have been executed, which are here, and for each of the two you can simply click View results and you will get to the familiar results tab. If I want to see the other one, I click on the back arrow here and click on the other View results. Another way to handle this is to select the query that I'm interested in and then click run, and here I see the results: BigQuery has only executed the part that I have selected.
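Circling back to the table-creation primer for a moment, the two rerun-safe variants can be sketched like this (dataset and table names assumed, as before):

```sql
-- Overwrite the table if it already exists:
CREATE OR REPLACE TABLE fantasy.characters_alive AS
SELECT * FROM fantasy.characters WHERE is_alive = true;

-- Create the table only if it is missing; leave an existing one untouched:
CREATE TABLE IF NOT EXISTS fantasy.characters_alive AS
SELECT * FROM fantasy.characters WHERE is_alive = true;
```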
Or I can decide to run the other query in my script: select it, click run, and I will see the results for that query. This is a pretty handy functionality in BigQuery. It's also functionality that might give you some headaches if you don't know about it, because if for some reason you selected a part of the code during your work and then just want to run everything, you might hit run and get an error, because BigQuery is only seeing the part that you selected and cannot make sense of it. So it's good to know about this.

But our problem has not been solved yet, because remember, we want to reconstruct the characters table, and what we have here are two queries: we can run them separately and look at the results separately, but we still don't have a single table with all the results. This is where UNION comes into play. UNION allows me to stack the results from these two tables. So first I will take off the semicolons, because this will become a single statement, and then in between these two queries I will write UNION DISTINCT, and when I run this, you can verify for yourself: we have 15 rows, and we have indeed reconstructed the characters table.

So what exactly is going on here? It's actually pretty simple: SQL is taking all of the rows from the first query and all of the rows from the second query and stacking them on top of each other. You can really imagine the act of vertically stacking one table on top of the other to create a new table which contains all of the rows of the two queries combined. That, in short, is what UNION does.

Now, there are a few details that we need to know when working with UNION, and to figure them out, let us look at a toy example. I've created two very simple tables, toy_1 and toy_2, and you can see how they look in these comments over here. They both have three columns, imaginatively called col_1, col_2, and col_3, and then this is the toy_1 table and this is the toy_2 table. Just like before, we can combine these tables by selecting all of them and writing a UNION in between. Now, in BigQuery you are not allowed to write UNION without a further qualifier, a keyword, and it has to be either ALL or DISTINCT; you have to choose one of the two. And what is the choice about? Well, if you do UNION ALL, you will get all of the rows that are in the first table and all of the rows that are in the second table, regardless of whether they are duplicates; with UNION DISTINCT, you will again get all of the rows from the two tables, but you will only keep unique rows, with no duplicates.

Now, we can see that these two tables share a row which is identical, the (1, true, yes) row over here and the same row over here. If I write UNION ALL, I expect the result to include this row twice, so let us verify that: you can see that here you have (1, true, yes), and at the end you also have (1, true, yes), and in total you get four rows, which are all the rows in the two tables. However, if I do UNION DISTINCT, I expect to get three rows, and I expect this row to appear only once and not be duplicated. Again, you need to make sure you're not selecting any little part of your script before you run it, so that the whole script is run. And as you can see, we have three rows and no duplicates.

Now, it's interesting that BigQuery actually forces you to choose between ALL and DISTINCT, because in many SQL systems, for your information, you are able to write UNION without any qualifier, and in that case it means UNION DISTINCT. So in other SQL systems, when you write UNION, it is understood that you want UNION DISTINCT, and if you actually want to keep the duplicate rows, you explicitly write UNION ALL; but in BigQuery you always have to say explicitly whether you want UNION ALL or UNION DISTINCT.

Now, the reason this command is called UNION, and not stack or something else, is that this is set terminology: it comes from the mathematical theory of sets, which you might remember from school.
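The toy comparison above can be sketched like this; the shared (1, true, 'yes') row is from the lecture, while the other rows are invented for illustration:

```sql
-- Assume:
--   toy_1 rows: (1, true, 'yes'), (2, false, 'no')    -- second row invented
--   toy_2 rows: (1, true, 'yes'), (3, true, 'maybe')

-- UNION ALL keeps duplicates: four rows, with (1, true, 'yes') appearing twice.
SELECT * FROM fantasy.toy_1
UNION ALL
SELECT * FROM fantasy.toy_2;

-- UNION DISTINCT removes duplicates: three unique rows.
SELECT * FROM fantasy.toy_1
UNION DISTINCT
SELECT * FROM fantasy.toy_2;
```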
The idea is that a table is simply a set of rows: this table over here is a set of two rows, and this table over here is a set of two rows, and once you have two sets, you can do various set operations between them. The most common operation we do in SQL is unioning, and unioning means combining the values of two sets. You might remember from school the Venn diagram, which is a typical way to visualize the relations between sets. In this simple Venn diagram we have two circles, A and B, representing two sets; in our case, A represents the collection of rows in the first table and B represents all the rows in the second table. So what does it mean to union these sets? It means taking all of the elements that are in either set, so taking all of the rows that are in both tables together.

And what is the difference here between UNION DISTINCT and UNION ALL? You can see that the rows of A are this part over here plus this part over here, and the rows of B are this part over here plus this part over here, so when we combine them, we're actually counting the intersection twice; we are counting this part twice. So what do you do with this double counting, keep it or discard it? If you do UNION ALL, you keep it, so rows that are in common between A and B are duplicated, and you see them twice; if you do UNION DISTINCT, you discard it, and you won't have any duplicates in the results. That's one way to think about it in terms of sets.

But we also know that union is not the only set operation; there are others, and a very popular one is the intersect operation. The intersect says: take only the elements that are in common between these two sets. So can we do that in SQL? Can we say: give me only the rows that are in common between the two tables? The answer is yes, and if we go back here, we can, instead of UNION, write INTERSECT, and then DISTINCT. And what do you expect to see after I run this command? Take a minute to think about it. What I expect to see is only the rows that are shared between the two tables. There is one row shared between these two tables, the (1, true, yes) row which we have seen, and if I run this, I get exactly that row. So INTERSECT DISTINCT gives me the rows that are shared between the two tables; and I have to write INTERSECT DISTINCT, I cannot write INTERSECT ALL, because that is not going to work here.

And here's another set operation you might consider, which is subtraction. What if I told you: give me all of the elements in A except the elements that A shares with B? What would that look like on the drawing? It would look like this: taking all of the elements that are in A except these ones over here, because they are in A but also in B, and I don't want the elements shared with B. And yes, I can also do that in SQL: I can come here and say give me everything from toy_1 EXCEPT DISTINCT everything from toy_2, and what this means is that I want all of my rows from toy_1 except the rows that are shared with toy_2. So what do you expect to see when I run this? Let's hit run. I expect to see only this row over here, because the other row is actually shared with B, and this is what I get. Again, you have to write EXCEPT DISTINCT; you cannot write EXCEPT ALL, because that is not going to work here either. And keep in mind that, unlike the previous two operations, union and intersect, the except operation is not symmetric: if I swap the tables over here, I actually expect to get a different result; I expect to see this row over here selected, because I'm saying give me everything from toy_2 except the rows that are shared with toy_1. So let us run this and make sure, and in fact I get the (3, true, maybe) row.
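The other two set operators use the same syntax as UNION; a sketch with the same toy tables:

```sql
-- Rows present in BOTH tables: just the shared (1, true, 'yes') row.
SELECT * FROM fantasy.toy_1
INTERSECT DISTINCT
SELECT * FROM fantasy.toy_2;

-- Rows of toy_1 that do NOT appear in toy_2. EXCEPT is not symmetric:
-- swapping the two tables would instead return toy_2's unshared rows.
SELECT * FROM fantasy.toy_1
EXCEPT DISTINCT
SELECT * FROM fantasy.toy_2;
```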
So be careful: the EXCEPT operation is not symmetric, and the order in which you put the two tables matters. That was a short overview of UNION, INTERSECT, and EXCEPT, and I will link the BigQuery documentation on this here; you can see that they are actually called set operators. In real life you almost always see UNION; very rarely will you see somebody using INTERSECT or EXCEPT, and a lot of people don't even know about them. But I think it was worth briefly looking at all three, and it's especially good for you to get used to thinking about tables as sets of rows and about SQL operations in terms of set operations; that will also come in handy when we study joins.

But let us quickly go back to our toy example, because there are two essential prerequisites for you to be able to do a union, or any of the set operations: number one, the tables must have the same number of columns, and number two, the columns must have the same data types. As you can see here, we are combining toy_2 and toy_1, and both of them have three columns; the first column is an integer, the second is a Boolean, and the third is a string, in both tables, and this is how we are able to combine them. So what would happen if I went to the first table, got only the first two columns, and then tried to combine it? You guessed it: I would get an error, because I have a mismatched column count. If I want to select only the first two columns in one table, I need to select only the first two columns in the other table as well, and then the union will work.

Now, what would happen if I messed up the order of the columns? Let's say that here I select col_1 and col_3, and here I select col_1 and col_2. Let me run this, and I get an error because of incompatible types, STRING and BOOL. What's happening here is that SQL is trying to take the values of col_3 over here and put them into col_2 over here; it's trying to put a string into a Boolean column, and that simply doesn't work, because, as you know, SQL enforces strict types on columns. But of course I could select col_3 here as well, and then we would again have a string column going into a string column, and that works. So, to summarize: you can union, intersect, or except any two tables, as long as they have the same number of columns and the columns have the same data types.

Let us now illustrate a union with a more concrete example. We have our items table here and our characters table here; the items table represents magical items, while the characters table, which we're familiar with, represents actual characters. Let's say that you are managing a video game, and someone asks you for a single table that contains all of the entities in that video game, where the entities include both characters and items; you want to create a table which combines these two tables into one. We know we can use UNION to stack all the rows, but we cannot directly union these two tables, because they have a different schema: a different number of columns, and columns with different data types. But let's analyze what these two tables have in common and how we could combine that. First of all, they both have an ID, and in both cases it's an integer, so that's already pretty good. They both have a name, and in both cases the name is a string, so we can combine that as well. The item type can be thought of as similar to the class; each item has a level of power, expressed as an integer, and each character has a level of experience, also expressed as an integer, and you can think of those as kind of similar. And finally, we have a timestamp field representing a moment in time in both tables, the columns date_added and last_active. So, looking at these columns that the two tables have more or less in common, we can find a way to combine them.
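The two compatibility rules above can be sketched with the toy tables (column names col_1, col_2, col_3 as in the lecture):

```sql
-- Works: same column count, and the types line up position by position.
SELECT col_1, col_2 FROM fantasy.toy_1
UNION ALL
SELECT col_1, col_2 FROM fantasy.toy_2;

-- Fails: col_3 (STRING) would be stacked under col_2 (BOOL),
-- so BigQuery reports incompatible types:
-- SELECT col_1, col_3 FROM fantasy.toy_1
-- UNION ALL
-- SELECT col_1, col_2 FROM fantasy.toy_2;
```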
Here's how we can translate this into SQL: I went to the fantasy items table and selected the columns that I wanted, and then I went to the characters table and selected the columns I wanted to combine with those, in the right order. So we have id with id, name with name, class with item_type, level with power, and last_active with date_added. I have my columns, they're in the right order, I wrote UNION DISTINCT, and if I run this, you will see that I have successfully combined the rows from these two tables: by finding out which columns they have in common, writing them in the right order, and adding UNION DISTINCT.

Now, all the columns that we've chosen for the combination have the same type, but what would happen if I wanted to combine two columns that are not actually the same type? Let's say we wanted to combine rarity, which is a string, with experience, which is an integer. As you know, I cannot do this directly, but I can get around it by either taking rarity and turning it into an integer, or taking experience and turning it into a string; I just have to make sure that they both end up with the same data type. The easiest way is usually to turn the other data type into a string, because you can always turn a value into text. So let's say that for the sake of this demonstration we take experience, which is an integer, and turn it into a string, which is text, and then combine that with rarity. I will go back to my code and make some room over here, and here in items I will add rarity, and here in characters I will add experience, and you can see that I already get an error saying that the UNION DISTINCT has incompatible types, just as expected. So what I want to do is take experience and turn it into a string, and I can do that with the CAST function: I can write CAST(experience AS STRING), and what this does is take these values and convert them to strings. If I run this, you can see that it has worked.

So we combined two tables into one, and the result is a single table. It has a column called rarity; the reason it's called rarity is that it takes the name from the first table in the operation, but we could of course rename it to whatever we need. And this is now a text column, because we have combined a text column with another text column, thanks to the casting function. What we see here are a bunch of numbers, which came originally from the experience column in the characters table but are now converted to text, and if I scroll down, I will also see the original values of rarity from the items table.

Finally, let us examine UNION in the context of the logical order of SQL operations. You can see here that we have our logical map, but it looks a bit different than usual, and the reason is that we are considering what happens when you union two tables: the blue represents one table and the red represents the other. I wanted to show you that all of the ordering we have seen until now (first get the table, then filter with WHERE, then select the columns you want, and, if you want, use DISTINCT to remove duplicates) happens in the same order, separately, for each of the two tables that you are unioning, and this applies to all of the other operations, like joining and grouping, which we will see later in the course. So at first the two tables are running on two separate tracks, with SQL doing all these operations on each of them in this specific order, and only at the end, after all of these operations have run, comes the union, where the two tables are combined into one. Only after the tables have been combined do you apply the last two operations, which are ORDER BY and LIMIT. And actually nothing forces you to combine only two tables; you could have any number of tables combined in a union.
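Putting the items-and-characters example together, the combined query might look like this; the column pairing follows the lecture, and the result takes its column names from the first query:

```sql
-- Stack items and characters into one list of "entities".
-- experience (an integer) is cast to STRING so it can sit under rarity (a string).
SELECT id, name, item_type, power, rarity, date_added
FROM fantasy.items
UNION DISTINCT
SELECT id, name, class, level, CAST(experience AS STRING), last_active
FROM fantasy.characters;
```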
But then the logic doesn't change at all: all of these operations happen separately for each of the tables, and only when they are all done, when all of the tables are ready, are the tables combined into one. If you think about it, this makes a lot of sense, because first of all you need the SELECT to have run in order to know the schema of the tables that you are combining, and you also need to know whether DISTINCT has run on each table, because you need to know which rows you are combining in the union. And that is all you need to know to get started with UNION, this very powerful statement that allows us to combine rows from different tables.

Let us now look at ORDER BY. I'm looking at the characters table here, and as you can see, we have an id column that goes from 1 to 15, assigning an ID to every character, but you will see that the IDs don't appear in any particular order. In fact, this is a general rule for SQL: there is absolutely no order guarantee for your data. Your data is not stored in any specific order, and your data is not going to be returned in any specific order. The reason for this is fundamentally one of efficiency: if we always had to make sure that our data was perfectly ordered, that would add a lot of work, a lot of overhead, to the engine that makes the queries work, and there's really no reason to do this. However, we do often want to order our data when we are querying it; we want to order the way it is displayed, and this is why the ORDER BY clause exists.

So let us see how it works. I am selecting everything from fantasy.characters, and again I'm going to get the results in no particular order; but let's say I wanted to see them ordered by name. Then I would write ORDER BY name, and as you can see, the rows are now ordered alphabetically by name. I could also invert the order by writing DESC, which stands for descending, meaning descending alphabetical order, from the last letter in the alphabet to the first. I can of course also order by numeric columns, such as level, and we would see the level increasing here; and that too could be descending, to go in the opposite direction. The corresponding keyword is ASC, which stands for ascending, and this is actually the default behavior, so even if you omit it, you will get the same result, going from the smallest to the largest.

I can also order by multiple columns, so I could say ORDER BY class and then level. What that looks like is that first of all the rows are ordered by class, alphabetically, so first Archer and last Warrior; and then within each class the rows are ordered by level, going from the smallest to the largest. And I can invert the order for just one of them, for example class, in which case we start with Warriors, but within the Warrior class we still order level in ascending order. So for every column in the ordering I can decide whether it is sorted in ascending or descending order.

Now let us remove this and select the name and the class; once again I get my rows in no particular order, and I'm seeing the name and the class. I wanted to show you that you can also order by columns which you have not selected: I could order these rows by level, even though I'm not looking at level, and it works all the same. And finally, I can also order by operations, so I could say take level, divide it by experience, and then multiply that by two, for some reason, and that would also work for the ordering; even though I am not seeing that calculation, it is being done in the background and used for the ordering. I could actually take this expression, copy it, create a new column, and call it calc, for calculation, and if I show you this, you will see that the results are not very meaningful, but they are in ascending order: we have ordered by that calculation.
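The ordering variations just discussed, sketched together:

```sql
-- Multi-column ordering: classes Z to A, then level ascending within each class.
SELECT *
FROM fantasy.characters
ORDER BY class DESC, level ASC;  -- ASC is the default and could be omitted

-- You can also order by a column or an expression you did not select:
SELECT name, class
FROM fantasy.characters
ORDER BY level / experience * 2;
```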
Sometimes you will also see a notation like ORDER BY 2, 1. As you can see, what we've done here is ordered by class first of all, because we're starting with Archers and going to Warriors, and then within each class we are ordering by name, also in ascending order. This notation refers to the columns referenced in the SELECT: 2 means order by the second column that you have referenced, which in this case is class, and 1 means order by the first column that you referenced. It's basically a shortcut that people sometimes use to avoid rewriting the names of columns they have already selected.

And finally, when we go back to the order of operations, we can see that ORDER BY happens really at the end of the whole process. As you will recall, I created this slightly more complex diagram to show what happens when we union different tables together: all these operations run independently on each table, and then finally the tables get unioned together. After all of this is done, SQL knows the final list of rows that will be included in the results, and that's the right moment to order those rows; it would not be possible to do it before, so it makes sense that ORDER BY is located here.

Let us now look at the LIMIT clause. What I have here is a simple query: it goes to the characters table, filters for the rows where the character is alive, and then gets three columns. Let's run this query, and you can see that it returns 11 rows. Now let us say that I only wanted to see five of those rows; this is where LIMIT comes into play. LIMIT looks at the final results and picks five rows out of them, reducing the size of my output, and here you can see that we get five rows. Now, as we said in the lecture on ordering, by default there is no guarantee of order in a SQL system, so when you are getting all your data with a query and then you run LIMIT 5 on top of it, you have no way of knowing which of the rows will be selected to fit amongst those five; you're basically saying that you're okay with getting any five of the rows in your result. Because of this, people will often use LIMIT in combination with ORDER BY: for example, I could say ORDER BY level and then LIMIT 5, and what I would get is essentially the five most inexperienced characters in my data set.

Now let us say that you have the problem of finding the least experienced character in your data, the character with the lowest level. Of course, you could say ORDER BY level and then LIMIT 1, and you would get a character with the lowest level, and this works; however, it is not ideal, there is a problem with this solution. Can you figure out what the problem is? It becomes obvious once I go back to LIMIT 5 and look here: I actually have two characters which share the lowest level in my data set, so in theory I should return both of them, because they both have the lowest level. However, when I write LIMIT 1, it simply cuts the rows in my output, and it is unaware of the further information that is here in this second row. In later lectures we will see how we can solve this better and get results which are more precise.

And if we look at the logical order of operations, we can see that LIMIT is the very last operation: all of the logic of our query is executed, all our data is computed, and then, based on that final result, we sometimes decide not to output all of it but only a limited number of rows. So a common mistake for someone who is starting with SQL is thinking that they can use LIMIT in order to have a cheaper query.
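The ORDER BY plus LIMIT pattern discussed above, as a sketch; the three selected columns here are my assumption:

```sql
-- Without ORDER BY, LIMIT 5 returns an arbitrary five rows;
-- with it, we get five of the lowest-level living characters.
SELECT name, class, level
FROM fantasy.characters
WHERE is_alive = true
ORDER BY level
LIMIT 5;
```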
I will put limit 20 because I only want to see the first 20 rows and that will means that I will only scan 20 rows and my query will be very cheap right no that is actually wrong that doesn’t save you anything and you can understand this if you look at the map because all of the logic is going to execute before you get to limit so you’re going to scan the whole table when you say select star and you’re going to apply all of the logic and the limit is actually just changing the way your result is displayed it’s not actually changing the way the your result is computed if you did want to write your query so that it scans less rows one thing you should do is focus on the where statement actually because the where statement is the one that runs in the beginning right after getting the table and it is able to actually eliminate rows which usually saves you on computation and money and so on however I do need to say that there are systems where writing limit may actually turn into savings because different systems are optimized in different ways and um allow you to do different things with the commands but as a rule usually with SQL limit is just changing the way your result is displayed and doesn’t actually change anything in the logic of execution let us now look at the case clause which allows us to apply conditional logic in SQL so you can see here a simple query I am getting the data from the characters table I am filtering it so that we only look at characters who are alive and then for each character we’re getting the name and the level now when you have a column that contains numbers such as level one typical thing that you do in data analysis is bucketing and bucketing basically means that I look at all these multiple values that level can have and and I reduce them to a smaller number of values so that whoever looks at the data can make sense of it uh more easily now the simplest form of bucketing that you can have is the one that has only two buckets right so 
looking at level, our two buckets could be: in one bucket we put values that are greater than or equal to 20, so characters whose level is at least 20, and in the other bucket we put all characters whose level is less than 20. How could I define those two buckets? We know we can define new columns in the SELECT statement, using calculations and logical statements, so one thing I could do is write level >= 20 and call this new column level_at_least_20, for example. When I run this, I get my column. Of course this is a logical statement, so for each row it will be true or false, and you can see that our new column gives us true or false on every row. This is a really basic form of bucketing: level takes 11 different values in our data, which can be complicated to look at all at once, and we have reduced those 11 values to two buckets, so our data is better organized and easier to read. But there are two limitations with this approach. First, I might not want to call my buckets true and false; I might want to give them more informative names, such as experienced and inexperienced. Second, with this approach I can effectively only divide my data into two buckets, because once I write a logical statement, it is either true or false, so my data gets divided in two; but often I want to use multiple buckets. Bucketing is a typical use case for the CASE WHEN statement, so let's see it in action. Let me first write a comment, not actual code, where I define what I want to do, and then I will do it with code. I have written here the buckets I want to use to classify the characters' level: up to 15, they are considered
low experience; between 15 and 25, they are considered mid; and anything above 25 we will classify as super. Now let us apply the CASE clause to make this work. The CASE clause is always bookended by two parts, CASE and END: it starts with CASE and it ends with END, and a typical error when you're getting started is to forget the END part, so my recommendation is to always write both of these first and then fill in the middle. In the middle we define all the conditions we're interested in. Each condition starts with the keyword WHEN and is followed by a logical condition; ours here is level < 15. Then we have to define what to do when the condition is true, which follows the keyword THEN: when this condition is true, we want to return the value 'low', a string, a piece of text that says low. Next we proceed with the following condition: WHEN level >= 15 AND level < 25. If you have trouble understanding this logical statement, I suggest you go back to the lecture on Boolean algebra, but what we have here are two smaller statements, level >= 15 and level < 25, connected by AND, which means both have to be true for the whole statement to be true, which is what we want here. And what do we return in this case? The value 'mid'. The last condition: WHEN level >= 25, THEN we return 'super'. All of this, the CASE clause (or CASE statement), is basically defining a new column in my table, and given that it's a new column, I can use the alias syntax to give it a name: I'll call it level_bucket. Let's run this and see what we get. As you can see, we have our level_bucket: the characters above 25 are super, then we have a few mids, and everyone under 15 is low, so we got the result we wanted.

Now let us see exactly how the CASE statement works. I'm going to take Gandalf, who has level 30, so I'll write level = 30, because we're looking at the first row and that is its value of level. Then I'll take the conditions of the CASE statement we're examining and add them here as a comment. Because in our first row level equals 30, I substitute 30 for level everywhere. What we now have is a sequence of logical statements, and we have seen how to work with these in the lecture on Boolean algebra. Our job is to go through each of them in turn and evaluate them, and as soon as we find one that's true, we stop. The first one is 30 < 15, which is false, so we continue. The second one is a more complex statement: 30 >= 15 is true, AND 30 < 25 (oops, I did not substitute it there, but I will now) is false, and we know from Boolean algebra that true AND false evaluates to false, so the second statement is also false and we continue. Now we have 30 >= 25, which is true: we finally found a line that evaluates to true, which means we return the value 'super', and as you can see, for Gandalf we have indeed gotten super. Let us look very quickly at one more example: Legolas, who is level 22. Once again I copy the whole thing as a comment and substitute 22 for every occurrence of level, because that's the row we're looking at. The first condition, 22 < 15, is false, so we proceed. The second: 22 >= 15 is true, and 22 < 25 is also true, so we get true AND true, which evaluates to true, so we return 'mid', and indeed for Legolas we get mid. So this is how the CASE WHEN statement works in short: for each row, you insert the values that correspond to that row, in this case the value of level, then you evaluate each of the logical conditions in turn, and as soon as one of them returns true, you return the value that corresponds to that condition and move on to the next row.

Now I will clean this up a bit. Looking at this statement, and knowing how it works, can we think of a way to optimize it, to make it nicer and remove redundancies? Think about it for a minute. One thing we could do is remove the level >= 15 part, because if you think about it, that part is making sure the character is not under 15 before classifying it as mid, but the first condition already guarantees that: if the character is under 15, the statement outputs low and moves on, so we never reach the second condition; and if we do reach the second condition, we already know the character is not under 15. This is because CASE WHEN proceeds condition by condition and exits as soon as a condition is true. So I can remove that part and, in the second condition, only check that the level is below 25, and if you run this you will see that our bucketing works just the same. The other improvement is to replace the last line with an ELSE clause. The ELSE clause takes care of all the cases that did not meet any of the conditions we specified: the CASE statement goes condition by condition looking for one that's true, but if none of them were true, it returns what the ELSE clause says. It's a fallback for when none of our conditions turned out to be true. And if you look at our logic, you will see that if the first condition has returned false and the second has returned false, all
that's left is characters whose level is 25 or greater, so it is sufficient to use an ELSE and call those super. If I run this, you will see that our bucketing works just the same: for example, Gandalf is still marked super, because in his case the first condition returned false, the second returned false, and so the ELSE output was written there. Now, what do you think would happen if I completely removed the ELSE? If I only had two conditions, and it can be the case that neither is true, what will SQL do? Let us try it and see. The typical response in SQL when it doesn't know what to do is to produce the NULL value, and if you think about it, that makes sense: we have specified what happens when level is below 15 and when level is below 25, but we haven't specified what to do when neither of those is true, and because we have been silent on the issue, SQL has no choice but to put a NULL value there. This is practically equivalent to writing ELSE NULL; it is SQL's default behavior when you don't specify an ELSE clause. Like every other piece of SQL, the CASE statement is quite flexible. For instance, you are not forced to create a text column; you can also create an integer column, so you could define a simpler leveling system for your characters by using 1 and 2, ELSE 3 for the higher-level characters, and this of course also works, as you can see here. However, one thing you cannot do is mix types, because the CASE expression results in a single new column, and as you know, in SQL you are not allowed to mix types within a column, so always keep it consistent when it comes to typing. And when it comes to writing the WHEN condition, all the computational power of SQL is available: you can reference columns that you are not selecting, you can run calculations as I am doing here, and you can chain logical (Boolean) statements in complex ways. You can really do anything you want, although I generally suggest keeping it as simple as possible, for your sake and for the sake of the people who use your code.

And that is really all you need to know to get started with the CASE statement. To summarize: the CASE statement allows us to define a new column whose values change conditionally on the other values of the row. This is called conditional logic, which means we consider several conditions and behave differently based on which condition is true. The way it works is that in the SELECT statement, where you mention all your columns, you create a new column, bookended by CASE and END, and between those you write your conditions. Every condition starts with WHEN, is followed by a logical statement that must evaluate to true or false, then has the keyword THEN and a value. The CASE WHEN statement goes through each condition in turn, and as soon as one evaluates to true, it outputs the value you specified; if none of the conditions evaluate to true, it outputs the value specified by the ELSE keyword, and if the ELSE keyword is missing, it outputs NULL. That is what you need to use the CASE statement; experience, exercises, and coding challenges will teach you when it's the case to use it, pun intended. Now, where does the CASE statement fit in our logical order of SQL operations? The short answer is that it is defined at the step where you select your columns: that's when you can use CASE WHEN to create a new column that applies your conditional logic. This is the same as what we've shown in the lecture on SQL calculations: you can use the SELECT statement not only to get columns that already exist but to define
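The full CASE WHEN pattern from this section, including the simplified second condition, the ELSE fallback, and the NULL default when ELSE is omitted, can be sketched in SQLite (a stand-in for the BigQuery table used in the lecture; the sample rows are assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE characters (name TEXT, level INTEGER)")
con.executemany("INSERT INTO characters VALUES (?, ?)",
                [("Gandalf", 30), ("Legolas", 22), ("Pippin", 5)])

# The optimized bucketing: the second WHEN no longer re-checks level >= 15,
# and ELSE is the fallback for every row at 25 or above.
bucketed = con.execute("""
    SELECT name,
           CASE
               WHEN level < 15 THEN 'low'
               WHEN level < 25 THEN 'mid'
               ELSE 'super'
           END AS level_bucket
    FROM characters
""").fetchall()
assert bucketed == [("Gandalf", "super"), ("Legolas", "mid"), ("Pippin", "low")]

# With no ELSE, a row that matches no condition gets NULL (None in Python).
no_else = con.execute("""
    SELECT CASE WHEN level < 15 THEN 'low'
                WHEN level < 25 THEN 'mid' END
    FROM characters WHERE name = 'Gandalf'
""").fetchall()
assert no_else == [(None,)]
```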
new columns based on calculations and logic.

Now let us talk about aggregations, which are a staple of any sort of data analysis. An aggregation is a function that takes any number of values and compresses them down to a single informative value. I'm looking here at my usual characters table, but this is the version I have in Google Sheets, and as you know, we have this level column containing the level of each character. If I select this column in Google Sheets, you will see in the bottom right corner a number of aggregations on this column, and like I said, no matter how many values there are in the level column, I can use aggregations to compress them to one value. Here you see some of the most important aggregations you will work with: the sum, which adds up all the values; the average, which takes the sum and divides by the number of values; the minimum; the maximum; the count; and the count of numbers, which is the same here. These are summaries of my column, and you can imagine, in cases where you have thousands or millions of values, how useful these aggregations are for understanding your data. Here's how I can get the exact same result in SQL: I simply use the functions that SQL provides for this purpose. As you can see, I'm asking for the SUM, AVG, MIN, MAX, and COUNT of the column level, and you can see the same results down here. Of course I could also give names to these columns: for example, I could take this one and call it max_level, and in the result I get a more informative column name, and I can do the same for all columns. I can run aggregations on any columns I want, for example the maximum of experience, called max_experience, and I can also run aggregations on calculations that involve multiple columns as well as constants, so everything we've seen about applying arithmetic and logic in SQL also applies.

Now, of course, looking at the characters table we know that our columns have different data types, and the behavior of the aggregate functions is sensitive to the data types of the columns. For example, let us look at the text columns we have, such as class. Clearly, not all of the aggregate functions we've seen will work on class, because how would you take the average of those values? It's not possible. However, some aggregate functions also work on strings; here's an example of aggregate functions we can run on a string column such as class. First we have COUNT, which simply counts the total number of non-NULL values (I will give you a bit more detail about the COUNT functions soon). Then we have MIN and MAX: strings in SQL are ordered in something called lexicographic order, which is basically a fancy word for alphabetical order, so for MIN we get the text value that occurs earliest alphabetically, whereas 'Warrior' occurs last. And finally, here's an interesting one called STRING_AGG. This function takes two arguments: the first, as usual, is the name of the column, and the second is a separator. What it outputs is a single string, a single piece of text, where all of the individual pieces of text have been glued together, separated by the character we specified, in our case a comma. If you go to the Google documentation, you will find an extensive list of all the aggregate functions you can use in Google SQL, including the ones we've just seen, such as AVG or MAX, as well as a few others that we will not explore in detail here. Let us select one of them, AVG, and see what the description looks like. You can see that this function returns the average of all values that are not NULL; don't worry about the expression "in an aggregated group" for now, just think of it as meaning all the values you provide to the function, all the values in the column. There is a bit about window functions, which we will see later, and in the caveats section there are some interesting edge cases: for example, if you use AVG on an empty group, or if all values are NULL, it returns NULL, and so on; you can see what the function does when it hits these edge cases. And here is perhaps the most important section, supported argument types, which tells you what type of columns you can use this aggregation function on. You can see that you can use AVG on any numeric input type, any column containing some kind of number, and also on INTERVAL. We haven't examined INTERVAL in detail, but it is a data type that specifies a span of time: an interval could express something like 2 hours, or 4 days, or 3 months; it is a quantity of time. Finally, in the returned data types table, you can see what the AVG function gives you based on the data type you put in. If you give it an integer column, it returns a float column, which makes sense, because the average involves a division, and that division will usually produce floating-point values; for the other allowed input types, such as NUMERIC and BIGNUMERIC, which are all data types representing numbers in BigQuery, the AVG function, as you can see here, will preserve that data type. And finally we have some examples. So whenever you need an aggregate function, that is, whenever you need to take a sequence of multiple values and compress them down to one value, but you're not sure which function to use or how it behaves, you can come to this page, look up the functions that interest you, and read the documentation to see how they work. Now here's an error that typically
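As a rough sketch of these aggregate behaviors, here is a SQLite session (a stand-in for BigQuery; SQLite's counterpart of BigQuery's STRING_AGG is group_concat, and the sample rows are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE characters (name TEXT, class TEXT, level INTEGER)")
con.executemany("INSERT INTO characters VALUES (?, ?, ?)",
                [("Gandalf", "Mage", 30), ("Legolas", "Archer", 22),
                 ("Pippin", "Hobbit", 5)])

# Numeric aggregations compress the whole level column to single values.
numeric = con.execute(
    "SELECT SUM(level), AVG(level), MIN(level), MAX(level), COUNT(level) "
    "FROM characters"
).fetchone()
assert numeric == (57, 19.0, 5, 30, 3)

# MIN/MAX on strings follow lexicographic (alphabetical) order;
# group_concat glues all the values into one separated string.
strings = con.execute(
    "SELECT MIN(class), MAX(class), group_concat(class, ', ') FROM characters"
).fetchone()
assert strings[0] == "Archer" and strings[1] == "Mage"
assert sorted(strings[2].split(", ")) == ["Archer", "Hobbit", "Mage"]
```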
occurs when starting out with aggregations. You might say: I want to get the name of each character and their level, but I also want to see the average of all levels, because I want to compare my character's level with the average over all levels. So I write a query that looks like this: go to the fantasy.characters table, then SELECT name, level, and AVG(level). But as you can already see, this query is not functioning; it gives me an error, and the error says that the SELECT list expression references column name, which is neither grouped nor aggregated. What does this actually mean? To show you, I've gone back to my Google Sheets, where I have the same characters data, and I have copy-pasted our query over here. What this query does is: it takes the name column, so I copy-paste it here; it takes the level column, copy-pasted here as well; and it computes the average over level. I can easily compute this with a sheet formula by writing = and calling the function, which is actually called AVERAGE, then selecting all these values, and I get the average. This is the result SQL computes, but SQL is not able to return it, and the reason is that there are three columns with mismatched numbers of values: these two columns have 15 values each, whereas this column has a single value. SQL cannot handle this mismatch, because as a rule every SQL query needs to return a table, and a table is a series of columns where each column has the same number of values. If that constraint is not respected, you get an error. We will come back to this limitation when we examine advanced aggregation techniques, but for now just remember: you can mix non-aggregated columns with other non-aggregated columns, such as name and level, and you can mix aggregated columns with other aggregated columns, such as AVG(level) with SUM(level). So I could simply do that, and I would be able to return it as a table, because as you can see there are two columns, both with a single row; the number of rows matches, and this is valid. But you might ask: can't I simply take this single value and copy it into every row, until AVG(level) has the same number of values as name and level, and so return a table and respect the constraint? Indeed, this is possible, you can totally do it, and then the whole thing would become a single valid table and you could return this result. However, this requires the use of window functions, a feature we will see in later lectures; but yes, it is totally possible, and it does solve the problem. Now here's a special aggregation expression you should know about, because it is often used: COUNT(*). COUNT(*) simply counts the total number of rows in a table, and as you can see, if I say FROM fantasy.characters SELECT COUNT(*), I get the total count of rows in my result. This is a common expression used across all SQL systems to figure out how many rows a table has, and of course you can also combine it with filters, with the WHERE clause, to get other kinds of measures. For example, I could say WHERE is_alive = true, and then the count becomes the count of characters who are alive in my data. So this is a universal way to count rows in SQL, although you should know that if you're simply interested in the total number of rows of a table and you are working with BigQuery, an easy and totally free way to get it is to go to the Details tab and look at the number of rows there. This was all I wanted to tell you about simple aggregations for now. One last question: why do we call them simple, simple as opposed to what? I call them simple because, the way we've seen them until now, the aggregations take all of
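The COUNT(*) pattern, and the window-function trick of copying the aggregate value onto every row, can be sketched like this (SQLite stand-in with made-up rows; window functions require SQLite 3.25 or later):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE characters (name TEXT, level INTEGER, is_alive INTEGER)")
con.executemany("INSERT INTO characters VALUES (?, ?, ?)",
                [("Gandalf", 30, 1), ("Legolas", 22, 1), ("Boromir", 18, 0)])

# COUNT(*) counts rows; adding WHERE turns it into a filtered count.
total = con.execute("SELECT COUNT(*) FROM characters").fetchone()[0]
alive = con.execute(
    "SELECT COUNT(*) FROM characters WHERE is_alive = 1"
).fetchone()[0]
assert (total, alive) == (3, 2)

# The window-function fix: AVG(level) OVER () stamps the single aggregate
# value onto every row, so all columns have the same number of values.
rows = con.execute(
    "SELECT name, level, AVG(level) OVER () AS avg_level FROM characters"
).fetchall()
assert len(rows) == 3
assert all(abs(r[2] - 70 / 3) < 1e-9 for r in rows)
```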
the values of a column and simply return one summary value. For example, the SUM aggregation takes all of the values of the level column and returns a single number, the sum of all levels. More advanced aggregations involve grouping our data: for example, we might ask, what is the average level for mages, as opposed to the average level for archers, and for hobbits, and for warriors, and so on? Then you're computing aggregations not over your whole data but over groups that you find in your data, and we will see how to do that in the lecture on GROUP BY. But for now, you can already find out a lot of interesting things about your data by running simple aggregations.

Let us now look at subqueries and common table expressions, two fundamental functionalities in SQL. They solve a very specific problem, which is the following: sometimes you just cannot get the result you require with a single query; sometimes you have to combine multiple SQL queries to get where you need to go. So here's a fun problem that will illustrate my point. We're looking at the characters table, and we have this requirement: we want to find all characters whose experience is between the minimum and the maximum value of experience. Another way to say this: we want characters who are more experienced than the least experienced character, but less experienced than the most experienced character; in other words, we want to find that middle ground between the least and the most experienced characters. Let us see how we could do that. I have here a simple start, where I get the name and experience columns from the characters table. Let us focus on the first half of the problem: find characters who have more experience than the least experienced character. Because this is a toy dataset, I can sort of eyeball it: I can scroll down and see that the lowest value of experience is Pippin with 2,100, and so
what I need to do now is filter out of this table all the rows that have that level of experience. But apart from eyeballing, how would we find the lowest level of experience in our data? If you thought of aggregate functions, you are right: we have seen in a previous lecture that aggregate functions take any number of values and spit out a single summary value, for example the minimum and the maximum, and indeed we need a function like that for this problem. Your first instinct might be to filter the rows of this table like this: WHERE experience > MIN(experience). On the surface this makes sense: I am using an aggregation to get the smallest value of experience, and then keeping only rows with a higher value. However, as you see from this red line, it does not work; it tells us an aggregate function is not allowed in the WHERE clause. So what is going on? If you followed the lecture on aggregation, you might have a clue as to why this fails, but it is good to go back and understand exactly what the problem is. I'm going back to my Google Sheet, where I have the exact same data, and I've copied our current query down here. Now let's see what happens when SQL tries to run it. SQL goes to the fantasy.characters table, and the second step in the logical order, as you remember, is to filter it. For the filter, it has to take the experience column, so let me copy that column down here, and then it has to compute MIN(experience), so I'll define that column here, using the Google Sheets function to get the result: =MIN, selecting the numbers, and I get the minimum value of experience. Now SQL has to compare these columns, but the comparison doesn't work, because these are two columns with a different number of rows, a different number of values. SQL cannot do an element-by-element comparison between a column with 15 values and a column with a single value, so SQL throws an error. You might say: wait, there is a simple solution, just take this single value and copy it all the way down until you have two columns of the same size, and then you can do the comparison. Indeed, that would work, that's a solution, but SQL doesn't do it automatically. If you work with other analytics tools, such as pandas in Python or NumPy, you will find that in a situation like this it is done automatically, the value is copied down to every row, in a process called broadcasting; but SQL does not take that many assumptions and risks with your data. If it literally doesn't work, SQL will not do it. Hopefully you now have a better understanding of why that solution does not work. So how could we actually approach this problem? An insight is that I can run a different query, which I will open on the right, to find out the minimum experience: I go back to the characters table and SELECT MIN(experience), which is simply what we learned in the lecture on aggregations, and I get the minimum value of experience here. Now that I know the minimum value, I could simply copy it and insert it into a WHERE filter, and if I run this, it will actually work; it solves my problem. The issue, of course, is that I do not want to hardcode this value: first, it is not very practical to run a separate query and copy-paste the value into the code, and second, the minimum value might change someday, and if I don't remember to update it, this whole query becomes invalid. To solve this problem, I will use a subquery: I simply delete the hardcoded value, open round brackets, which is the way to get started on a subquery, and I will take the query that I have
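As an aside, the finished pattern this is building toward, MIN and MAX subqueries inside the WHERE clause, can be sketched as follows (SQLite stand-in; the 15,000-experience "Galadriel" row is an assumed stand-in for the most experienced character, which the lecture does not name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE characters (name TEXT, experience INTEGER)")
con.executemany("INSERT INTO characters VALUES (?, ?)",
                [("Gandalf", 10000), ("Saruman", 8500),
                 ("Pippin", 2100), ("Galadriel", 15000)])

# Each scalar subquery runs first; its single value then replaces the
# brackets, so no hardcoded 2100 or 15000 appears in the outer query.
rows = con.execute("""
    SELECT name, experience
    FROM characters
    WHERE experience > (SELECT MIN(experience) FROM characters)
      AND experience < (SELECT MAX(experience) FROM characters)
""").fetchall()
assert sorted(rows) == [("Gandalf", 10000), ("Saruman", 8500)]
```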
over here and put them put it in the round brackets and when I run this I get the result that I need so what exactly is going on here we are using a subquery or in other words a query within a query so when SQL looks at this code it says all right so this is the outer query right and it has a inner query inside it a nested query so I have to start with the innermost query I have to start with the nested query so let me compute this and so SQL runs this query first and then it gets a value out of it which in our case we know that is 2100 and after that SQL substitutes this code over here by the value that was computed and we know from before that this works as expected and to compute the other half of our problem we want our character to have less experience than the most experienced character so this is just another condition in the wear filter and so I can add an end here and copy this code over here except that now I want my experience to be smaller than the maximum of EXP experience in my table now you might know this trick that if you select only part of your code like this and then you click run SQL will only execute that part of the code and so here we get the actual maximum for our experience and we can write it here in the comment and now we know that when SQL runs this query all of these will be computed to 15,000 and then experience will will be compared on that and the query will work as intended and here is the solution to our problem now here’s the second problem which shows another side of subqueries we want to find the difference between a character’s experience and their mentors so let us solve it manually for one case in the characters table so let us look at this character over here which is Saran with id1 and their experience is 8500 and then Saruman has character id6 as their Mentor so if I look for id6 we have Gandalf this is not very Canon compared to the story but let’s just roll with it and Gandalf has 10,000 of experience and now if we 
select the experience of Gandalf minus the experience of Saran we can see that there is A500 difference between their experience and this is what I want to find with my query now back to my query I will first Alias my columns in order to make them more informative and this is a great trick trick to make problems clearer in your head assign the right names to things so here instead of ID I will call this mentee ID and here I have Mentor ID and here instead of experience I will call this Mente experience so I have just renamed my columns now the missing piece of the puzzle is the mentor experience right so how can I get the mentor experience for example in the first case I know that character 11 is mentored by character 6 how can I get the experience of character six now of course I can take a new tab over here split it to the right go to Fantasy characters filter for ID being equal to six which is the ID of our mentor and get their experience and the experience in this case is 10,000 this is the same example that we saw before but now I would have to write this separate query for each of my rows so here six I’ve already checked but then I will need to check two and seven and one and this is really not feasible right and the solution of course is to solve it with a subquery so what I’m going to do here is open round brackets and in here I will write the code that I need and here I can simply copy the code that I’ve written here get experience from the characters where ID equals six now the six part is still hardcoded because in the first row Mentor ID is six to avoid hardcoding this part there are two components to this the first one is noticing that I am referencing the same table fantasy. 
characters table in two different places in my code, which could get buggy and confusing. The solution is to give separate names to these two instances. What are the right names? If we look at the outer query here, this is really information about the mentee: we have the mentee's ID, the ID of their mentor, and the mentee's experience, so I can simply call this the mentee table. As you can see, I can alias my table just by writing a name after it, or I can add the AS keyword; it works just the same. The other instance, on the other hand, gives us the experience of the mentor, so we can call it the mentor table. Now we're not going to get confused anymore, because the two instances have different names.

Next, if we're not going to hardcode this ID, what do we want it to be? We want the mentor_id value from the mentee table, the mentee's mentor, and to refer to that column I write the table name, a dot, and the column name. This says: get the mentor_id value from the mentee table. Now that I have a subquery defining a column between these two brackets, I can alias the result just like I always do, and when I run this you will see, after making some room, that we have successfully retrieved the experience value for the mentor.

I realize this is not the simplest process, so let's go back to our query and make sure we understand exactly what is happening. First, we go to the characters table, which contains information about our mentee, the person being mentored, and we label the table so we remember what it's about. We filter it, because we're not interested in characters that do not have a mentor. Then we select a few columns: the ID, which here represents the ID of the mentee, their mentor_id, and the experience, which, since this table is about the mentee, represents the mentee's experience.

Our goal is to also get the experience of their mentor: when we see a mentor_id of 6, we want to know that that mentor's experience is 10,000. We do that with a subquery, a query within a query. In this subquery, which is an independent piece of SQL code, we go back to the characters table, but this is another instance of the table, so to remember that, we call it the mentor table, because it contains information about the mentor. And how do we make sure we get the right value, that we don't mix up separate mentors? We require that, for each row, the ID of the character in this table equals the mentor_id value in the mentee table. In other words, we plug this value, in this case 6, into the table to get the right row, and from that row we take the experience value. All of this code defines a new column, which we call mentor experience. This is the same thing we did manually when we opened the table on the right, queried it, and copy-pasted a hardcoded value; the subquery is just the way to do it dynamically.

Now, we're not fully done with the problem, because we wanted to see the difference between a character's experience and their mentor's. The way to do that is with a column calculation, just like the ones we've seen before. Given that this column represents the mentor's experience, I can remove the alias here, and here as well, and subtract the experience column from it. A column minus a column gives me another column, which I can then alias as experience difference, and if I run this I see the value we originally computed manually: the difference between the mentor's and the mentee's experience. There's nothing really new about this, as long as you realize that this expression defines a column, this is a reference to a column, and so you can subtract one from the other and give the result a name, an alias.
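The mentee/mentor pattern above can be sketched end to end. The following is a minimal, hypothetical sketch using Python's built-in sqlite3 with invented toy rows (the lecture's actual table lives in BigQuery); only the column names id, name, experience, and mentor_id follow the examples above:

```python
import sqlite3

# Toy data: each row is a character; mentor_id references another row's id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE characters (id INTEGER, name TEXT, experience INTEGER, mentor_id INTEGER)"
)
conn.executemany("INSERT INTO characters VALUES (?, ?, ?, ?)", [
    (4, "Frodo",   2000, 6),
    (6, "Gandalf", 10000, None),   # Gandalf has no mentor
    (2, "Aragorn", 6000, 6),
])

# Two aliased instances of the same table: "mentee" in the outer query,
# "mentor" inside the correlated subquery. The subquery runs per row,
# plugging in that row's mentor_id to fetch the mentor's experience.
rows = conn.execute("""
    SELECT mentee.id,
           mentee.mentor_id,
           mentee.experience,
           (SELECT mentor.experience
              FROM characters AS mentor
             WHERE mentor.id = mentee.mentor_id) - mentee.experience
               AS experience_difference
      FROM characters AS mentee
     WHERE mentee.mentor_id IS NOT NULL
     ORDER BY mentee.id
""").fetchall()

print(rows)  # [(2, 6, 6000, 4000), (4, 6, 2000, 8000)]
```

Note how `mentor.id = mentee.mentor_id` is the only link between the two instances; without the distinct aliases, that condition would be ambiguous.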
Now we can look at our two examples of nested queries side by side and ask what they have in common and where they differ. What they have in common is that both are problems you cannot solve with a simple query, because you need values that have to be computed separately, values you cannot simply refer to by name the way we usually do with our columns. In the case on the left, you need to know the minimum and maximum values of experience; in the case on the right, you need to know the experience of a character's mentor. We solve that problem by writing a new, nested query, and making sure SQL solves that query first, gets the result, and then plugs the result back into the original query to get the data we need.

There is, however, a subtle difference between these two queries that turns out to be quite important in practice, and I can give you a clue to what it is by telling you that on the right we have what's called a correlated subquery, while on the left we have uncorrelated subqueries. What does this really mean? On the left, our subqueries compute the minimum and maximum experience, and those are actually fixed values for all of our characters: it doesn't matter which character you're looking at, the whole data set has the same minimum and maximum experience. You could even imagine computing these values first, before running your query. For example, you could say minimum experience is the minimum and maximum experience is the maximum, and then imagine substituting those values here. This will not actually work, because you cannot define variables like this in SQL, but on a logical level you can imagine doing it, because you only need to compute these two values once. I will revert this here so we don't get confused.

On the right, by contrast, you will see that the value returned by this subquery needs to be computed dynamically for every row. As you can also see in the results, it is different for every row, because every row references a different mentor_id. So SQL cannot compute one value for all rows at once; it has to recompute it for every row. That is why we call it a correlated subquery: it is connected to the value in each row, and so it must run for each row. An important reason to distinguish between uncorrelated and correlated subqueries is that, as you can imagine, correlated subqueries are slower and more expensive to run, because, at least at the logical level, you are running a SQL query for every row.

So this was our introduction to subqueries. They allow you to implement more complex logic, and as long as you understand them logically, you're off to a great start; by doing exercises and solving problems, you will learn with experience when to use them. In the last lecture we saw that we could use subqueries to retrieve single values, for example the minimum value of experience in my data set, but we can also use subqueries, and Common Table Expressions as well, to create entirely new tables. So here's a motivating example. What I'm doing in this query is scaling the value of level based on the character's class, which you might need in order to create some balance in your game, or for whatever reason. If the character is a Mage, the level gets divided in half, multiplied by 0.5; if the character is an Archer or a Warrior, we take 75% of the level; and in all other cases the level gains 50%. The details are not very important, it's just an example, but the point is that we modify the value of level based on the character's class, and we do this with the CASE WHEN statement that we saw in a previous lecture.
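The class-based scaling just described can be sketched as follows. This is a toy sqlite3 example with invented characters and levels; the 0.5 / 0.75 / 1.5 multipliers follow the lecture, everything else is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, class TEXT, level REAL)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)", [
    ("Gandalf", "Mage",   20),
    ("Legolas", "Archer", 16),
    ("Gimli",   "Rogue",  10),
])

# CASE WHEN computes a new column: Mages halved, Archers/Warriors at 75%,
# everyone else gains 50%.
rows = conn.execute("""
    SELECT name,
           CASE WHEN class = 'Mage' THEN level * 0.5
                WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
                ELSE level * 1.5
           END AS power_level
      FROM characters
     ORDER BY name
""").fetchall()

print(rows)  # [('Gandalf', 10.0), ('Gimli', 15.0), ('Legolas', 12.0)]
```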
As you can see in the results, we get a new power level value for each character. But now let's say I wanted to filter my characters based on this new power level column, keeping only characters with a power level of at least 15. How would I do that? We know the WHERE clause can be used to filter rows, so you might just go here and add a WHERE statement saying power level is greater than or equal to 15. But this is not going to work, and we know it cannot work because of the logical order of SQL operations: the CASE WHEN column we create, power level, is defined at the SELECT stage, while the WHERE filter occurs at the beginning, right after we source our table. By those rules, the WHERE clause cannot know about a power level column that will only get created later. The query we just wrote violates the logical order of SQL operations, and this is why we cannot filter here.

There is actually one thing I could do to get around this error without a subquery: avoid the power level alias that we assigned, which the WHERE statement cannot know about, and replace it with the whole CASE WHEN logic. This is going to look pretty ugly, but I'm going to do it, and if I run it, you'll see we do in fact get the result we wanted. In the WHERE lecture we saw that the WHERE clause doesn't just accept simple logical statements; you can use all the calculations and techniques available to you at the SELECT stage, including CASE WHEN statements, and that is why this solution works. However, it is obviously very ugly and impractical, and you should never duplicate code like this. So I'm going to remove this WHERE clause and show you how to achieve the same result with a subquery.

Let me first rerun this query so you can see the results. Now, I'm going to select this whole piece of logic, wrap it in round brackets, and up here write SELECT * FROM. When I run this new query, the data should be unchanged, and indeed you will see that it has not changed at all. But what is actually happening here? It's pretty simple. Usually we say SELECT * FROM fantasy_characters, naming a table our system can access. Now, instead of a table name, we supply a subquery, a piece of SQL logic that obviously returns a table. SQL looks at this whole code and says: there is an outer query, which is this one, and there is an inner, nested query, which is this one; I will compute the inner one first and then treat its result as just another table I can select from. And because it is just another table, we can apply a WHERE filter on top of it: WHERE power level is greater than or equal to 15. You will see that we get the result we wanted, just like before, but now our code actually looks better and the CASE WHEN logic is not duplicated.

If you wanted to visualize this in our schema, it would look something like this. The flow of data is the following: first we run the inner query, which works just like all the other queries we've seen until now. It starts with the FROM component, which gets the table from the database, then goes through the usual pipeline of SQL logic, eventually producing a result, which is a table. Next, that table gets piped into the outer query. The outer query also starts with the FROM component, but now FROM is not reading directly from the database; it is reading the result of the inner query. The outer query then goes through the usual pipeline of components, and finally it produces a table, and that table is our result.
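The subquery-in-FROM fix just described can be sketched like this, again as a toy sqlite3 example with invented data (SQLite requires, or at least encourages, an alias like `scaled` for the derived table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, class TEXT, level REAL)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)", [
    ("Gandalf", "Mage",    20),  # power level 10.0
    ("Aragorn", "Warrior", 24),  # power level 18.0
    ("Gimli",   "Rogue",   10),  # power level 15.0
])

# The inner query computes power_level; the outer query treats that result
# as just another table, so its WHERE clause CAN see the new column.
rows = conn.execute("""
    SELECT *
      FROM (SELECT name,
                   CASE WHEN class = 'Mage' THEN level * 0.5
                        WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
                        ELSE level * 1.5
                   END AS power_level
              FROM characters) AS scaled
     WHERE power_level >= 15
     ORDER BY name
""").fetchall()

print(rows)  # [('Aragorn', 18.0), ('Gimli', 15.0)]
```

Filtering directly on `power_level` in the inner query's own WHERE clause would fail, exactly as the lecture explains, because that column does not exist yet at the WHERE stage.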
This process could have many levels of nesting, because the inner query could reference another query, which references another query; eventually we would get to the database, but it could take many steps to get there. To demonstrate how multiple levels of nesting work, I'll go back to my query, into my inner query, which is clearly referencing the table in the database, and instead of referencing the table I will reference yet another subquery, something like FROM fantasy_characters WHERE is_alive equals true, SELECT *. I'll run this, and we have added yet another subquery to our code. This was actually not necessary at all, you could add the WHERE filter up here, but it demonstrates the fact that you can nest a lot of queries within each other.

The other reason I wanted to show you this code is that I hope you will recognize that this is not a great way of writing code. It can get quite confusing, and it's not something that can be easily read and understood. One major issue is that it interrupts the natural flow of reading, because you constantly have to suspend one query as another nested query begins within it. You read SELECT * FROM, then another query starts, which is itself querying from another subquery, and after reading all of these lines you finally find a WHERE filter that actually refers to the outer query that started many, many lines back. If you find this confusing, well, I think you're right, because it is. And the truth is that when you read code on the job, or in the wild, or when you see solutions people propose to coding challenges, this unfortunately occurs a lot: subqueries within subqueries within subqueries, and very quickly the code becomes impossible to read. Fortunately, there is a better way to handle this, and one I definitely recommend over this: Common Table Expressions, which we shall see shortly. It is, however, very important that you understand this way of writing subqueries and familiarize yourself with it, because, whether we like it or not, a lot of code out there is written like this.

We've seen that we can use the subquery functionality to define a new table on the fly, just by writing some code, a new table that we can then query just like any other SQL table. This allows us to run jobs that are too complex for a single query, and to do so without defining and storing new tables in our database. It is essentially a tool to manage complexity. This is how it works for subqueries: instead of FROM and the name of a table, we open round brackets and write an independent SQL query inside. We know that every SQL query returns a table, and that is the table we can then work on; what we do here is SELECT * FROM this table and apply a filter on the new column we created in the subquery, power_level.

Now I will show you another way to achieve the same result, through a functionality called Common Table Expressions. To build a Common Table Expression, I take the logic of this query, move it up, and give the table a name; I will call it power_level_table. Then all I need to say is WITH power_level_table AS, followed by the logic. Now this is just another table available in my query, defined by the logic inside the round brackets, so I can refer to it down here and query it just as I need, and when I run this you see that we get the same results as before. This is how a Common Table Expression works: you start with the keyword WITH, you give an alias to the table you're going to create, you write AS, open round brackets, and write an independent query that will of course return a table under that alias. Then, in your code, you can query this alias just as you've done until now for any SQL table.
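The CTE rewrite just described can be sketched as follows, on the same invented toy data as before; the name power_level_table follows the lecture:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, class TEXT, level REAL)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)", [
    ("Gandalf", "Mage",    20),
    ("Aragorn", "Warrior", 24),
    ("Gimli",   "Rogue",   10),
])

# Same logic as the subquery-in-FROM version, but the derived table is
# named up front with WITH ... AS, and the main query reads top to bottom.
rows = conn.execute("""
    WITH power_level_table AS (
        SELECT name,
               CASE WHEN class = 'Mage' THEN level * 0.5
                    WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
                    ELSE level * 1.5
               END AS power_level
          FROM characters
    )
    SELECT *
      FROM power_level_table
     WHERE power_level >= 15
     ORDER BY name
""").fetchall()

print(rows)  # [('Aragorn', 18.0), ('Gimli', 15.0)]
```

The result is identical to the nested-subquery version; only the layout of the code changes.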
And although our data result hasn't changed, I would argue that this is a better and more elegant way to achieve the same result, because we have separated in the code the logic for these two different tables. Instead of putting one query's logic in between the other and breaking the flow, we now have a much cleaner solution: first we define the virtual table that we will need, and by virtual I mean that we treat it like a table, but it's not actually saved in our database, it's still defined by our code, and then below it we have the logic that uses this virtual table.

We can also have multiple Common Table Expressions in a query; let me show you what that looks like. In our previous example on subqueries we added another part: instead of querying the fantasy_characters table, we queried a filtered version of it, SELECT * WHERE is_alive equals true. I'm just reproducing what I did in the previous lecture on subqueries. You will notice that this is really not necessary, because all we're doing here is adding a WHERE filter, which we could do in this query directly, but please bear with me, because I just want to show you how to handle multiple queries. The second thing I want to tell you is that although this code actually works, and you can verify that for yourself, I do not recommend doing this, meaning mixing Common Table Expressions and subqueries. It is really not advisable, because it adds unnecessary complexity to your code. Here we have a Common Table Expression that contains a subquery, and I would rather turn this into a situation where we have two Common Table Expressions and no subqueries at all. To do that, I'll take this logic over here and paste it at the top, and I'll give it an alias, I'll call it characters_alive, but you can call it whatever is best for you, and then I'll add the AS keyword and some line breaks to make it more readable.

Once we are defining multiple Common Table Expressions, we only need the WITH keyword once, at the beginning, and then we can simply add a comma, and please remember this, the comma is very important, and then the alias of the new table, the AS keyword, and the logic for that table. All that's needed now is to fill in this FROM, because we took away the subquery: we need to query the characters_alive virtual table here. This is what it looks like, and if you run it you will get your result. So this is the syntax for multiple Common Table Expressions: you start with the keyword WITH, which you only need once, then you give the alias of your first table, the AS keyword, and the logic between round brackets. For every extra virtual table you want to add, every extra Common Table Expression, you only need a comma, then another alias, the AS keyword, and the logic between round brackets. When you are done listing your Common Table Expressions, you omit the comma; a comma there will break your code. Finally, you write your main query. In each of these queries you are totally free to query real tables, material tables that exist in your database, as well as Common Table Expressions you have defined in this code; in fact, you can see that our second virtual table here is querying the first one. However, be advised that the order in which you write these Common Table Expressions matters, because a Common Table Expression can only reference Common Table Expressions that came before it; it is not going to be able to see those that came after it. So if, instead of FROM fantasy_characters, I try to query power_level_table here, you will see that I get an error from BigQuery, because it doesn't recognize the name, basically because that code is below. So the order in which you write them matters.
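The two-CTE version just walked through can be sketched like this, again on invented toy data (SQLite stores booleans as 1/0, so `is_alive = 1` stands in for the lecture's `is_alive = true`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, class TEXT, level REAL, is_alive INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?, ?)", [
    ("Gandalf", "Mage",    20, 1),  # alive, power level 10.0
    ("Aragorn", "Warrior", 24, 1),  # alive, power level 18.0
    ("Boromir", "Warrior", 30, 0),  # not alive, filtered out first
])

# WITH appears once; the second CTE is added with a comma and references
# the first one. Order matters: characters_alive must come before
# power_level_table, which reads from it.
rows = conn.execute("""
    WITH characters_alive AS (
        SELECT *
          FROM characters
         WHERE is_alive = 1
    ),
    power_level_table AS (
        SELECT name,
               CASE WHEN class = 'Mage' THEN level * 0.5
                    WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
                    ELSE level * 1.5
               END AS power_level
          FROM characters_alive
    )
    SELECT *
      FROM power_level_table
     WHERE power_level >= 15
""").fetchall()

print(rows)  # [('Aragorn', 18.0)]
```

Swapping the two CTE definitions would raise an error here too, since power_level_table would then reference a name defined below it.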
Now, an important question to ask is: when should I use subqueries and when should I use Common Table Expressions? The truth is that they have basically equivalent functionality; what you can do with a subquery, you can do with a Common Table Expression. My very opinionated advice is that every time you need to define a new table in your code, you should use a Common Table Expression, because they are simpler, easier to understand, cleaner, and they will make your code more professional. In fact, I can tell you that in the industry it is a best practice to use Common Table Expressions instead of subqueries, and if I were to interview you for a data job, I would definitely pay attention to this issue.

But there is an exception to this, and it is the reason I'm showing you this query, which we wrote in a previous lecture on subqueries. This is a query where you need to get a single, specific value: if you remember, we wanted characters whose experience is above the minimum experience in the data and below the maximum, characters in the middle. To do this, we need to dynamically find, whenever the query is run, the minimum and the maximum experience, and a subquery is actually great for that. You'll notice that we don't really need to define a whole new table here; we just need a specific value, and this is where a subquery works well, because it implements very simple logic and doesn't actually break the flow of the query. But for something more complex, like the power_level_table query we're using here, which takes the name and the level, then applies CASE WHEN logic to the level to create a new column called power_level, you could do it with a subquery, but I actually recommend doing it with a Common Table Expression.

And here is a good blog post on this topic by the company dbt. It talks about Common Table Expressions in SQL, why they are so useful for writing complex SQL code, and best practices for using them, and towards the end of the article there's an interesting comparison between Common Table Expressions and subqueries. You can see that CTEs, Common Table Expressions, are more readable, whereas subqueries are less readable, especially when many are nested: a subquery within a subquery within a subquery quickly becomes unreadable. Reusability is another great advantage of CTEs (the article also mentions recursion, which we won't examine in detail): once you define a Common Table Expression in your code, you can reuse it in any part of your code, in multiple places, in other CTEs, in your main query, and so on. A subquery, on the other hand, can really only be used in the query in which you defined it; you cannot use it in other parts of your code, and that is another disadvantage. A less important factor: when you define a CTE you always need to give it a name, whereas subqueries can be anonymous, and you can see it very well here, where we of course had to name both of these CTEs while the subqueries we're using are anonymous; I wouldn't say that's a huge difference. And finally, CTEs cannot be used inside a WHERE clause, whereas subqueries can, and this is exactly the example I've shown you: when it's a simple value we want to use in our WHERE clause to filter our table, subqueries are the perfect tool, whereas CTEs are suitable for more complex use cases where you need to define entire tables. In conclusion, the article says CTEs are essentially temporary views. I've used the term virtual table, but temporary view works just as well and conveys the same idea: they are great for giving your SQL more structure and readability, and they also allow reusability.
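The exception just described, scalar subqueries inside a WHERE clause, can be sketched as a toy sqlite3 example with invented experience values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, experience INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?)", [
    ("Frodo",   2000),
    ("Aragorn", 6000),
    ("Gandalf", 10000),
])

# Each uncorrelated subquery returns one fixed value (the min or the max),
# computed once and then compared against every row: characters strictly
# between the extremes survive the filter.
rows = conn.execute("""
    SELECT name, experience
      FROM characters
     WHERE experience > (SELECT MIN(experience) FROM characters)
       AND experience < (SELECT MAX(experience) FROM characters)
""").fetchall()

print(rows)  # [('Aragorn', 6000)]
```

A CTE could not be dropped into the WHERE clause like this, which is precisely why the scalar subquery remains the right tool for this case.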
Before we move on to other topics, I wanted to show you what an amazing tool Common Table Expressions are for creating complex data workflows, because Common Table Expressions are not just a trick to execute certain SQL queries; they are actually a tool that allows us to build data pipelines within our SQL code, and that can really give us data superpowers. Here I have drawn a typical workflow that you will see in complex SQL queries that make use of Common Table Expressions. What we're looking at is a single SQL query, but a complex one, because it uses CTEs. The query is represented graphically here, and with a simple code reference here: the blue rectangles represent the Common Table Expressions, the virtual tables you can define with the CTE syntax, while the red square represents the base query, the query at the bottom of your code that ultimately returns the result.

A typical flow looks like this. You have a first Common Table Expression, T1, a query that references a real table, one that actually exists in your data set, such as fantasy_characters, and of course this query can do some work: it can apply filters, calculate new columns, and so on, everything we've seen until now. Then the result of this query gets piped into another Common Table Expression, T2, which takes the result of whatever happened at T1 and applies further logic, some more transformations. Again the result gets piped into another table where more transformations run, and this can continue for any number of steps until you get to the final query, the base query, where we compute the end result that is then returned to the user. This is effectively a data pipeline that gets data from the source and applies a series of complex transformations, and it is similar to the logical schema of SQL we've been discussing, except one level further: in our usual schema the steps are performed by clauses, the components of a SQL query, whereas here every step is actually a query in itself. So of course this is a very powerful feature: this data pipeline applies many queries sequentially until it produces the final result, and you can do a lot with this capability.

You should now also be able to understand how this is implemented in code. We have our usual CTE syntax: WITH, then the first table, which we call T1, then its logic within round brackets, and you can see that its FROM references a table in the data set. For every successive Common Table Expression we just add a comma, a new alias, and the logic; comma, new alias, and the logic. Finally, when we're done, we write our base query, and you can see that the base query selects from T3, T3 selects from T2, T2 selects from T1, and T1 selects from the database.

But you are not limited to this type of workflow. Here is another, maybe slightly more complex, workflow that you will also see in the wild. At the top we have two Common Table Expressions that reference the database: the first gets data from table one and transforms it, the second gets data from table two and transforms it. Next, a third CTE combines data from these two tables. We haven't yet seen how to combine data except through the UNION, and I wrote "join" here, which we're going to see shortly, but all you need to know is that T3 combines data from these two parent tables. And then, finally, the base query uses not only the data from T3 but also goes back to T1 and uses that data as well. Remember we said that a great thing about CTEs is that tables are reusable: you define them once and then you can use them anywhere. Here is an example with T1: it is defined at the top of the code and referenced by T3, but it is also referenced by the base query.
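The T1 → T2 → T3 pipeline, including the reuse of T1 by the base query, can be sketched as a toy sqlite3 example; the names t1/t2/t3 follow the diagram, and the data and the specific transformations at each step are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, level REAL, is_alive INTEGER)")
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)", [
    ("Gandalf", 20, 1),
    ("Frodo",    8, 1),
    ("Boromir", 30, 0),
])

# A three-step pipeline: t1 filters the source table, t2 derives a column
# from t1, t3 filters t2. The base query reads t3 but also reuses t1
# (here via a scalar subquery counting the surviving rows).
rows = conn.execute("""
    WITH t1 AS (
        SELECT * FROM characters WHERE is_alive = 1
    ),
    t2 AS (
        SELECT name, level * 1.5 AS power_level FROM t1
    ),
    t3 AS (
        SELECT name, power_level FROM t2 WHERE power_level >= 15
    )
    SELECT name,
           power_level,
           (SELECT COUNT(*) FROM t1) AS alive_count
      FROM t3
""").fetchall()

print(rows)  # [('Gandalf', 30.0, 2)]
```

Each CTE is defined once and can be read from any later step, which is what makes this chaining pattern work as an in-query pipeline.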
referenced by T3 but it is also referenced by the base query so this is another example of a workflow that you could have and really the limit here is your imagination and the complexity of your needs you can have complex workflows such as this one which can Implement very complex data requirements so this is a short overview of the power of CTE and I hope you’re excited to learn about them and to use them in your sequel challenges we now move on to joints which are a powerful way to bring many different tables together and combine their information and I’m going to start us off here with a little motivating example now on the left here I see my characters table and by now we’re familiar with this table so let’s say that I wanted to know for each character how many items they are carrying in their inventory now you will notice that this information is not available in the characters table however this information is available in the inventory table so how exactly does the inventory table works when you are looking at a table for the first time and you want to understand how it works the best question you can ask is the following what does each row represent so what does each row represent in this table well if we look at the columns we can see that for every row of this table we have a specific character id and an item id as well as a quantity and some other information as well such as whether the item is equipped when it was purchased and and so on so looking at this I realized that each row in this table represents a fact the fact that a character has an item right so I know by looking at this table that character id 2 has item 101 and character ID3 has item six and so on so clearly I can use this in order order to answer my question so how many items is Gandalf carrying to find this out I have to look up the ID of Gandalf which as you can see here is six and then I have to go to the inventory table and in the character id column look for the ID of Gandalf right 
now unfortunately it’s not ordered but I can look for myself here and I can see that at least this row is related to Gandalf because he has character id6 and I can see that Gandalf has item id 16 in his inventory and I’m actually seeing another one now which is this one which is 11 and I’m not seeing anyone uh any other item at the moment so for now based on my imperfect uh visual analysis is I can say that Gandalf has two items in his inventory of course our analysis skills are not limited to eyeballing stuff right we have learned that we can search uh a table for the information we need so I could go here and query the inventory table in a new tab right and I could say give me um from the inventory table where character id equals 6 this should give me all the information for Gandalf and I could say give me all the columns and when I run this I should see that indeed we have uh two rows here and we know that Gandalf has items 16 and 11 in his inventory we don’t know exactly what these items are but we know that he’s carrying two items so that’s a good start okay but uh what if I wanted to know which items Frodo is carrying well again I can go to the characters table and uh look up the name Frodo and I find out that Frodo is id4 so going here I can just plug that uh number into my we filter and I will find out that Frodo is carrying a single type of item which has id9 although it’s in a quantity of two and of course I could go on and do this for every character but it is quite impractical to change the filter every time and what if I wanted to know how many items each character is carrying or at least which items each character is carrying all at once well this is where joints come into play what I really want to do in this case is to combine these two tables into one and by bringing them together to create a new table which will have all of the information that I need so let’s see how to do this now the first question we must answer is what unites these two tables 
What connects them? What can we use to combine them? We've actually already seen this in our example: the inventory table has a character_id field which refers to the ID of the character in the characters table. So we have two columns here, the character_id column in inventory and the id column in characters, which represent the same thing: the identifier for a character. This logical connection, the fact that these columns represent the same thing, can be used to combine the tables. So let me start a fresh query over here, and as usual I will start with the FROM part. Where do I want to get my data from? From the characters table, just as we've been doing until now. However, the characters table is no longer enough for me: I need to join this table on the fantasy.inventory table. And how do I want to join these two tables? Well, we know that the inventory table has a character_id column which corresponds to the characters table's id column. As we said before, these two columns from the two different tables represent the same thing, so there's a logical connection between them, and that's what we'll use for the join. I also want to draw your attention to the notation we're using here. Because this query has two tables in it, it is not enough to simply write the name of a column; it is also necessary to specify which table each column belongs to, and we do that with this dot notation. So inventory.character_id says we are talking about the character_id column in the inventory table, and characters.id is the id column in the characters table. It's important to write columns with this notation to avoid ambiguity whenever you have more than one table in your query.

Until now we have used the FROM clause to specify where we want to get data from, and normally that was simply the name of a table. Here we are doing something very similar, except that we are creating a new table, obtained by combining two pre-existing tables. So we are not getting our data from the characters table, and we are not getting it from the inventory table: we are getting it from a brand-new table that we have created by combining the two, and that is where our data lives. To complete the query, for now we can simply add a SELECT *, and you will see the result. Let me make some room here and expand the results so I can show you what we got. As you can see, we have a brand-new table in our result, and if you check the columns you will notice that it includes all of the columns from the characters table and also all of the columns from the inventory table, combined by our join statement.

To get a better sense of what's happening, let's get rid of the star and select the columns we're interested in, again written with the dot notation to avoid ambiguity. Remember that we have all of the columns from the characters table and all of the columns from the inventory table to choose from. I will take the id column from characters, the name column from characters, then the item_id column from the inventory table, and, also from inventory, the quantity of each item. To make the results clearer, I will order them by the character's id and then by the item id. And you can see we get the result we needed: we have all of our characters, with their ids and their names, and for each character we can tell which items are in their inventory. Aragorn has item 4 in his inventory in a quantity of two, and he also has item 99, so Aragorn gets two rows; looking back at Frodo we see the information we retrieved before, and the same for Gandalf, who has these two items. We have combined the characters table and the inventory table to get the information we needed. What does each row represent in our result? The same thing as in the inventory table: each row is a fact, namely that a certain character possesses a certain item. But unlike the inventory table, we now have all the information we want about the character, not just the ID. Here we're showing each character's name, but we could of course select more columns and get more information as needed.

Now, a short note on notation. When you see SQL code in the wild and a query joins two or more tables, you'll notice that programmers are usually quite lazy and don't feel like writing the full table name every time, like we're doing here with characters. So what we usually do is add an alias on the table: FROM fantasy.characters, call it c, join on inventory, call it i, and then use these aliases everywhere in the query, both in the join condition and in the column names. Let me substitute everything here. Yes, maybe it's a bit less readable, but it's faster to write, and we programmers are lazy, so you'll often see this notation. You will also often see the AS keyword omitted, since it's implicit in SQL, and so we write it like this: FROM fantasy.characters c JOIN fantasy.inventory i, where c and i refer to the two tables we're joining. I can run this and show you that the query works just as well.
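The aliased query built up in this passage can be sketched end to end in code. This is a minimal simulation using Python's built-in sqlite3 module (the course uses BigQuery, but the join syntax here is standard SQL); the table names are simplified (no `fantasy.` dataset prefix) and the rows are made-up sample data, not the course dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT);
    CREATE TABLE inventory  (character_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO characters VALUES (1, 'Aragorn'), (2, 'Legolas'), (3, 'Gimli'), (4, 'Frodo');
    INSERT INTO inventory  VALUES (1, 4, 2), (1, 99, 1), (2, 7, 1), (3, 12, 3);
""")

# The aliased join: c and i stand in for the full table names,
# and the dot notation disambiguates columns from the two tables.
rows = conn.execute("""
    SELECT c.id, c.name, i.item_id, i.quantity
    FROM characters AS c
    JOIN inventory AS i ON i.character_id = c.id
    ORDER BY c.id, i.item_id
""").fetchall()

for row in rows:
    print(row)
```

Note that Frodo (id 4) has no row in inventory, so he does not appear in the output at all: a plain JOIN only keeps rows that have a match in both tables.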
Now we've seen why JOIN is useful and what it looks like, but I want you to get a detailed understanding of exactly how the logic of a join works, and for this I'm going to go back to my spreadsheet. What I have here is my characters table and my inventory table, just like you've seen them in BigQuery, except that I'm only taking four rows each to keep the example simple. And what you see here is the same query I've just run on BigQuery: a query that takes the characters table, joins it on the inventory table on this particular condition, and then picks a few columns from the result. So let us see how to simulate this query in Google Sheets.

The first thing I need to do is build the table that I will run my query on, because, as we've said, the FROM part now references neither the characters table nor the inventory table, but a new table built by combining the two. So our first job is to build this new table, and the first step is to take all of the columns from characters and put them in the new table, then take all of the columns from inventory and put them in too. What we've obtained is the structure of our new table: it is simply all of the columns of the table on the left, followed by all of the columns of the table on the right.

Now I will go through each character in turn and consider the join condition, which is that the ID of the character is present in the character_id column of inventory. Let us look at my first character: Aragorn, who has ID 1. Is this ID present in the character_id column? Yes, I see it in the first row, so we have a match. Given that we have a match, I take all of the data in the characters table for Aragorn, then all of the data in the inventory table from the matching row, and I have built my first row. Is there any other row in the inventory table that matches? Yes, the second row also has a character_id of 1, so I repeat the operation: I take all of Aragorn's data from the left table and add all of the data from the matching row on the right. There are no more matches for ID 1 in the inventory table, so I proceed to Legolas, who has character id 2. The question: is there any row with the value 2 in the character_id column? Yes, I can see it here, so just like before I take the information for Legolas, paste it here, then take the matching row and paste it next to it. We move on to Gimli, because there are no other matches for Legolas. Gimli has ID 3, and I can see a match over here, so I take the row for Gimli, paste it, then take the matching row with character_id 3 and paste it. Finally we come to Frodo, with character id 4. Is there any match for this character? I can find no match at all, so I do nothing: this row does not go into the resulting table, because there is no match. And that completes the job of this part of the query: building the table that comes from joining these two tables.

This is my resulting table, and now, to complete the query, I simply have to pick the columns that the query asks for. The first column is characters.id, which is this one over here, so I take it and put it in my result. The second column is characters.name, which is this one; the third is the item_id column, this one right here; and finally I have quantity, this one right here. And this is the final result of my query.

Of course, this is just like any other SQL table, so I can use all of the other things I've learned to run logic on it. For example, I might only want to keep items that are present in a quantity of at least two. To do that, I simply add a WHERE filter, and I refer to the inventory table, because that's the parent table of the quantity column: i.quantity >= 2. How will my query work? First it builds this table, like we've seen — it does this stage first — and then it runs the WHERE filter on that table, keeping only the rows where quantity is at least two. As a result we will only get this row over here instead of the full result we see right here, except that of course we also keep only the columns specified in the SELECT statement, so we get id, name, item_id and quantity. This is the result of my query after adding the WHERE filter. Let us take this, add it to BigQuery, and make sure that it works. I have to add it after the FROM part and before the ORDER BY part — that's the order — and after I run this, I see that I indeed get Aragorn and Frodo. It's not exactly the same as in our sheet, but that's because our sheet has less data; this is what we wanted to achieve.

Now let us go back to our super important diagram of the logical order of SQL operations and ask ourselves: where does the join fit into this schema? As you can see, I have placed JOIN at the very beginning of our flow, together with FROM, because the truth is that the JOIN clause is not really separate from the FROM clause: they are one and the same component in the logical order of operations.
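The ordering just described — FROM/JOIN builds the combined table first, then WHERE filters it — can be checked with the same kind of sqlite3 sketch (sample rows invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT);
    CREATE TABLE inventory  (character_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO characters VALUES (1, 'Aragorn'), (2, 'Legolas'), (3, 'Gimli'), (4, 'Frodo');
    INSERT INTO inventory  VALUES (1, 4, 2), (1, 99, 1), (2, 7, 1), (3, 12, 3);
""")

# FROM + JOIN builds the combined table first; WHERE then filters its rows,
# which is why the filter can reference columns from either side (here i.quantity).
rows = conn.execute("""
    SELECT c.id, c.name, i.item_id, i.quantity
    FROM characters AS c
    JOIN inventory AS i ON i.character_id = c.id
    WHERE i.quantity >= 2
    ORDER BY c.id, i.item_id
""").fetchall()

print(rows)
```

Only the rows of the joined table whose quantity is at least two survive the filter; the SELECT then trims the columns, exactly as in the spreadsheet walkthrough.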
As you remember, the first stage specifies where our data lives — where do we want to get our data from? Until now we were content to answer this question with a single table name, the address of a single table, because all the data we needed was in just one table. Now we are taking it a step further: we are saying our data lives in a particular combination of two or more tables, so let me tell you which tables I want to combine and how I want to combine them. The result of this is, of course, yet another table, and this table is the beginning of my flow; after that I can apply all the other operations I've come to know, and it works just like all our previous examples, because the result of a join is just another table. So when you look at a SQL query that includes a join, you really have to see the join as one and the same with the FROM part: together they define the source of your data by combining tables, and everything else you do is applied not to a single table, not to any of the tables you're combining, but to the resulting table that comes from the combination. This is why FROM and JOIN are really the same component, and why they are the first step in the logical order of SQL operations.

Let us now briefly look at multiple joins, because sometimes the data you need is spread across three or four tables — and you can actually join as many tables as you want, or at least as many as your system allows before it becomes too slow. We have our example from before: for each character we have their name and we know which items are in their inventory, but we don't actually know what the items are; we just know their IDs. So if Aragorn has item 4, what item does Aragorn actually have — what is its name? Obviously this information is available in the items table that you have here on the right, and you can see that it has a name column. Just like before, I can eyeball it: I know I'm looking for item id 4, and if I go to 4 I can see that this item is a healing potion. Now let us see how we can get this with a join. I'll go to my query, and after joining characters with inventory I will take that result and simply join it on a third table: JOIN fantasy.items, which I can call "it" as a brief form, because I am lazy, as all programmers are. Now I need to specify the condition on which to join. The condition is that the item_id column — which came from the inventory table, that's its parent, so I refer to it as i.item_id, using the brief form i for inventory — is the same as the id column in the items table. Now that I've added my condition, the data I'm sourcing is a combination of these three tables, and in my result I have access to the columns of the items table simply by referring to them, so I will select it.name and one more thing, it.power. After I run this query I should be able to see, for each item, its name and its power: Aragorn has a healing potion with a power of 50, Legolas has an Elven bow with a power of 85, and so on.

Now, you may have noticed something a bit curious: name here is actually written as name_1. Can you figure out why this is happening? It's happening because there is an ambiguity: the characters table has a column called name, and the items table also has a column called name. Because BigQuery does not refer to the columns the way we do — by writing the parent table and then the column name — it would find itself with two identically named columns in the result, so it distinguishes the second one by appending _1. We can remedy this by renaming the column to something more meaningful, for example item_name, which is a lot clearer for whoever looks at the result of our query, and as you can see, the name now makes more sense.

So you can see that the multiple join is actually nothing new: when we join the first time, like we did before, we combine two tables into a new one, and then this new table gets joined to a third table. It is simply the join operation repeated twice. But let us simulate a multiple join in our spreadsheet to make sure we understand it and that it's nothing new. Again I have our tables here, but I have added the items table, which we will combine, and I've written our query: take the characters table, join it with inventory like we did before, then take the result and join it to items, with this condition. The first thing we need to do is process our first join, which is exactly what we've done before, so let us do it again. The combined table of characters and inventory has a structure obtained by taking all the columns of characters and then all the columns of inventory and putting them side by side, and this is the result table. For the logic I'll go faster, because we've done it before: the first character has ID 1 and two matches, so I take its values, put them into two rows, and copy the two matching inventory rows to complete the match. Then we have Legolas, with one match: I'm looking for ID 2, so I take this row over here — that's all we have. Then Gimli, who also has one match, so I take him and the matching row. Finally, Frodo has no match, so I do not add him to my result. This is exactly what we've done before.

Now that we have this new table, we can proceed with our next join, with items. The resulting table will be the result of our first join combined with items — and to show that we've already computed this and it's now one table, I have added round brackets. The rules for joining are just the same: take all of the columns of the left-side table, then all of the columns of the right-side table, and now we have the resulting structure of our table. Then we go through every row. First row: what does the join condition say? The item_id needs to appear in the id column of items. I can see a match here, so I take this row on the left side and the matching row on the right side and add them to the result. Second row: the item id is 4 — do I have a match? Yes, so I paste the row on the left and the matching row on the right. Third row: item id 2 — do I have a match? No, so I don't need to do anything. And in the final row, item id 101, I don't see a match either, so again I do nothing. This is my final result. In short, a multiple join works just like a normal join: combine the first two tables, get the resulting table, and then keep doing this until you run out of joins.
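The three-table chain can be sketched the same way in sqlite3 (made-up sample rows; note the `AS item_name` alias that avoids the name/name_1 collision described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT);
    CREATE TABLE inventory  (character_id INTEGER, item_id INTEGER, quantity INTEGER);
    CREATE TABLE items      (id INTEGER, name TEXT, power INTEGER);
    INSERT INTO characters VALUES (1, 'Aragorn'), (2, 'Legolas'), (3, 'Gimli'), (4, 'Frodo');
    INSERT INTO inventory  VALUES (1, 4, 2), (1, 99, 1), (2, 7, 1), (3, 12, 3);
    INSERT INTO items      VALUES (4, 'Healing Potion', 50), (7, 'Elven Bow', 85), (12, 'Battle Axe', 70);
""")

# Two joins in a row: characters+inventory is combined first,
# then that result is joined to items. Both characters and items
# have a "name" column, so the second one is renamed to item_name.
rows = conn.execute("""
    SELECT c.name, i.item_id, it.name AS item_name, it.power
    FROM characters AS c
    JOIN inventory AS i  ON i.character_id = c.id
    JOIN items     AS it ON i.item_id = it.id
    ORDER BY c.id, i.item_id
""").fetchall()

print(rows)
```

Aragorn's inventory row for item 99 has no counterpart in the items table here, so, just as in the spreadsheet simulation, it drops out of the inner join.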
Now there's another special case of join, the self join, and this is something that people who are getting started with SQL tend to find confusing. But I want to show you that there's nothing confusing about it, because it's really just a regular join that works like all the other joins we've seen; there's nothing actually special about it. We can see here the characters table, and you might remember that each character has a mentor_id column. In a lot of cases this column has the value NULL, meaning there's nothing there, but in some cases there is a value, and what it means is that this particular character — we are looking at number 3, that is Saruman — has a mentor. Who is this mentor? All we know is that their ID is 6, and it turns out that the ID in this column refers to the id column of the characters table itself. So to find out who 6 is, I just have to look for who has an ID of 6, and I can see that it is Gandalf. By eyeballing it, I know that Saruman has a mentor and that mentor is Gandalf, and Elrond also has the same mentor, Gandalf. So I can solve this by eyeballing the table — but how can I get a table that shows, for each character who has a mentor, who their mentor is? It turns out that I have to take the characters table and join it on itself. Let's see how that works in practice.

Let me start a new query here on the right. My goal is to list every character in the table and also show their mentor, if they have one. I will of course need the characters table for this, and the first time I take this table it is simply to list all of the characters, so to remind myself of that I give it a label: chars. Now, as you know, each character has a mentor_id value, but to find out the name of that mentor I actually need to look it up in the characters table. To do this, I join on another instance of the characters table. This is another copy, let's say, of the same data, but I'm going to use it for a different purpose: I will not use it to list my characters, I will use it to get the name of the mentor, so I will call it mentors to reflect this use. What is the logical connection between these two copies of the characters table? Each character in my list has a mentor_id field, and I want to match it against the id field of my mentors table. That is the logical connection I'm looking for, and I can now add a SELECT * to quickly complete my query and see the results over here. The resulting table has all of the columns of the left table and all of the columns of the right table, which means the columns of the characters table are repeated twice in the result, as you can see here. On the left I simply have my list of characters — the first one is Saruman — and on the right I have the data about their mentor: Saruman has a mentor_id of 6, and here starts the data about the mentor, who has an ID of 6 and the name Gandalf. So you can see that our self join has worked as intended.

But this is actually a bit messy — we don't need all of these columns — so let us now select only the columns we need: from my list of characters I want the name, and from the corresponding mentor I also want the name, and I will label these columns so that they make sense to whoever looks at my data: I will call one character_name and the other mentor_name. When I run this query, you can see that we get exactly what we wanted: the list of all our characters — at least the ones who have a mentor — and, for each character, the name of their mentor. So a self join works just like any other join, and the key to avoiding confusion is to realize that you are joining two different copies of the same data; you are not actually joining on the same exact table. One copy of fantasy.characters we call chars and use for one purpose; a second copy we call mentors and use for another purpose. When you realize this, you see that you are simply joining two tables, and all the rules you've learned about normal joins apply — it just so happens that, in this case, the two tables are identical, because you're getting the data from the same source.

To drive the point home, let us quickly simulate this in our trusty spreadsheet. As you can see, I have the query that I ran in BigQuery, and we're now going to simulate it. The important thing to see is that we're not actually joining one table to itself, although that's what it looks like: we're joining two tables which just happen to look the same, one called chars and one called mentors, based on the labels we've given them. Once we join them, the rules are the same as we've seen until now: to create the structure of the resulting table, take all the columns from the left, then all the columns from the right, and then go row by row and look for matches based on the condition. The condition is that mentor_id in chars needs to appear in the id column of mentors. First row: Aragorn has mentor 2. Is this in the id column? Yes, I can see a match here, so I take all the values from here and all the values from the matching row and paste them together. Are there any other matches? No. Second row: we're looking for mentor id 4. Do we have a match? Yes, I can see it here, so I take all of the values from the left and all of the values from the matching row on the right. Now we have two more rows, but as you can see, in both cases mentor_id is NULL, which means these characters have no mentor, and for the purposes of the join we can simply ignore these rows: we are not going to find a match for them.
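A self join in code looks like this (sqlite3 again; the ids and names are sample values in the spirit of the transcript's Saruman/Elrond/Gandalf example, not the course data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT, mentor_id INTEGER);
    INSERT INTO characters VALUES
        (3, 'Saruman', 6), (5, 'Elrond', 6), (6, 'Gandalf', NULL);
""")

# Two aliases over the SAME table: "chars" lists the characters,
# "mentors" is a second copy used only to look up each mentor's name.
rows = conn.execute("""
    SELECT chars.name AS character_name, mentors.name AS mentor_name
    FROM characters AS chars
    JOIN characters AS mentors ON chars.mentor_id = mentors.id
    ORDER BY chars.id
""").fetchall()

print(rows)
```

Gandalf's mentor_id is NULL, and NULL never compares equal to anything, so his row finds no match and is dropped by the inner join.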
In fact, as an aside: even if there were a character whose id was NULL, a mentor_id of NULL would not match it, because in SQL, in a sense, NULL does not equal NULL — NULL is not a specific value but represents the absence of data. In short, when mentor_id is NULL we can be sure there will be no match, and the row will not appear in the join. Now that we have our result, we simply need to select the columns that we want: the first one is name, which comes from the chars table — this one over here — and the second one is name from the mentors table — this one over here. And here is our result. That's how a self join works.

Until now we have seen join conditions that are pretty strict and straightforward: there's a column in the left table and a column in the right table, they represent the same thing, and you look for an exact match between them — typically an ID number. One table has the item id, the other table also has the item id, you look for an exact match, and if there is one you include the row in the join; otherwise you don't. That's pretty straightforward. But what I want to show you here is that the join is actually much more flexible and powerful than that. You don't always need two columns that represent the exact same thing, or an exact match, in order to write a join condition. In fact, you can create your own complex conditions and combinations that decide how to join two tables, simply by using the Boolean algebra magic that we've learned about in this course and have been using, for example, when working on the WHERE filter. So let us see how this works in practice. I've tried to come up with an example that will illustrate it: let's say we have a game — a board game, a video game, whatever — and we have our characters and our items.
In our game, a character cannot simply use all of the items in the world; there is a limit to which items a character can use, and the limit is based on the following rule, which I'll write here as a comment so we can then use it in our logic: a character can use any item whose power level is less than or equal to the character's experience divided by 100. This is just a rule that exists in our game. Now let us say we wanted to get a list of all characters and the items they can use — clearly a case where we need a join. So let us write this query. I will start by getting my data from fantasy.characters, which I call c as a shorthand, and I will need to join on the items table. And what is the condition of the join? The condition is that the character's experience divided by 100 is greater than or equal to the item's power level. (I forgot to add the shorthand i for the items table here, so let me do that.) This is the condition that reflects our rule. Out of the table I've created, I would like to see the character's name and the character's experience divided by 100, and then the item's name and the item's power, to make sure my join is working as intended. Let us run this and look at the result. It looks a bit weird because we haven't given a label to this column, but I can see that I have Gandalf, his experience divided by 100 is 100, and he can use the item Excalibur, which has a power of 100 — which satisfies our condition. Let me order by character name so that I can see in one place all of the items that a character can use. We can see that Aragorn is first and his experience divided by 100 is 90 — the same value in all of the rows we see right now — and then we see all of the items Aragorn is allowed to use, with their power, and in each case you will see that the power does not exceed the value on the left. So the condition we wrote works as intended.

What we have here is a Boolean expression, just like the ones we've seen before: a logical statement that, when evaluated, comes out either true or false, and all of the rules we've seen for Boolean expressions apply here as well. For example, I can decide that this rule does not apply to mages, because mages are special: if a character is a mage, I want them to be able to use all of the items. How can I do this in this query? Can you pause the video and figure it out? What I can do is simply expand my Boolean expression by adding an OR, and what I want to test for is that the character's class equals 'Mage'. Let me check for a second that the column is called class and the value is 'Mage' — so this should work. If I run this and go through the result — I won't do it here, but you can do it yourself — you'll verify that if a character is a mage, they can use all of the items. This, of course, is just a Boolean expression in which two statements are connected by an OR: if at least one of the two is true, the whole statement evaluates to true, and so the row matches. If you have trouble seeing this, go back to the video on Boolean algebra, where everything is explained.

This is just like what we did before when we simulated the join in the spreadsheet: you can imagine taking the left-side table, characters, going row by row, and for each row checking all of the rows in the right-side table, items — but this time you don't check whether the IDs correspond; you actually evaluate this expression to see whether there is a match. When the expression evaluates to true, you consider that a match and include the row in the join; when it evaluates to false, it's not a match and the row is not included in the join.
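The flexible condition, including the mage exception, can be sketched like this (sqlite3; experience values and item powers are invented, and `class`/`Mage` follow the column and value names mentioned in the transcript):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT, class TEXT, experience INTEGER);
    CREATE TABLE items (id INTEGER, name TEXT, power INTEGER);
    INSERT INTO characters VALUES (1, 'Aragorn', 'Ranger', 9000), (2, 'Gandalf', 'Mage', 5000);
    INSERT INTO items VALUES (1, 'Healing Potion', 50), (2, 'Elven Bow', 85), (3, 'Excalibur', 100);
""")

# The join condition is an arbitrary Boolean expression, not an equality
# between two ID columns: a row matches if the character's experience / 100
# is at least the item's power, OR if the character is a Mage.
rows = conn.execute("""
    SELECT c.name, c.experience / 100 AS threshold, it.name AS item_name, it.power
    FROM characters AS c
    JOIN items AS it
      ON c.experience / 100 >= it.power
      OR c.class = 'Mage'
    ORDER BY c.name, it.power
""").fetchall()

print(rows)
```

Aragorn (threshold 90) matches the potion and the bow but not Excalibur, while Gandalf the Mage matches all three items regardless of his threshold — the OR branch makes every item row a match for him.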
This is simply a generalization of the exact match, and it shows that you can use any condition to join two tables. Now, I've been pretending that there is only one type of join in SQL, but that is actually not true: there are a few different types of join that we need to know, so let us see what they are and how they work. This is the query we wrote before, exactly as we've written it, and as you can see we simply specified JOIN — but it turns out that what we were doing all along is called an inner join. Now that I've written it explicitly, you can see that if I rerun the query I get exactly the same results. This is because the inner join is by far the most common type of join in SQL, so many dialects of SQL, such as the one used by BigQuery, allow you to skip the specification and simply write JOIN, which is then treated as an inner join. So when you want an inner join, you have the choice of specifying it explicitly or simply writing JOIN.

What I want to show you now is another type of join, called a left join, and to see how it works I want to show you how to simulate this query in the spreadsheet. As you can see, this is very similar to what we've done before: I have the query I want to simulate — notice the LEFT JOIN — and then my two tables. Now, what is the purpose of the left join? In the previous examples, which featured the inner join, we saw that when we combine two tables the resulting table only has rows that have a match in both tables: we went through every row in the characters table, kept it if it had a match in the inventory table, and completely discarded it if there was no match. But what if we wanted our resulting table to show all of the characters, to make sure our list of characters was complete, regardless of whether they had a match in the inventory table? This is what the left join is for: it exists so that we can keep all of the rows of the left table, whether they have a match or not.

So let us see that in practice, with a left join between characters and inventory. First of all, I need to determine the structure of the resulting table, and to do this I take all of the columns from the left table and all of the columns from the right table — nothing new there. Next step: go row by row in the left table and look for matches. We have Aragorn, and by now we remember that he has two matches: these two rows match his ID in their character_id column, so I take them and add them to my resulting table. Next is Legolas: I see a match here, so I take the row where Legolas matches and put it here — it's only one row, actually. Gimli also has a single match, so I create the row over here: this is the match for Gimli. And of course I can make sure I'm doing things correctly by looking at this id column and this character_id column over here: they have to be identical, and if they're not, I've made a mistake. Finally we come to Frodo. Frodo, you will see, does not have a match in this table. Before, we simply discarded this row because it had no match; now, though, we are dealing with the left join, which means that all of the rows in the characters table need to be included, so I have no choice: I take this row and add it here. And now the question is: what values do I put in the remaining columns? I cannot put any value from the inventory table, because there is no match, so the only thing I can do is put NULLs in there. NULLs, of course, represent the absence of data, so they're perfect for this use case. And that basically completes the sourcing part of our left join.

Now, you may have noticed that there is an extra row in inventory which does not have a match: it refers to character_id 10, but there is no character with ID 10. The Frodo row also did not have a match, and we included it — so should we include this row as well? The answer is no. Why not? Because this is a left join: a left join means that we include all of the rows of the left table, even if they don't have a match, but we do not include rows of the right table when they do not have a match. That is why it's called a left join. If you're still confused about this, don't worry, because it will become clearer once we see the other types of join. And for the sake of completeness, I can actually finish the query by selecting my columns — the character id, the character name, the item id and the item quantity — and this is my final result. In the case of Frodo we have NULL values, which tell us that this row found no match in the right table, which here means that Frodo does not have any items.

Now that you understand the left join, you can also easily understand the right join: it is simply the symmetrical operation to the left join. Whether you do characters LEFT JOIN inventory or inventory RIGHT JOIN characters, the result will be identical — it's just the symmetrical operation, which is why I wrote here that table A LEFT JOIN table B equals table B RIGHT JOIN table A. Hopefully that's pretty intuitive, but of course, if I did characters RIGHT JOIN inventory, the results would be reversed: I would have to keep all of the rows of inventory, regardless of whether they have a match or not, and only keep the rows of characters that have a match. If you experiment for yourself on the data, you will easily convince yourself of this result. Let us now see the left join in practice. Remember the query from before, where we take each character and then see their mentor? This is the code, exactly as we've written it before.
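Before going back to the mentor query, the characters/inventory left join just simulated in the sheet can be checked in code (sqlite3; sample rows invented, including the unmatched character_id 10 inventory row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characters (id INTEGER, name TEXT);
    CREATE TABLE inventory  (character_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO characters VALUES (1, 'Aragorn'), (2, 'Legolas'), (3, 'Gimli'), (4, 'Frodo');
    -- the last row refers to character_id 10, which matches no character
    INSERT INTO inventory  VALUES (1, 4, 2), (1, 99, 1), (2, 7, 1), (3, 12, 3), (10, 55, 1);
""")

# LEFT JOIN keeps every row of the left table (characters), filling the
# right-side columns with NULL when there is no match; unmatched rows of
# the RIGHT table (character_id 10) are still dropped.
rows = conn.execute("""
    SELECT c.id, c.name, i.item_id, i.quantity
    FROM characters AS c
    LEFT JOIN inventory AS i ON i.character_id = c.id
    ORDER BY c.id, i.item_id
""").fetchall()

for row in rows:
    print(row)
```

Frodo now appears with NULLs (shown as `None` in Python) in the inventory columns, while the orphaned inventory row for character 10 is nowhere in the result — exactly the two behaviors walked through in the spreadsheet.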
exactly as we’ve written it before and so now you know that this is an inner join because when you don’t specify what type of join you want SQL assumes it’s an inner join at least that’s what the SQL in bigquery does and you can see that if I write inner join um I think I have a typo there uh the result is absolutely identical and in this case we’re only including characters who have a mentor right we are missing out on characters who don’t have a mentor meaning that Mentor ID is null because in the inner join there is no match and so they are discarded but what would happen if I went here and instead turn this into a left join what I expect to happen is that I will keep all of my characters so all of the rows from the left side table regardless of whether they have a match or not regardless of whether they have a mentor or not and so let us run this and let us see that this is in fact the case I now have a row for each of my characters and I have a row for Gandalf even though Gandalf does not have mentor and so I have a null value in here so the left join allows me to keep all of the rows of the left table now we’ve seen the inner join the left join and the right join which are really the same thing just symmetrical to each other and finally I want to show you the full outer join this is the last type of join that I want to that I want to show you now you will see that a full outer joint is like a combination of all of the joints that we’ve seen until now so a full outer join gives us all of the rows uh that have a match in the two tables plus all of the rows in the left table that don’t have a match with the right table plus all of the rows in the right table that don’t have a match in the left table so let us see how that works in practice what I have here is our usual query but now as you can see I have specified a full outer join so let us now simulate this join between the two tables now the first step as usual is to take all of the columns from the left 
table and all of the columns from the right table to get the structure of the resulting table and now I will go row by Row in the left table so as usual we have Aragorn and you know what I’m already going to copy it here because even if there’s not a match I still have to keep this row uh because this is a full outer joint and I’m basically not discarding any row now that I’ve copied it is there a match well I already know from the previous examples that there are two rows uh in the inventory table that match because they have character id one so I’m just going to take them and copy them over here and in the second row I will need to replicate these values perfect let me move on to Legolas and again I can already paste it because there’s no way that I’m going to discard this row but of course we know that Legolas has a m match and moving quickly cuz we’ve already seen this gimly has a match as well and now we come to Frodo now Frodo again I can already copy it because I’m keeping all the rows but Frodo does not have a match so just like before with the left join I’m going to keep this row but I’m going to add null values in the columns that come from the invent table so now I’ve been through all of the rows in the left table but I’m not done yet with my join because in a full outer join I have to also include all of the rows from the right table so now the question is are there any rows in the inventory table that I have not considered yet and for this I can check the inventory ID from my result 1 2 3 4 and compare it with the ID from my table 1 2 3 4 5 and then I realize that I have not included row number five because it was not selected by any match but since this is a full outer join I will add this row over here I will copy it and of course it has no correspondent uh in the left table so what do I do once again I will insert null values and that completes the first phase of my full outer join the last phase is always the same right pick the columns that are 
listed in the select so you have the ID the name Item ID and quantity and this completes my full outer join so remember how I said that a full outer join is like an inner join plus a left join plus a right join here is a visualization that demonstrates now in the result the green rows are the rows in which you have a match on the left table and the right table right and these rows correspond to the inner join and if you run an inner join this this will be the only rows that are returned right now the purple row is including a row that is present in the left table but does not have any match in the right table so if you were to run a left join what would the result be a left joint would include all of the green rows because they have a match and and additionally they would also include the purple row because in the left joint you keep all of the rows from the left if on the other hand you were to run a right join and you wouldn’t like swap the names of the tables or anything right you would do characters right join inventory you would get of course all of the green rows because they are a match Additionally you would get the blue row at the end because this row is present in the right table even though there’s no match and in the right join we want to keep all the rows that are in the right table and finally in a full outer join you will include all of these rows right so first of all all of the rows that have a match and then all of the rows in the left table even though they don’t have a match and finally all of the rows in the right table even though they don’t have a match and these are the three or four types of joint that you need to know and that you will find useful in solving your problems now here’s yet another way to think about joints in SQL and to visualize joints which you might find helpful so one way to think about SQL tables is that a table is a set of rows and that joints correspond to different ways of uh combining sets and you might remember this 
from school this is a v diagram it represents the relation uh between uh two sets and the elements that are inside these two sets so you can take set a to be our left table uh containing all of the rows from um the left table and set B to be our right table with all of the rows from the right table and in the middle here you can see that there is an intersection between the sets this intersection represents the rows that have a match uh so this would be the rows that I have colored green in our example over here so what will happen if I select if I want to see only the rows that are a match only the rows that belong in both tables let me select this now and you can see that this corresponds to an inner joint because I only want to get the rows that have a match then what would happen if I wanted to include all of the rows in the left table regardless of whether they have a match or not to what type of join does that correspond I will select it here and you can see that that corresponds to a left join the left join produces a complete set of records from table a with the matching records in table B if there is no match the right side will contain null likewise if I wanted to keep all of the rows in uh table B including the ones that match with a I would of course get a right join which is just symmetrical to a left join finally what would I have to do to include all of the rows from both tables regardless of whether they have a match or not if I do this then I will get a full outer join so this is just one way to visualize what we’ve already seen there is one more thing you can actually realize from this uh tool which is in some cases you might want to get all of the records that are in a except those that match in B so all of the record that records that a does not have in common with b and you can see how you can actually do this this is actually a left join with an added filter where the b key is null so what does that mean the meaning will be clear if I go back 
to our example for the left join you can see that this is our result for the left join and because Frodo had no match in the right table the ID column over here is null so if I take this table and I apply a filter where ID where inventory ID is null I will only get this result over here and this is exactly the one row in the left table that does not have a match in the right table so this is more of a special case you don’t actually see this a lot in practice but I wanted it wanted to show it briefly to you in case you try it and get curious about it likewise the last thing that you can do you could get all of the rows from A and B that do not have a match so the set of Records unique to table a and table B and this is actually very similar you do a full outer join and you check that either key is null so either inventory ID is null or character id is null and if you apply that filter you will get these two rows which is the set of rows that are in a and only in a plus the rows that are in B and only in B once again I’ve honestly never used this in practice I’m just telling you for the sake of completeness in case you get curious about it now a brief but very important note on how SQL organizes data so you might remember from the start of the course that I’ve told you that in a way SQL tables are quite similar to spreadsheet tables but there are two fundamental difference one difference is that each SQL table has a fixed schema meaning we always know what the columns are and what type of data they contain and we’ve seen how this works extensively the second thing was that SQL tables are actually connected with each other which makes SQL very powerful and now we are finally in a position to understand just exactly how SQL tables can be connected with each other and this will allow you to understand how SQL represents data so I came here to DB diagram. 
which is a very uh nice website for building representations of SQL data and this is uh this type of um of chart of representation that we see here is also known as ER as you can see me writing here which is stands for entity relationship diagram and it’s basically a diagram that shows you how your data is organized in your SQL system and so you can see a representation of each table uh this is the example that’s shown on the web website and so you have three tables here users follows and posts and then for each table you can see the schema right you can see that the users table has four columns one is the user ID which is an integer the other is the username which is varar this is another way of saying string so this is a piece of text rooll is also a piece of text and then you have a Tim stamp that shows when the user was created and the important thing to notice here is that these tables are actually they’re not they don’t exist in isolation but they are connected with each other they are connected through these arrows that you see here and what do these arrows represent well let’s look at the follows table okay so each row of this table is a fact shows that one user follows another and so in each row you see the ID of the user who follows and the ID of the user who is followed as well as the time when this event happened and what are these uh arrows telling us they’re telling us that the IDS in this table are the same thing as the user ID column in this table which means that you can join the follows table with the users table to get the information about the two users that are here the user who is following and the user who is followed so like we’ve seen before a table has a column which is the same thing as another tables column which means that you can join them to combine their data and this is how in SQL several tables are connected with each other they are connected by logical correspondences that allow you to join those tables and combine their data 
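To make the arrows concrete: resolving both ids in `follows` back to usernames takes two joins against `users`. This is a sketch assuming the column names from the site's sample schema (`following_user_id` and `followed_user_id` in `follows`; `user_id` and `username` in `users`):

```sql
-- Join users twice, once per arrow: u1 is the follower,
-- u2 is the user being followed.
SELECT
  u1.username AS follower,
  u2.username AS followed,
  f.created_at
FROM follows AS f
JOIN users AS u1 ON f.following_user_id = u1.user_id
JOIN users AS u2 ON f.followed_user_id = u2.user_id;
```

Joining the same table twice under two aliases is the standard way to follow two different arrows that point at the same table.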
Likewise, you have the posts table, where each row represents a post, and each post has a user id. What this arrow is telling you is that you can join to the users table using this id to get all the information you need about the user who created the post. Of course, as we have seen, you are not limited to joining the tables along these lines; you can join these tables on whatever condition you can think of. But this is a guarantee of consistency between the tables that comes from how the data was distributed; it's a promise that you can get the data you need by joining on these specific columns. And that is really all you need to know in order to get started with joins and use them to explore your data and solve SQL problems.

To conclude this section, I want to go back to our diagram and remind you that FROM and JOIN are really one and the same: they are the way for you to get the data that you need in order to answer your question. When the data is in one table alone, you can get away with just using the FROM and specifying the name of the table, but often your data will be distributed across many different tables. You can look at an ER diagram such as this one, if you have it, to figure out how your data is organized, and then, once you've decided which tables you want to combine, you can write a FROM combined with a JOIN and so create a new table which is a combination of two or more tables. All of the other operations that you've learned will then run on top of that table.

We are finally ready for an in-depth discussion of grouping and aggregations in SQL. Why is this important? Well, as you can see, I have asked ChatGPT to show me some typical business questions that can be answered by data aggregation. Let's see what we have here: what's the total revenue by quarter? How many units did each product sell last month? What is the average customer spend per transaction? Which region has the highest number of sales? As you can see, these are some of the most common and fundamental business questions that you would be asking when you do analytics, and this is why grouping and aggregation are so important when we talk about SQL.

Now let's open our data once again in the spreadsheet and see what we might achieve through aggregation. I have copied here four columns from my characters table: guild, class, level and experience, and I'm going to be asking a few questions. The first question, which you can see here, is: what are the level measures by class? What does this mean? Earlier in the course we looked at aggregations, and we called them simple aggregations because we were running them over the whole table. You might remember that if I select the values for level here, I will get a few different aggregations in the lower right of my screen. What you can see here is that I have a count of 15, which means that there are 15 rows for level; the maximum level is 40 and the minimum is 11; the average level is more or less 21.3; and if you sum all the levels, you get 319. This is already some useful information, but now I would like to take it a step further and know these aggregate values within each class. For example, what is the maximum level for Warriors, and what is the maximum level for Hobbits? Are they different? How do they compare? This is where grouping comes into play.

So let us do just that: let us find the maximum level within each class, and let us see how we might achieve it. To make things quicker, I'm going to sort the data to fit my purpose, so I will select the range over here, go to Data, Sort range, and in the advanced options say that I want to sort by column B, because that's my class. Now, as you can see, the data is ordered by class, and I can see the different values for each class. Next, I will take all the different values for class and separate them, just like this: first I have Archer, then Hobbit, then Mage, and finally Warrior. Here they are; they all have their own spot. Finally, I just need to compress each of these ranges so that each of them covers only one row. For Archer, I will take the value of the class, Archer, and then I have to compress these numbers to a single number, and to do that I will use the MAX function. This is the aggregation function we are using; quite intuitively, it looks at the list of values, picks the biggest one, and reduces everything to that biggest value, as you can also see in this tooltip over here. I do the same for Hobbit: compress all of the values to a single value by applying an aggregation function. I've gone ahead and done the same for Mage and Warrior, and all that's left is to bring these rows together, and this is my result. This does what I asked for: I was looking to find the maximum level within each class, so I have taken all the unique values of class, and, for the values of level within each class, compressed them to a single number by taking the maximum. Here I have a nice summary which shows me the maximum level for each class, and I can see that Mages are much more powerful than everyone and that Hobbits are much weaker, according to this measure. I've learned something new about my data.

Now, crucially, and this is very important: in my results I have class, which is a grouping field, and then level, which is an aggregate field. What exactly do I mean by this? Class is a grouping field because it divides my data into several groups. Based on the value of class, I have divided my data as you see here: Archer has three values, Hobbit has four values, and so on. Level is an aggregate field because it was obtained by taking a list of several values (here we have three, here we have four, and in the wild we could have a thousand, a hundred thousand, or millions, it doesn't matter; it's a list of multiple values) and compressing them down to one value. I have aggregated them down to one value, and this is why level is an aggregate field. Whenever you work with groups and aggregations, you always have this division: some fields that you use for grouping, for subdividing your data, and some fields on which you run aggregations, such as looking at a list of values and taking the maximum, the average, the minimum, and so on. Aggregations are what allow you to understand the differences between groups; after aggregating, you can say, "well, the Mages are certainly much more powerful than the Hobbits", and so on. If you work with dashboards like Tableau or other analytical tools, you will see that another way to refer to these terms is by calling the grouping fields "dimensions" and the aggregate fields "measures". I'm just leaving it here: you can say grouping field and aggregate field, or you can talk about dimensions and measures; they typically refer to the same idea.

Now let's see how I can achieve the same result in SQL. I will start a new query here, and I want to get data from fantasy.characters. After I've sourced this table, I want to define my groups, so I will use GROUP BY, which is my new clause, and then I have to specify the grouping field, the field that I want to use in order to subdivide the data; that field is class in this case. After that, I define the columns that I want to see in my result, so I will write the SELECT: first of all I want to see the class, and then the maximum level within each class. If I run this, you will see that I get exactly the same result that I have in Google Sheets. We have seen this before: MAX is an aggregation function; it takes a list of values and compresses them down to a single value. Except that before, we were running it at the level of the whole table. So if I select this aggregation alone and run it, what do you expect to see? I expect to see a single value, because it has looked at all the levels in the table and simply selected the biggest one; it has reduced all of them to a single value. However, if I run it after defining a GROUP BY, then it will not run on the whole table at once; it will run within each group identified by my grouping field and compute the maximum within that group. The result is that I can see the maximum level for each group. Now, I'm going to delete this, and note that I don't need to limit myself to a single aggregation: I can write as many aggregations as I wish. I will put this down here and give it a label so that it makes sense, and then write a bunch of other aggregations, such as COUNT(*), which is the number of values within that class; I can also look at the minimum level and the average level. Let's run this and make sure that it works. As you can see, we have our unique values for class, as usual, and for each class we can compute as many aggregated values as we want: the maximum level, the minimum level, and, since we didn't give a label to this one, we can call it average level, and then the number of values. n_values is not referring to level in itself; it's a more general aggregation which simply counts how many examples I have of each class. So I know I have four Mages, three Archers, four Hobbits and four Warriors by looking at this value over here. And here's another thing: I am absolutely not limited to the level column. As you can see, I also have the experience column, which is also an integer, and the health column, which is a floating-point number, so I can get the maximum health and the minimum experience, and it all works the same; all the aggregations are computed within each class. One thing I need to be really careful about, though, is the match between the type of aggregation that I want to run and the data type of the field on which I plan to run it. All of these shown here are number columns, either integers or floats. What would happen if I ran the average aggregation on the name column, which is a string? What do you expect? You can already see that this is an error: no matching signature for aggregate function AVG for type STRING. It's saying this function does not accept the type string; it accepts integers, floats and all types of number columns, but if you ask it to find the average of a bunch of strings, it has no idea how to do that. So I can add as many aggregations as I want within my grouping, but the aggregations need to make sense. And these expressions can be as complex as I want them to be: instead of taking the average of the name, which is a string and doesn't make sense, I could run another function inside this one, namely LENGTH. What I expect this to do is count, for each name, how long that name is; then I can aggregate all these counts by taking their average, and what I get back is the average name length within each class.
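Putting the aggregations from this section into one query, here is a sketch, assuming the `fantasy.characters` table used throughout these examples (BigQuery syntax):

```sql
SELECT
  class,
  MAX(level)        AS max_level,       -- biggest level in the class
  MIN(level)        AS min_level,
  AVG(level)        AS avg_level,
  COUNT(*)          AS n_values,        -- rows per class, not tied to level
  MAX(health)       AS max_health,
  MIN(experience)   AS min_experience,
  AVG(LENGTH(name)) AS avg_name_length  -- aggregating a computed value
FROM fantasy.characters
GROUP BY class;
```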
doesn’t sound really helpful as a thing to calculate but this is just to show you that these Expressions can get quite complex now whatever system you’re working with it will have a documentation in some place which lists all the aggregate functions that you have at your disposal so here is that page for big query and as you can see here we have our aggregate functions and if you go through the list you will see some of the ones that I’ve shown you such as count Max mean and some others that uh I haven’t shown you in this example such as sum so summing up all the values um any value which simply picks uh one value I think it it happens at random and U array a which actually built a list out of those values and so on so when you need to do an analysis you can start by asking yourself how do I want to subdivide the data what are the different groups that I want to find in the data and then after that you can ask yourself what type of aggregations do I need within each group what do I want to know um about each group and then you can go here and try to find the aggregate function that works best and once you think you found it you can go to the documentation for that function and you can read the description so Returns the average of non-null values in an aggregated group and then you can see what type of argument is supported for example average supports any numeric input type right so any data type that represents a number as well as interval which represents a space of time now in the previous example we have used a single grouping field right so if we go back here we have our grouping field which is class and we only use this one field to subdivide the data but you can actually use multiple grouping Fields so let’s see how that works what I have here is my items table and for each item we have an item type and a rarity type uh and then for each item we know the power so what would happen if we wanted to say to see the average Power by item type and Rarity 
combination one reason we might want to see this is that we might ask ourselves is within every item type is it always true that if you go from common to rare to Legendary the power increases is this true for all item types or only for certain item types let us go and find out so what what I’m going to do now is that I’m going to use two fields to subdivide my data I’m going to use item type and Rarity and to do this as a first step I will sort the data so that it makes it convenient for me so I will go here and I will say sort range Advanced ranged sorting option and first of all I want to sort by column A which is item type and I want to add another sort column which will be column B and you can see that my data has been sorted next I’m going to take each unique combination of the values of my two grouping Fields okay so the first combination is armor common so I’m going to take this here and then I’m going to to write down all the values that come within this combination so in this case we only have one value which is 40 next I have armor legendary and within this combination I only have one value which is 90 next I have armor rare So for armor rare I actually have two values so I’m going to write them here next we have potion and common for this we actually have three values so I’m going to write them here so I’ve gone ahead and I’ve done it for each combination and you can see that each unique combination of item type and Rarity I’ve now copied the re relevant values and now I need to get the average power with in these combinations so I will take the first one put it over here and then I will take the average of the values this is quite easy because there’s a single value so I’ll simply write 40 next I will take the armor legendary combination and once again I have a single value for armor rare I have two values so I will actually press equal and write average to call the the spreadsheet function and then select the two values in here to compute the average 
and here we have it and I can go on like this potion common get the average Within These values potion legendary is a single value so I’ve gone ahead and completed this and this gives me the result of my query here I have all the different combinations for the values of uh what were they item type and Rarity and within each combination the average power so to answer my question is it that within each item type the power grows with the level of Rarity where for armor it goes from 40 to 74 to 90 so yes for potion we don’t have um a rare potion but basically it also grows from common to Legendary and in weapon we have uh 74 87 and 98 so I would say yes within each item type power grows with the level of Rarity so what are these three fields in the context of my grouping well item type is grouping field and Rarity is also a grouping field and the average power within each group is a aggregate field right so I am now using two grouping fields to subdivide my data and then I’m Computing this aggregation within those groups so let us now figure figure out how to write this in SQL it’s actually quite similar to what we’ve seen before we have to take our data from the items table and then we want to group by and here I have to list my grouping Fields okay so as I’ve said I have two grouping Fields they are item type and and Rarity so this defines my groups and then in the select part I will want to see my grouping fields and then within each group I will want to see the average of power I believe we used yes so I will get the average of power and here are our results just like in the sheets now as a tiny detail you may notice that power here is colored in blue and the reason for this is that power is actually a big query function so if you do power of two three you should get uh eight because it calculates the two to to to the power of three so it can be confusing when power is the name of a column because B query might think it’s a function but there’s an easy way to 
remedy this you can just use back ticks and that’s your way of telling big query hey don’t get confused this is not the name of a function this is actually the name of a column and as you can see it also works and it doesn’t create issues and just like before we could add as many aggregations as we wanted and for example we could take the sum of power also on other fields not just on Power and everything would be computed within the groups defined by the two grouping fields that I have chosen as expected now now let us see where Group by fits in The Logical order of SQL operations so as you know a SQL query starts with from and join this is where we Source the data this is where we take the data that we need and as we learned in the join section we could either just specify a single table in the from clause or we could specify a join of two or more tables either way the result is the same we have assembled the table where our data leaves and we’re going to run our Pipeline on that data we’re going to run all the next operations on that data next the work Clause comes into play which we can use in order to filter out rows that we don’t need and then finally our group group Pi executes so the group Pi is going to work on the data that we have sourced minus the rows that we have excluded and then the group Pi is going to fundamentally alter the structure of our table because as you have seen in our examples the group I basically compresses down our values or squishes them as I wrote here because in the grouping field you will get a single Row for each distinct value and then in the aggregate field you will get an aggregate value within each class okay so if I use a group bu it’s going to alter the structure of my table after doing the group bu I can compute my aggregations like you’ve seen in our examples so I can compute uh minimum maximum average sum count and and all of that and of course I need to do this after I have applied my grouping and after that after I 
After I've computed my aggregations, I can select them: I can choose which columns to see, and this will include the grouping fields and the aggregated fields; we shall see this in more detail in a second. And then, finally, there are all the other operations that we have seen in this course. This is where GROUP BY and aggregations fit in our order of SQL operations.

Now I want to show you an error that is extremely common when starting to work with GROUP BY, and if you understand this error, I promise you will avoid a lot of headaches when solving SQL problems. I have my items table here again, you can see the preview on the right, and I have a simple SQL query: take the items table, group by item type, and then show me the item type and the average level of power within that item type. So far so good. But what if I wanted to see what I'm showing you here in the comments? What if I wanted to see each specific item, the name of that item, the type of that item, and then the average power for the type of that item? Let's look at the first item, Chain Mail Armor: this is an armor type of item, and we know that the average power for armors is 69.5, so I would like to see this row. Then let's take Elven Bow: Elven Bow is a weapon, as you can see here, and the average power for weapons is 85.58, so I would like to see that value too. Now stop for a second and think: how might I modify my SQL query to achieve this? Oh, and there is an error in the column name over here, because I actually wanted to say name; but let's see how to do it in the SQL query. You might be tempted to simply go to your query and add the name field in order to reproduce what you see here, and if I do this and run it, you will see that I get an error: select expression references column name which is neither grouped nor aggregated. Understanding this error is what I want to achieve now, because it is very important. Can you try to figure out on your own why this query is failing and what exactly this error message means?

I'm going to go back to my spreadsheet and get a copy of my items table, and as you can see I have copied the query that doesn't work over here. Let us now reproduce this query. I take the items table, here it is, and I have to group by item type; as you can see, I have already sorted by item type to facilitate our work. Then for each item we want to select the item type, so that would be armor, and we want to select the average power, which I can find by running the spreadsheet function called AVERAGE over the power values here. Then I am asked to get the name, so I take the name for armor and put it here, and here you can already see the problem we are facing: for this particular class, armor, there is a mismatch in the number of rows that each column provides. As an effect of GROUP BY item type, there is now only one row in which the item type is armor, and as an effect of applying AVERAGE to power within the armor group, there is now only one value of power corresponding to the armor group. But name is neither present in the GROUP BY nor inside an aggregate function, and that means that in the case of name we still have four values instead of one. This mismatch is an issue; SQL cannot accept it, because SQL does not know how to combine columns which have different numbers of rows. In a way, it is as if SQL were telling us: look, you told me to group the data by item type, and I did; I found all the rows that correspond to armor; then you told me to take the average of the power level for those rows, and I did; but then you asked me for name, and the item type armor has four names in it. What am I supposed to do with them? How am I supposed to combine them, to squish them into a single value? You haven't explained how to do that, so I cannot do it.

This takes us to a fundamental rule of SQL, something I like to call the law of grouping. The law of grouping is quite simple but essential: it tells you what type of columns you can select after you have run a GROUP BY, and there are basically two types. One: grouping fields, the columns that appear after the GROUP BY clause, the columns you are using to group the data. Two: aggregations of other fields, fields that go inside a MAX function, a MIN function, a SUM function, a COUNT function, and so on. Those are the only two types of columns you can select; if you try to select any other column, you will get an error, and the reason is illustrated here. After a GROUP BY, each value in the grouping fields appears exactly once, and for that value the aggregation makes sure there is only one corresponding value in the aggregated field; in this case there is only one average power number within each item type. However, for any other field, one that is not a grouping field and that you haven't aggregated, you are going to get all of its values, and then there is going to be a mismatch. The law of grouping is made to prevent this issue. Now, if we go back to our SQL, hopefully you understand better
why this error is happening; in fact, the error message makes a lot more sense after you have heard about the law of grouping: you are referencing a column, name, which is neither grouped nor aggregated. So how could we change this code to include the column name without triggering an error? We have two options: either we turn it into a grouping field, or we turn it into an aggregation.

Let's try turning it into an aggregation. Say, for example, that I wrote MIN of name; what do you expect would happen in that case? If I run this, you will see that I have my grouping by item type, the average power within each item type, and then one name. When you run MIN on a sequence of text values, it gives you the first value in alphabetical order, so we are in fact seeing the first name in alphabetical order within each item type. We have overcome the error, but this field is not actually very useful; we don't really care to see the first name in alphabetical order within each type. But at least our aggregation is making sure that there is only one value of name for each item type, so the law of grouping is respected and we don't get that error anymore.

The second alternative is to take name and add it as a grouping field, which simply means putting it after item type in the GROUP BY. Now, what do you expect to happen if I run this query? The results as shown here are a bit misleading, because the name column is actually hidden, so I will also add it to the SELECT, and as you can see I can now refer to the name column in SELECT without an aggregation. Why? Because it is a grouping field. And what do we see in the results? We have already seen what happens when you group by multiple columns: the unique combinations of those columns subdivide the data. So our values for average power are not divided by item type anymore; we no longer have the average power for armor, potion, and weapon.
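The two legal fixes above can be sketched as follows, again with Python's SQLite standing in for BigQuery on an invented items table. One caveat: SQLite is permissive where BigQuery is strict, so selecting the bare `name` column would not raise the "neither grouped nor aggregated" error here (SQLite silently picks an arbitrary row); the sketch therefore only shows the two forms that the law of grouping allows.

```python
import sqlite3

# Invented mini items table; illustrative values, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, item_type TEXT, power INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("Chain Mail Armor",      "armor",  70),
    ("Cloak of Invisibility", "armor",  40),
    ("Elven Bow",             "weapon", 75),
    ("Excalibur",             "weapon", 100),
    ("Healing Potion",        "potion", 30),
    ("Mana Potion",           "potion", 50),
])

# Fix 1: aggregate the extra column. MIN on text returns the first value
# in alphabetical order within each group, so the law of grouping holds.
fix1 = conn.execute("""
    SELECT item_type, AVG(power) AS avg_power, MIN(name) AS first_name
    FROM items
    GROUP BY item_type
    ORDER BY item_type
""").fetchall()

# Fix 2: make name a grouping field. The data is now subdivided per
# (item_type, name) combination, i.e. one group per individual item here.
fix2 = conn.execute("""
    SELECT item_type, name, AVG(power) AS avg_power
    FROM items
    GROUP BY item_type, name
""").fetchall()
```

Fix 1 respects the law of grouping but shows a mostly useless name; fix 2 subdivides the data down to one group per item, losing the per-type division, which mirrors the trade-off described above.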
Instead, we have the average power for an item of type armor called Chain Mail Armor, and there is in fact only one row like that, with power 70; likewise, we have the average power for any item called Cloak of Invisibility of item type armor, and again there is only one example of that. So we have overcome our error by adding name as a grouping field, but we have lost the original group division by item type, and we have subdivided the data to the point that it doesn't make sense anymore. As you have surely noticed by now, we made the error disappear by including name, but we haven't actually achieved our original objective, which was to show the name of each item, the item type, and then the average power within that item type.

Well, to be honest, my original objective was to teach you to spot this error and understand the law of grouping, but now you might rightfully ask: how do I actually achieve this? The answer, unfortunately, is that you cannot achieve it with GROUP BY, not in a direct, simple way. This is a limitation of GROUP BY, which is a very powerful feature but does not satisfy every requirement for aggregating data. The good news is that this can easily be achieved with another feature called window functions. Window functions are the subject of another section of this course, so I'm not going to go into depth now, but I will write the window function for you just to demonstrate that it can be done easily. I'm going to write a new query: take the items table, select the name and the item type, and then get the average of power, again using backticks so BigQuery doesn't confuse the column with the function that has the same name, and then say: take the average of power OVER (PARTITION BY item type). This is like saying the average of power based on this row's item type, and I will call it average power by type. If I select this and run the query, you will see that I get what I need: I have Chain Mail Armor, it is an armor, and the average power for an armor is 69.5. So this is how we can achieve the original objective, not with grouping but with window functions.

Now I want to show you how you can filter on aggregated values after a GROUP BY. What I have here is a basic GROUP BY query: go to the fantasy characters table, group it by class, and then show me the class and, within each class, the average of the experience for all the characters in that class; you can see the results here. Now, what if I wanted to keep only those classes where the average experience is at least 7,000? One instinct you might have is to add a WHERE filter; for example, I could say WHERE average experience is greater than or equal to 7,000, and if I run this I get an error: unrecognized name average_experience. The WHERE filter doesn't work here. Maybe it is a labeling problem: what if I write the logic instead of the label, WHERE AVG of experience is greater than or equal to 7,000? Well, an aggregate function is not allowed in the WHERE clause, so this also doesn't work. What is happening here? If we look at the order of SQL operations, we can see that the WHERE clause runs right after sourcing the data, and according to our rules, an operation can only use data produced before it and knows nothing about data produced after it. So the WHERE operation cannot have any way of knowing about aggregations, which are computed later, after it runs and after the GROUP BY, and this is why aggregations are not allowed inside the WHERE filter. Luckily, SQL provides us with a HAVING operation, which works just like the WHERE filter except that it works on aggregations, and it can work on aggregations because it happens after the GROUP BY and after the aggregations. To summarize: you can source the table and then drop rows before grouping, which is what the WHERE filter is for.
Then you can do your grouping and compute your aggregations, and after that you have another chance to drop rows, based on a filter that runs on your aggregations. Let us see how that works in practice. This is our actual result, and we want to keep only those rows where the average experience is at least 7,000. So after the GROUP BY I will write HAVING, and then AVG of experience greater than or equal to 7,000; let me remove this other part, run the query, and you can see that we get what we need.

You might be thinking: why do I have to write down the function again, can't I just use the label that I assigned? Let's try it and see: yes, this works in BigQuery. However, you should be aware that BigQuery is an especially user-friendly and fun-to-use product; in many databases this is actually not allowed, in the sense that the database will not be kind enough to recognize your label in the HAVING operation, and you will have to repeat the logic, as I'm doing now. This is why I write it like this: I want you to be aware of this limitation.

Another thing you might not realize immediately is that you can also filter by aggregated columns which you are not selecting. Let's say I wanted to group by class and get the average experience for each class, but keep only classes with a high enough average level. I am perfectly able to do that: I just have to write HAVING AVG of level greater than or equal to 20, and after I run this you will see that instead of four values I get three, so I have lost one value. Average level is not shown in the results, but I can of course show it, and you will see that all of the values that stayed respect this condition: they all have an average level of at least 20. So in HAVING you are free to write filters on aggregated values, regardless of the columns that you are selecting.

To summarize once more: you get the data that you need; you drop the rows that are not needed; you can then GROUP BY, if you want, to subdivide the data and compute aggregations within those groups; if you have done that, you have the option to filter on the result of those aggregations; and then, finally, you pick which columns you want to see and apply all the other operations that we have seen in the course.

We are now ready to learn about window functions, a very powerful tool in SQL. Window functions allow us to do computations and aggregations on multiple rows; in that sense they are similar to what we have seen with aggregations and GROUP BY. The fundamental difference between grouping and window functions is that grouping fundamentally alters the structure of the table. If I take this items table and group by item type, right now I'm looking at about 20 rows, but if I were to group, the resulting table would have only three rows, because there are only three types of items; grouping would significantly compress my table. In fact, we have seen with the law of grouping that after you apply a GROUP BY you have to work around this fundamental alteration in the structure of the table. The items table has 20 rows, but how many rows do you expect it to have after you group by item type? Three, because there are only three types of items; the table is being compressed, it is changing its structure, and the law of grouping teaches you how to work with that. It tells you that if you group by item type, you cannot just select power as is, because your table will have three rows but you have 20 values of power; you instead have to select an aggregation on power, so that those values compress down to a single value for each item type. And the same applies if you want to select name.
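The structural difference described above, grouping collapsing rows versus a window function preserving them, can be shown side by side. A minimal SQLite sketch on an invented items table (window functions require SQLite 3.25 or newer, which ships with any recent Python):

```python
import sqlite3

# Invented mini items table; illustrative values, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, item_type TEXT, power INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("Chain Mail Armor",      "armor",  70),
    ("Cloak of Invisibility", "armor",  40),
    ("Elven Bow",             "weapon", 75),
    ("Excalibur",             "weapon", 100),
    ("Healing Potion",        "potion", 30),
    ("Mana Potion",           "potion", 50),
])

# GROUP BY compresses the 6 rows down to one row per item type...
grouped = conn.execute(
    "SELECT item_type, SUM(power) FROM items GROUP BY item_type"
).fetchall()

# ...while the window function keeps all 6 rows and repeats the
# grand total of power on every one of them.
windowed = conn.execute(
    "SELECT name, item_type, SUM(power) OVER () AS total_power FROM items"
).fetchall()

print(len(grouped), len(windowed))
```

Same aggregation, two very different table shapes: three rows after grouping, six rows (with the total repeated) after the window function.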
You cannot select name as is either; you would also have to apply some sort of aggregation, for example putting the names into a list, an array, and so on. But window functions are different: window functions allow us to do aggregations, to work on multiple values, without altering the structure of the table, without changing its number of rows.

Let us see how this works in practice. Imagine that I wanted to get the sum of all the power values for my items: what is the total power across all of my items? You should already know how to get just that sum in SQL: I take my fantasy items table and select the sum of power. If I paste this query into BigQuery, I get exactly that, and this is a typical aggregation: the SUM aggregation has taken 20 different values of power and compressed them down to one value, and it has done the same to my table, squishing 20 rows down to one row. This is how aggregations work, as we have seen in the course. But what if I wanted to show the total power without altering the structure of the table, showing the total power on every row? In other words, I can take the sum of all the values of power, which is the same number we saw in BigQuery, paste it over here, and expand it: what if I could take that number and put it on every row?

Why would I want to do this? There are several things I can do with this setup. For example, I could go to Phoenix Feather, which has power 100, take that 100, divide it by the total power in this row, and turn it into a percentage, approximately 6.5 percent; thanks to this I can say: look, the Phoenix Feather covers about 6 or 7 percent of all the power in my items, of all the power in my game, and that might be useful information. A more mundane example: this could be your budget, the stuff you are spending money on, where instead of power you have the price of everything; the total sum is maybe what you spent in a month, and you want to know what percentage of your budget going to the movies covered, and so on. Now I will delete this value, because we are not going to use it, and let us see what we need to write in SQL to obtain this result.

Once again we go to the fantasy items table, and we select the sum of power just like before, except that now I add OVER, followed by an open and a closed round bracket, and this is enough to obtain the result. To be precise, when I write this in BigQuery I will also want to see a few columns, name, item type, and power, so I will need a comma at the end before the SUM of power OVER, and I will also want to give this a label, just like I have in the spreadsheet. This is the query that reproduces what you see here in the spreadsheet. How it works is that the OVER keyword signals to SQL that you want to use a window function, which means you will compute an aggregation but you are not going to alter the structure of the table; you simply take the value and put it on each row.

Now, because this is a window function, we also need to define a window. What exactly is a window? A window is the part of the table that each row is able to see. We will understand what this means in much more detail by the end of this lecture, so don't worry about it, but for now I want to show you that the place where we usually specify the window is inside these brackets after the OVER. Here we have nothing, and what that means is that the window for each row is the entire table. Pretty simple: each row sees the entire table. To understand how a window function works, we always have to think row by row, because the results can be different on different rows. So let us go row by row and figure out how this window function is working. Take the first row: what is the window in this case, meaning what part of the table does this row see? The answer is that this row sees all of the table; given that, it has to do the sum of power, so it takes the whole power column, computes a sum over it, and puts the result in the cell. Moving on to the second row: what part of the table does this row see? Once again, all of it, so it takes power, computes the sum, and puts the result in the cell. I hope you can see that the result has to be identical in every cell, in every row: every row sees the same thing and computes the same thing, and this is why every row gets the same value. This is probably the simplest possible use of a window function.

Let us now take this code to BigQuery and make sure it runs as intended. Like I said in the lecture on grouping, you will see that power appears in blue because BigQuery confuses it with its built-in functions, so it is best practice to put it into backticks to be explicit that you are referring to a column. But what you see here is exactly what we have in our sheet, and we now have this new field showing the total power on every row. Like I said, we can use this for several purposes; for example, I can show, for each item, what percentage of the total power it covers, which is what I did before in the sheet. To do this, I take the power and divide it by this window expression, which gives me the total power, and I can call this percent total power. This is actually just a division, so to see a percentage I would also multiply by 100, but we know how to do this. Looking at the result, we can see that when we have power 100 we have almost 6.5 percent of the total power, the same thing we computed before, which goes to show that you can use these fields in your calculations; and if this were your budget, you could use it to calculate what percentage of your total budget each item covers. Pretty handy.

Now, why do I have to repeat all of this logic over here? Why can't I just say: give me power divided by sum power? As you know from other parts of the course, the SELECT clause is not aware of these aliases, these labels that we are providing, so when I try this it won't recognize the label; unfortunately, if I want to show both, I have to repeat the logic. And of course I'm not limited to taking the sum: what I have here is an aggregation function just like the ones we have seen with simple aggregations and grouped aggregations, so instead of SUM I could use something like AVG, using the backticks and remembering to add the OVER, because otherwise SQL won't know it is a window function, and giving it a label; now each row shows the same value, the average of power over the whole data set. You can use basically any aggregation function you need, and it will work all the same; a few more backticks to put in here, just to be precise, but the result is what we expect.

Now let us proceed with our explorations. I would now like to see the total power on each row, but I'm no longer interested in the total power of the data set; I'm interested in the total power by item type. So if my item is an armor, I want to see the total power of all armors; if my item is a potion, the total power of all potions.
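The empty-window grand total and the percent-of-total calculation described above can be sketched like this, again with an invented items table in SQLite. Note how the window expression has to be repeated in the percentage column, because SELECT aliases cannot be reused inside the same SELECT, exactly the limitation mentioned above.

```python
import sqlite3

# Invented mini items table; illustrative values, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, item_type TEXT, power INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("Chain Mail Armor",      "armor",  70),
    ("Cloak of Invisibility", "armor",  40),
    ("Elven Bow",             "weapon", 75),
    ("Excalibur",             "weapon", 100),
    ("Healing Potion",        "potion", 30),
    ("Mana Potion",           "potion", 50),
])

# OVER () with an empty window: every row sees the whole table, so the
# same grand total appears on every row, and we can divide by it.
rows = conn.execute("""
    SELECT name,
           power,
           SUM(power) OVER ()                 AS total_power,
           100.0 * power / SUM(power) OVER () AS pct_of_total
    FROM items
    ORDER BY pct_of_total DESC
""").fetchall()
for r in rows:
    print(r)
```

The strongest item in this made-up data set ends up first, carrying the same total as every other row plus its share of it as a percentage.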
And so on: I want to compare items within their category, not compare every item with every other item. How can I achieve this in the spreadsheet? Let us start with the first row. I need to check what item type I have, and conveniently I have sorted the data, so we can be quicker. We have an armor, so I want the total power for armors; what I can do is use the SUM function, being careful to select only rows where the item type is armor, and this is what I get. The next step is to simply copy this value and fill it into all of the rows which are armor; again, you have to be careful, because the spreadsheet wants to complete the pattern, but what I want is the exact same number. Then all of the rows with item type armor have this value, because I'm looking within the item type. Now I do the same for potion: I get the sum of power for all items that are potions, 239, copy the exact same value, and extend it to all potions. Next we have weapons: the sum of all power for weapons is here; copy it, and let's see whether the spreadsheet tries to complete the pattern; it does, so I'm just going to paste the value and make this a bit nicer. Now I have what I wanted: each row shows the total power within the items of the same type as the one in that row.

How can I write this in SQL? Two parts of this query will be the same, because we want to get the items table and see these columns, but we need to change how we write the window function. Once again I want the sum of power, but now I need to define a specific window. Remember, the window defines what each row sees; so what do I want the first row to see when it takes the sum of power? I want it to see only rows which have the item type armor, or in other words, all the rows with the same item type, and I can achieve this in the window function by writing PARTITION BY item type. Defining the window as a partition by item type means that each row will look at its item type and then partition the table so that it only sees rows which share that item type. So this row over here sees only these four rows; it takes the sum of power and puts it in the cell, and for the second, third, and fourth rows the result is the same, because each of them sees this same part of the table. When we come to a potion, this row says: what is my item type? It is potion; then I only look at rows with item type potion; so this is the window for these four rows, and within those rows we take power and sum it. Finally, when we come to these last rows, starting with this one, each looks at its item type, weapon, and looks at all the rows that share that item type, so its window looks like this (let me color it properly); it takes the sum of the values of power that fall inside the window and puts the result in the cell; the second cell sees the same window, sums over the same values, and so on. This is how we get the required result; this is how we use partitioning in window functions.

Let's go to BigQuery and make sure this actually works. When I run this, I didn't put a label, but you can see that I'm getting the same result: when I have a weapon I see a certain value, when I have a potion I see another one, and when I have an armor I see a third value. So now, for each item, I am seeing the total power not over the whole table but within its item type.
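The PARTITION BY window just described can be sketched as follows (invented SQLite items table again; only the partitioning clause changes compared to the empty-window version):

```python
import sqlite3

# Invented mini items table; illustrative values, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, item_type TEXT, power INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("Chain Mail Armor",      "armor",  70),
    ("Cloak of Invisibility", "armor",  40),
    ("Elven Bow",             "weapon", 75),
    ("Excalibur",             "weapon", 100),
    ("Healing Potion",        "potion", 30),
    ("Mana Potion",           "potion", 50),
])

# Each row's window is restricted to rows sharing its item_type,
# so every row shows the total power of its own category.
rows = conn.execute("""
    SELECT name, item_type,
           SUM(power) OVER (PARTITION BY item_type) AS type_total
    FROM items
    ORDER BY item_type, name
""").fetchall()
for r in rows:
    print(r)
```

All six rows survive, but now the repeated total differs per category rather than being one grand total.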
Next task: find the cumulative sum of power, which is this column over here. What is a cumulative sum? It is the sum of the power of this item plus the power of all the items that are less powerful. To do this in the spreadsheet I first want to reorder my data, because I want to see it simply in order of power, so I take this whole range, go to Data, Sort range, use the advanced options, say that the data has a header row so that I can see the names of the columns, and order by power ascending. As you can see, my records are now sorted in order of ascending power. Now, how do I compute the cumulative sum of power? In the first row all we have is 30, so the sum is 30. In the second row I have 40, plus the 30 before it, so I have 70. When it comes to this row, I have 50, and the sum up to now was 70, which I can see by looking at those two cells, or more simply by looking at the last cell, so 50 plus 70 is 120, and proceeding like this I could compute the cumulative power over the whole column.

For your reference, I have figured out the Google Sheets formula that computes the cumulative sum of power for our example, and I went ahead and computed it for all our data. The formula is right here; I'm not going to go in depth into it, because this is not a course on spreadsheets, but I will show it to you in case you're curious. The SUMIF function takes the sum over a range, but only considers values that satisfy a certain logical condition. The first argument is the range we want to sum over, which is power, and the criterion, what needs to be true for a value to be considered, is that the value is less than or equal to the level of power in this row. So the formula says: take the level of power in this row, take all the values of power which are less than or equal to it, and sum them up. This is exactly what our window function does, so the formula reproduces it. If you go and look up how to do a cumulative sum or running total in Google Sheets, there are other solutions, but they come with some pitfalls, some corner cases; this formula actually reproduces the behavior of SQL.

Now let us go back to SQL and see how we would write this. I take the fantasy items table, I still select the columns, and now I have to write my window function. The aggregation is just the same, the sum of power, but now my window is defined not by a partition but by an ordering: ORDER BY power. When I say ORDER BY power in a window function, the implicit keyword is ASC, for ascending, meaning the window orders power from the smallest to the biggest; I can choose whether to write the keyword or not, because just like ORDER BY elsewhere in SQL, the default is ascending.

How does this window work? Start with the first row, where we need to fill in this value. I look at my power level, which is 30, and the window says I can only see rows where the power level is equal or smaller; the rows where power is 30 or less are these ones here, so effectively this is the only part of the table this window sees on the first row, and the sum of power over it is 30. Move on to the second row: the power level is 40, the window only sees rows where power is smaller or equal, which includes these two rows; the sum of power is 70; put it in the cell. Third row: power level 50, it sees only these rows, the sum is 120, put it in the cell. I can continue like this until I get to the highest value in the data set, 100; never mind that it is not literally the last row, because both of the last two rows share the highest value. When you come to this row, look at 100, and ask what the window is, you can see all rows where power is 100 or less, which is all of the table; so when you take the sum of power, you get the total sum, and in fact you can see that in this case the cumulative power equals the total power we computed before, just as we would expect. This is easy to see here because we have ordered our data conveniently, but it works in any case. What ORDER BY does in a window function is make sure that each row only sees the rows which come before it given your ordering: if I order from the smallest power to the biggest, each row only sees rows with the same level of power or lower, never a higher one.

Let us now take it to BigQuery and make sure it works as intended; I will also add an ordering by power, and here I see the same thing I showed you in the spreadsheet. I notice that some numbers are different, these two items have 90 instead of 100, but never mind: the logic is the same and the numbers make sense. I'm also able to change the direction of the ordering: let's say I copy this field just the same, except that instead of ordering by power ascending I order by power descending. What do you expect to see in this case? Let's take a look. What I see here is that each item looks at its level of power and then considers only items that are just as powerful or more powerful; it is the exact same logic, but reversed. When you look at the weakest item, a potion with power 30, it is looking at all the items, because there is no weaker item, so it finds the total level of power in our data set.
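The ascending running total can be sketched as below. One detail worth noting: with ORDER BY and no explicit frame, the SQL default frame is RANGE-based, so tied rows are peers and each sees the full set of equal-or-smaller values, which matches the lecture's "equal or smaller" description; the invented data here deliberately includes two items tied at power 100 to show that (a descending variant would just say ORDER BY power DESC).

```python
import sqlite3

# Invented mini items table, with a tie at the top to show peer handling.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, item_type TEXT, power INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("Chain Mail Armor",      "armor",  70),
    ("Cloak of Invisibility", "armor",  40),
    ("Elven Bow",             "weapon", 75),
    ("Excalibur",             "weapon", 100),
    ("Phoenix Feather",       "potion", 100),
    ("Healing Potion",        "potion", 30),
    ("Mana Potion",           "potion", 50),
])

# Each row sums the power of all rows with power <= its own power.
rows = conn.execute("""
    SELECT name, power,
           SUM(power) OVER (ORDER BY power) AS running_total
    FROM items
    ORDER BY power
""").fetchall()
for r in rows:
    print(r)
```

Both rows tied at power 100 show the grand total, just as the lecture observes for the two strongest items.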
But if you go to the strongest item, Excalibur, it has a power level of 100, and there are only two items in the whole data set with this power level: Excalibur itself and the Phoenix Feather. If you sum the power over those, you get 200. So it is the exact same logic, but now each row only sees items that have the same level of power or higher. When you order inside a window function, you can decide the direction of the ordering using DESC or ASC, or, if you are a lazy programmer, you can omit the ASC keyword and it will work just the same, because that is the default.

Finally, we want to compute the cumulative sum of power by type, and you might notice that this is, in a way, the combination of the two previous requirements. Let us see how to do that. The first thing I want to do is sort our data to help us: I select this whole range, say Sort range, use the advanced options, mark the heading row, and order first by type and then, within each type, by power; this is our data now. For each item I want to show the cumulative sum of power, just like I did here, except that now I only want to do it within the same item type. If we look at armor, it is already sorted: I have power 40, the smallest, so I just put 40 over here. Next, still armor, so I look at these two values and sum them up to 70; then the running total becomes 78, the sum of the first three values; and finally 90, the sum of all of them. Now I'm done with armor and beginning a new item type, so I have to start all over again. Looking at potions: we start with 30, the smallest value; then we move to 50, so this cell sees 30 and 50, which is 80; add 60 to 80, which is 140; and finally we add 99 to 140, which is another way of saying that we add up all the values for potion. So this is what we want: the cumulative sum of power within the item type, starting over whenever we find a new type. To calculate it for weapon, I could copy my formula from here, paste it at weapon, and then I would need to modify it so that the range only includes weapons, so from C10; C10 is the first one, and the value to compare against has to be C10 as well, because I want to start from the power level of the first weapon. For some reason it shows in purple, but it should be correct: it should always add to the sum of the previous values; we start with 65, then 65 plus 75, and so on. This is our result, the cumulative power within the item type.

To write this in SQL, I take my previous query, and when we define the window we simply combine what we have done before, the PARTITION BY with the ORDER BY, and you need to write them in this order: first the partition and then the ordering. So I will PARTITION BY item type and ORDER BY power ascending, and this achieves the required result. For each row in this field, the window is defined as follows: first, PARTITION BY item type, so you can only see rows which have the same item type as you; then, within this partition, the ordering means you keep only the rows where the power is equal to or smaller than your own. In the case of the first armor item, you only get this one row; likewise, the first potion item only gets its own row. If you look at the second armor item, again it partitions, so it looks at all the items which are armor, but then it discards those with a bigger power than itself, so it is looking at these two rows. And if, for example, we look at the last row over here, it says: OK, I am a weapon, so I can only see weapons.
can only see those that have a level of power that’s equal or smaller than mine and that checks out those are all the rows and in fact the sum over here is equal to the sum of Power by type which is what we would expect so once again let us verify that this works in Big query and I will actually want to order by item type and power just so I have the same ordering as in my sheet and I should be able to see that within armor you have this like growing uh cumulative sum and then once the item changes it starts all over right it starts again at the value it grows it grows it accumulates and then we’re done with potions and then we have weapons and then again it starts and then it grows and it goes all the way to include the total sum of all powers in the weapon item type so here’s a summary of all the variants of Windows that we’ve seen we have seen four variants now in all of those for clarity we’ve kept the aggregation identical right we are doing some over the power field but of course you know that you can use any aggregate function here on any column which is compatible with that aggregate function and then we have defined four different Windows the first one is the simplest one there’s actually nothing in the definition we just say over and this means that it will just look at all the table so every row will see the whole table and so every row will show you the total of power for the whole table simple as that the second window is introducing a partition by item type and what this means in practice is that each row will uh look at its own item type and then only consider rows which share the same exact item type and So within those rows it will calculate the sum of power third window we have an ordering field so what this means is that each row is going to look at its level of power because we are ordering by power and then it’s going to only see rows where the power level is equal or smaller and the reason why we’re looking in this direction is that when we 
order by power is implicitly uh understood that we want to order by power ascending If instead we ordered by power descending it would be the same just in the opposite direction each row would would look at its level of power and then only consider rows where power is equal or bigger and then finally we have a combination of these two right a we have a window where we use both a partition and an order and so what this means is that uh each row is going to look at its item type and discard all of the rows which don’t have the same item type but then within the rows that remain it’s going to apply that ordering it’s going to only consider rows which have the same level of power or lesser so it’s simply a combination of these two conditions and this is the gist of how window functions work first thing to remember window function provide aggregation but they don’t change the structure of the table they just insert a specific value at each row but after applying a window function the number of rows in your table is the same second thing thing to remember is that in the window definition you get to Define what each row is able to see when Computing the aggregation so when you are thinking about window function you should be asking yourself what part of the table does each row see what’s the perspective that each row has and there are two dimensions on which you can work in order to Define these windows one is the partition Dimension and the other is the ordering Dimension the partition Dimension Cuts up the table based on the value of a column so you will only keep rows that have the same value the order Dimension Cuts up the table based on the ordering of a field and then depending on ascending or descending depending on the direction that you choose you can you can look at rows that are after you in the ordering or you can look at rows that are before you in the ordering and you can pick either of these right either partitioning or ordering or you can combine them and 
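As a runnable sketch of these four variants side by side (the mini-inventory here is invented, and SQLite's window functions stand in for BigQuery — this assumes a Python whose bundled SQLite is 3.25 or newer, which is where window-function support arrived):

```python
import sqlite3

# Invented mini-inventory; only the shape of the data matters.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE inventory (item_name TEXT, item_type TEXT, power INTEGER);
INSERT INTO inventory VALUES
  ('Leather Vest', 'armor',  40),
  ('Chain Mail',   'armor',  70),
  ('Plate Armor',  'armor',  78),
  ('Mana Potion',  'potion', 30),
  ('Elixir',       'potion', 50),
  ('Short Sword',  'weapon', 65),
  ('Excalibur',    'weapon', 100);
""")

# Same aggregate, four windows: the whole table, one partition per type,
# a running total over the power ordering, and a running total per type.
rows = list(con.execute("""
    SELECT item_name, item_type, power,
           SUM(power) OVER ()                                      AS total_power,
           SUM(power) OVER (PARTITION BY item_type)                AS power_by_type,
           SUM(power) OVER (ORDER BY power)                        AS running_power,
           SUM(power) OVER (PARTITION BY item_type ORDER BY power) AS running_by_type
    FROM inventory
    ORDER BY item_type, power;
"""))

for r in rows:
    print(r)
```

Every row shows the same total_power (433 for this data), power_by_type repeats within each type, running_power accumulates across the whole table, and running_by_type accumulates and then resets at each new type — the four behaviors summarized above.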
By using these, you can define all of the windows you might need to get your data.

Now, as a quick extension, I want to show you that you're not limited to defining windows on single fields, single columns — you can list as many columns as you want. In this example I go to the fantasy characters table, select a few columns, and define an aggregation in a window function: I take the level field and sum it up, and then I partition by two fields, guild and is_alive. What do you expect to happen? This is the exact same logic as grouping by multiple fields, which we saw with GROUP BY: the data is divided not by guild alone, nor by whether the character is alive alone, but by all the combinations of values of these fields. So Mirkwood and true is one combination, and the characters in it go together: we have two characters here, levels 22 and 26, whose sum is 48, and you can see they both get 48 for the sum of level. Likewise, the three Shirefolk/true characters all end up in the same group and share the same sum of level, 35. But Shirefolk/false is another group, and that character is alone: their level is 12, so the sum is 12. Again: when you partition by multiple fields, the data is divided into groups obtained from all the combinations of values those fields can take, and if you experiment a bit yourself, you should find it easier to convince yourself of this.

The same idea applies to the ORDER BY part of a window. Until now, for simplicity, we've ordered by one field — and to be honest, most of the time one field is all you'll need — but sometimes you may want to order by several. In this example, we define our ordering on two fields: power, and then weight.
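Both extensions can be sketched in one small runnable example (the tables, names, and values are made up to mirror those described, with SQLite standing in for BigQuery):

```python
import sqlite3

# Made-up stand-ins for the tables described above; only the shape matters.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE characters (name TEXT, guild TEXT, is_alive INTEGER, level INTEGER);
INSERT INTO characters VALUES
  ('Aeris', 'Mirkwood',  1, 22),
  ('Borin', 'Mirkwood',  1, 26),
  ('Cale',  'Shirefolk', 1, 10),
  ('Dara',  'Shirefolk', 1, 12),
  ('Edda',  'Shirefolk', 1, 13),
  ('Finn',  'Shirefolk', 0, 12);

CREATE TABLE inventory (item_name TEXT, power INTEGER, weight INTEGER);
INSERT INTO inventory VALUES
  ('Short Sword',     65,  8),
  ('Phoenix Feather', 100, 1),
  ('Excalibur',       100, 20);
""")

# Partitioning by two fields: one group per (guild, is_alive) combination.
by_group = list(con.execute("""
    SELECT name, guild, is_alive,
           SUM(level) OVER (PARTITION BY guild, is_alive) AS level_by_group
    FROM characters ORDER BY guild, is_alive DESC, name;
"""))

# Ordering by two fields: weight breaks the tie between the two power-100 items.
running = list(con.execute("""
    SELECT item_name, power, weight,
           SUM(power) OVER (ORDER BY power, weight) AS running_power
    FROM inventory ORDER BY power, weight;
"""))

for row in by_group + running:
    print(row)
```

With ORDER BY power alone, the two 100-power items would be peers and would share one running total; adding weight to the ordering separates them.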
Based on that ordering, we calculate the sum of power. This is again a cumulative sum, but now the ordering is different, and you'll see it if we go to the most powerful items in our data — the last two, both at power 100. If you remember, when we ordered by power alone, these two rows had the same value in the window function, because ordering just by power makes them peers: they both have 100. But now we're also ordering by weight — ascending, from the smallest weight to the biggest — so the Phoenix Feather comes first: although it has the same power as Excalibur, the Phoenix Feather is lighter, and because it comes first, it gets a different value for the aggregation.

Of course, we have the power to say ascending or descending on each of the fields we order by. If I wanted to reverse this, I could simply write descending after the weight — and be careful, in that case descending refers only to weight, not to power. It's just as if I had written power ascending, weight descending: the ascending can be omitted because it's the default, but I would write both to be clear. And if I run this, the result is reversed: Excalibur comes first, because with weight descending the heavier item leads, and the Phoenix Feather, which is lighter, comes last. Again, understanding this theoretically is one thing, but I do encourage you to experiment with your own data and exercises — that's how you internalize it.

And now we are back to our schema for the logical order of SQL operations, and it is finally complete, because we've seen all of the components we can use to assemble a SQL query. So where do window functions fit in? As you can see, we've placed them right here. What happens is: you get your data, then the WHERE filter runs, dropping the rows you don't need. Then you have a choice of whether to do a GROUP BY. If you group, you change the structure of your table: it no longer has the same number of rows, but a number of rows that depends on the unique values of your grouping field — or the unique combinations of values, if you used more than one field. If you group, you'll probably compute some aggregations, and then you may want to filter on those aggregations with HAVING, dropping rows based on the aggregated values. And here is where window functions come into play: it is on this result that they operate. If you haven't done a GROUP BY, window functions work on your data after the WHERE filter runs; if you have done a GROUP BY, window functions work on the result of your aggregation. After applying the window function, you can SELECT which columns to show and give them labels, and then all the other parts run: you can drop duplicates from your result, meaning duplicate rows — rows with the same value in every column; you can stack different tables together, putting them on top of each other; and finally, when you have your result, you can apply an ordering, and you can cut the result with a limit so you only show a few rows. This is where window functions fit into the big scheme of things.

There are some other implications of this ordering. One interesting one is that if you have computed aggregations, such as the sum of a value within a class, you can actually use those aggregations inside a window function — an aggregation of an aggregation, so to speak. But that is, in my opinion, an advanced topic; it doesn't fit into this fundamentals course, though it may fit someday in a later, more advanced one.

Now I want to show you another type of window function, one that is very commonly used and very useful in SQL challenges and SQL interviews.
These are numbering functions: functions we use to number the rows in our data according to our needs. There are several numbering functions, but the three most important ones are, without any doubt, ROW_NUMBER, DENSE_RANK, and RANK. Let's see how they work in practice.

What I have here is part of my inventory table: I'm showing the item ID and the value of each item, and conveniently I've ordered the rows by value ascending. Now we're going to number the rows according to value using these window functions. I've already written the query I want to reproduce: I go to the fantasy inventory table, select the item ID and item value as you see here, and then use three window functions. The syntax is the same as in the previous exercises, except that now I'm not applying an aggregate function to a field, like when I computed the sum of power; I'm using another type of function, a numbering function. These functions don't actually take a parameter — as you can see, there's nothing between the round brackets, because there's no argument to provide; all I need to do is call the function. What's really important here is to define the correct window, and in these three examples the windows are all the same: I simply order my rows by value ascending. That means that when the window function is computed, every row looks at its own value and says: I only see rows where the value is the same or smaller; I can't see rows with a bigger value. So the first row sees only the value 30, the second row sees the first two, the third row sees the first three, and so on, up to the last row, which sees itself and all the other rows.

Let's start with ROW_NUMBER. ROW_NUMBER uses this ordering to number my rows, and it's as simple as putting 1 in the first row, 2 in the second, then 3, 4, and so on; extend the pattern and every row gets a number. That's all ROW_NUMBER does: it assigns a unique integer to every row, based on the ordering defined by the window. You might think: big deal — don't I already have row numbers here in the spreadsheet? But in SQL problems you often need to number things based on different values, and ROW_NUMBER allows you to do that; you can also have many different numberings coexisting in the same table, based on different conditions, and that comes in handy, as you'll discover when you do SQL problems.

Now let's move on to ranking, starting with DENSE_RANK. Ranking is another way of counting, but it's slightly different. Sometimes you just want to count things, as we did with ROW_NUMBER: say you're a dog sitter, you're given 20 dogs, you're getting confused between all their names, so you assign a unique number to every dog to identify them — and maybe sort them by age, or by how much you're getting paid to sit them. Other times you want to rank things, as when choosing which product to buy or reporting the results of a race. The difference between ranking and counting shows up when values are tied. When you simply want to assign a different number to each element, as we did here, and two things have the same value, you don't really care: you arbitrarily decide that one of them is number two and the other is number three. But you can't do that with ranking. If two students in a classroom get the best score, you can't randomly choose one to be number one and the other number two — they both have to be number one. And if two people finish a race at the same time, and it's the best time, you can't say one won the race and the other arbitrarily came second — they both have to share that rank. This is where ranking differs.

So let's apply DENSE_RANK. We're ordering by value ascending, which means the smallest value gets rank one, so 30 has rank one. Then we go to the second row — and remember, with window functions you always have to think row by row: what each row sees, and what each row decides. This row only sees the values up to its own, and it has to decide its rank; it says: I'm not number one, because there's a value smaller than me, so I must be number two. Then the third row sees all the values equal to or smaller than its own, and says: I'm not number one, because there's something smaller; but the value 50 ahead of me has rank two, and I have the same value, 50 — we arrived at the same spot, so I must have the same rank. And this is the difference between ROW_NUMBER and rank: identical values get the same rank, but they don't get the same row number. Next comes the row with 60, which looks back and says: 30 is the smallest, so it has rank one; the two 50s share rank two; but I am bigger, so I need a new rank — and the new rank I pick is three, because it's the next number in the sequence. The next row picks four, the next five, then six, and it proceeds: 7, 8, 9, 10, 11 — and careful here, two rows share the same value, so they are both 11 — then 12, then 13 shared again by two equal values, then 14 for the value 1700, and 14 again, then 15, then 16. This is what we expect to see when we compute the DENSE_RANK.

Finally we come to RANK. RANK is very similar to DENSE_RANK, but there is one important difference. Let's do it again: the smallest value has rank one, as before; then 50 has rank two, and the second 50 once more shares rank two. Now we move from 50 to 60, so we need a new rank — but instead of three, we put four. Why four? Because the previous rank covered two rows: the rank two sort of expanded and ate the three, so by the rules of RANK we have to skip the three and put four here. It's just another way of managing ranking, and you'll notice it conveys an extra piece of information compared to DENSE_RANK: not only can I see that this row has a different rank than the previous one, I can also see how many rows were covered by the previous ranks — they must have involved three rows, because I'm at four already. That information was not available with DENSE_RANK. Continuing: the next new value gets rank five, then 6, 7, 8, 9, 10, 11; now rank 12, shared again because of two identical values; and because the two 12s have eaten up two spots, I can't use 13 anymore — the second 12 has eaten the 13 — so I jump straight to 14; then 15, 15 again; then I have to jump to 17, because 15 took two spots; 17 again; then jump to 19; and finally 20. So the final number is 20 for RANK, just as with ROW_NUMBER, because RANK doesn't only differentiate between ranks — it also counts how many elements came before: I can tell there are 19 rows in the previous ranks, because of how RANK works. With DENSE_RANK, on the other hand, we only got up to 16, so we lost the information about how many records we have. This might be one of the reasons why this is the default method of ranking, even though DENSE_RANK feels more intuitive when you build the ranking by hand.

We can now take this query — hopefully I've written it correctly — go to BigQuery, and try to run it. As you can see, we have our items, sorted by value, and then our numbering functions. ROW_NUMBER goes from 1 to 20 without any surprises, since it's just numbering the rows. DENSE_RANK has rank one for the first row, then the next two rows share the same rank because they both have 50, and the next rank is three — just as I showed you in the spreadsheet; similarly, here you have 11, 11, and then 12. RANK starts off the same — the smallest value has rank one, and the next two values have rank two — but after using up two and two, it's as if you've used up the three, so it jumps straight to four; after 15 and 15 it jumps straight to 17; after 17 and 17 it jumps straight to 19; and the highest number is 20, which tells you how many rows you're dealing with.

Of course, these are window functions, and they work just as I've shown you: you could take RANK and order by value descending, and you'd find the inverse of that ranking, in the sense that the highest-value item gets rank one and the lowest-value item ends up with the biggest rank number. And RANK is often used like this: the thing that has the most of what we want — the biggest salary, the biggest value, the most successful product — is ranked first, like the winner of a race, and everyone else goes from there. So in practice we often order by something descending when we calculate a rank.

And because these numbering functions are window functions, they can also be combined with PARTITION BY if you want to cut the data into subgroups. Here's an example on the fantasy characters table: we partition by class, meaning each row only sees the rows that share its class — archers only care about archers, warriors only care about warriors, and so forth — and then, within the class, we order by level descending, so the highest levels come first, and we use that to rank the characters. If I look at the result, within the archers the highest-level archer has level 26 and gets rank one, and all the others go down from there; then we have the warriors, whose highest-level character is at 25 and also gets rank one, because they're being ranked within warriors. It's like a race with categories: many people arrive first, because they arrive first in their category; it's not that everyone competes with everyone. Each class of character has its own dedicated ranking. You can check the BigQuery page on numbering functions if you want to learn more: you'll find the ones we've talked about — RANK, ROW_NUMBER, and DENSE_RANK — and a few more, but these are the ones most commonly used in SQL problems. And I know it can be a bit confusing to distinguish between ROW_NUMBER, DENSE_RANK, and RANK.
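As a quick runnable check on the three functions (the item values here are hypothetical, chosen to include ties, with SQLite standing in for BigQuery):

```python
import sqlite3

# Hypothetical items with tied values, to expose the differences
# between ROW_NUMBER, DENSE_RANK, and RANK.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE inventory (item_id INTEGER, value INTEGER);
INSERT INTO inventory VALUES (1, 30), (2, 50), (3, 50), (4, 60), (5, 60), (6, 70);
""")

rows = list(con.execute("""
    SELECT item_id, value,
           ROW_NUMBER() OVER (ORDER BY value) AS row_num,
           DENSE_RANK() OVER (ORDER BY value) AS dense_rnk,
           RANK()       OVER (ORDER BY value) AS rnk
    FROM inventory
    ORDER BY value, item_id;
"""))

for r in rows:
    print(r)
# ROW_NUMBER gives every row a unique number (1..6; ties are broken arbitrarily).
# DENSE_RANK gives 1, 2, 2, 3, 3, 4 -- ties share a rank, no numbers skipped.
# RANK gives 1, 2, 2, 4, 4, 6 -- ties share a rank and "eat" the following spots.
```
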
Here's a visualization you might find useful. Say we have a list of values — these ones — ordered in descending order, with quite some repetition. Given this list, how would the different numbering functions treat it? ROW_NUMBER is easy: it just assigns a unique number to each row. It doesn't matter that values are sometimes the same — you arbitrarily pick one to be 1 and the other to be 2, then you have 3, and for the three 10s it doesn't matter either, you just order them, 4, 5, 6, and finally 7. DENSE_RANK does care about values being the same: the two 50s both get 1, 40 gets 2, the 10s get 3, and 5 gets 4 — the rank just grows through all the integer numbers. RANK also assigns rank 1 to both 50s, but it also throws away the 2, because there are two elements in that first group; the next value then gets rank 3, because the 2 has already been used; the next batch of 10s gets rank 4 while burning 5 and 6; and the last value can then only get rank 7. These are the differences between ROW_NUMBER, DENSE_RANK, and RANK, visualized.

We have now reached the end of our journey through the SQL fundamentals. I hope you enjoyed it and that you learned something new. You hopefully now have some understanding of the different components of SQL queries, the order in which they run, and how they come together to let us do what we need with the data. Of course, learning the individual components and understanding how they work is only half the battle; the other half is: how do I put these pieces together, and how do I use them to solve real problems? In my opinion, the answer to that is not more theory — it's exercises. Go out there and do SQL challenges and SQL interviews, find exercises, or, even better, find some data you're interested in, upload it to BigQuery, and try to analyze it with SQL.

I should let you know that I have another playlist where I solve 42 SQL exercises in PostgreSQL, and I think it can be really useful for the other half of the course: doing exercises and learning how to face real problems with SQL. I really like that playlist because I use a free website that doesn't require any sign-up or login — it just works — so you can go there, try all of the exercises, which cover all the theory we've seen in this course, and after trying each one yourself, watch me solve it and hear my thought process and explanation. It could be really useful if you want to deepen your SQL skills.

In terms of how to put it all together, I want to leave you with another resource I've created: a table showing the fundamental moves you'll need whenever you do any kind of data analytics. I believe that every sort of analysis you might work on, no matter how simple or complicated, can ultimately be reduced to these few basic moves — and they should be quite familiar to you by now.

We have joining, where we combine data from multiple tables based on connections between columns; in SQL you do that with JOIN. Then we have filtering: picking certain rows and discarding others — say, looking only at customers that joined after 2022. There are a few tools for that in SQL. The most important is the WHERE filter, which comes into action right after your data is loaded and decides which rows to keep and which to discard. HAVING does just the same, except that it works on aggregated fields — fields you've obtained after a GROUP BY. QUALIFY we actually haven't seen in this course, because it's not a universal component of SQL — certain systems have it, others don't — but it is basically also a filter, and it works on the results of window functions. And finally you have DISTINCT, which runs near the end of your query and removes all duplicate rows.

Then there's grouping and aggregation, which we've seen in detail: you subdivide the data along certain dimensions and calculate aggregate values within those dimensions — fundamental for analytics. How do we aggregate in SQL? With GROUP BY and with window functions, and for both of them we use aggregate functions such as SUM, AVG, and so on. Then we have column transformations, where you apply logic and arithmetic to transform columns, combine column values, and take the data you have in order to compute the data you need. We do this where we write the SELECT — we can write calculations involving our columns; we have CASE WHEN, which gives us a sort of branching logic to decide what to do based on conditions; and we have a lot of functions that make our life easier by doing specific jobs. Next is UNION, which is pretty simple: take tables that have the same columns and stack them together, meaning put their rows together. And finally sorting, which changes how your data is ordered when you get the result of your analysis, and which is also used inside window functions to number or rank our data.

These really are the fundamental elements of every analysis and every SQL problem you'll need to solve. So one way to face a problem, even one you're finding difficult, is to come back to these fundamental components and think about how you need to combine them — how to take your problem and break it down into simpler operations involving these steps.

Now, at the beginning of the course I promised that we would solve a hard SQL challenge together at the end, so here it is: let's try to solve this challenge by applying the concepts from the course. As a quick disclaimer, I'm picking a hard challenge because it's sort of fun, because it gives us a playground to showcase several concepts we've seen in the course, and because I'd like to show you that even big, scary challenges — marked as hard, with "Advanced" in their very name — can be tackled by applying the basic concepts of SQL. However, I don't intend for you to jump into hard challenges from the very start. It's much better to begin with basic exercises, take them step by step, and make sure you're confident with the basic steps before you move on to more advanced ones. So if you have trouble approaching this problem, or even understanding my solution, don't worry about it: just go back to your exercises, start from the simple ones, and gradually build your way up.

That being said, let's look at the challenge: "Marketing Campaign Success [Advanced]" on StrataScratch. First of all, we have one table to work with for this challenge, marketing_campaign, which has a few columns: user_id, created_at, product_id, quantity, and price. When I'm looking at a new table, the one question I must ask to understand it is: what does each row represent? Just by looking at this table I can form some hypotheses, but I'm actually not sure, so I'd better go and read the text until I can get a sense of it. Let's scroll up and read: "You have a table of in-app purchases by user." Okay, this explains my table — each row represents an event, that is, a purchase. It means that user 10 bought product 101 in a quantity of 3 at a price of 55, and created_at tells me when this happened: the 1st of January 2019. Great — now I understand my table, and I can see what the problem wants from me. Let's go on and read the question: "You have a table of in-app purchases by users. Users that make their first in-app purchase are placed in a marketing campaign where they see call-to-actions for more in-app purchases. Find the number of users that made additional purchases due to the success of the marketing campaign. The marketing campaign doesn't start until one day after the initial in-app purchase, so users that made one or multiple purchases on the first day do not count, nor do we count users that over time purchase only the products they purchased on the first day."

All right, that was a mouthful — on a first read, this is actually a pretty complicated problem. Our next task is to understand this text and simplify it to the point where we can convert it into code, and a good intermediate step before jumping into code is to write some notes; we can use SQL's commenting feature for that. What I understand from this text is: users make purchases, and we're interested in users who make additional purchases thanks to this marketing campaign. How do we define an additional purchase? The fundamental sentence is this one: "users that made one or multiple purchases on the first day do not count" — so an additional purchase happens after the first day. And "nor do we count users that over time purchase only the products they purchased on the first day" — so the other condition is that it involves a product that was not bought on the first day. Finally, what we want is the number of these users. That should be a good start for writing the code.

So let's look at the marketing_campaign table again, and I remind you that each row represents a purchase.
purchase so what do we need to find First in this table so we want to compare purchases that happen on the first day with purchases that happen the following day so we need a way to count days and what do we mean first day and following days do we mean the first day that the shop was uh open no we actually mean the first day that the user ordered right because the user signs up does the first order and then after that the marketing campaign starts so we’re interested in numbering days for each user such that we know what purchases happened on the first day what purchases happened on the second day third day and so on and what can we use to run a numbering by user we can use a window function with a numbering function right so I can go to my marketing campaign table and I can select the user ID and the date in which they bought something and the product ID for now now I said that I need a window function so let me start and Define the window now I want to count the days within each user so I will actually need to Partition by user ID so that each row only looks at the rows that correspond to that same user and then there is an ordering right there is a a sequence from the first day uh in which the user bought something to the second and the third and so on so my window will also need an ordering and what column in my table can provide an ordering it is created at and then what counting function do I need to use here well the the way to choose is to say what happens when the same user made two two different purchases on the same date what do I want my function to Output do I want it to Output two different numbers as a simple count or do I want them want it to Output the same number and the answer is that I wanted to Output the same number because all of the purchases that happened on day one need to be marked as day one and all the purchases that have happened on day two need to be marked as day two and so on and so the numbering function that allows us to achieve 
this is RANK. If you remember, ranking works just like ranking the winners of a race: everyone who shares the same spot gets the same number, and that's what we want here. Let's see what this looks like, ordering by user_id and created_at. User 10 started buying on this date, they bought one product, and the rank is 1. Let's give this column a better name than "rank": we can call it user_day. So user 10 had their first user day on this date and bought one product; at a later date they had their second user day and bought another product; and then a third. User 14 started buying on this date, their first user day, buying product 109, and the same day they bought product 107, which is also marked as user day one; this is what we want. At a later date they bought another product, marked as user day three. Remember, with RANK you can jump from one to three, because the spot marked one has "eaten" the spot marked two; that's not an issue for this problem, so we're happy with this.

Going back to our notes, we're interested in users who made additional purchases, and "additional" means it happened after the first day. How can we identify purchases that happened after the first day? There's a simple solution: filter out rows with a user_day of one. All rows where user_day is one represent purchases the user made on their first day, so we can discard them and keep only the purchases that happened on the following days. But I don't have a way to filter on this window function directly: as you recall from the logical order of SQL operations, the window function happens late in the order, while the WHERE filter happens before it, so WHERE cannot be aware of what happens in the window function; HAVING also happens before it. So I need a different solution to filter on this field: a common table expression, which lets me break the query into two steps. I'll wrap this logic into a table called t1, or better, purchases, to be more meaningful. If I do SELECT * FROM purchases, the result doesn't change, but now I can use a WHERE filter and require that user_day is bigger than one, and there we have all the purchases that happened after each user's first day.

There's one last requirement to deal with: the purchase must also involve a product that the user didn't buy on the first day. So, for all rows that represent a purchase, I need to drop those involving a product_id the user bought on day one: if I find out that user 10 bought product 119 on day one, that purchase doesn't count and I'm not interested in it. How can I achieve this in code? I'm already getting all purchases that didn't happen on day one, and now I want another condition: AND product_id NOT IN the products this user bought on day one. These are all the filters I need: show me all purchases that happened after day one, and make sure the user didn't buy this product on day one. So I need to add a subquery. Before I do, let me give an alias to this table so I don't get confused when I refer to purchases again in the subquery: this first version we can call next_days, because we're only looking at purchases that happen after the first day, whereas in the subquery we also look at purchases but we're interested in those that happened on day one, so we can call it first_day, and use a WHERE filter to say that first_day.user_day must equal one. That's how we look at first-day purchases. We also need to make sure we're looking at the same user, so we add that first_day.user_id must equal next_days.user_id, which ensures we don't mix up users. And what do we need from the list of first-day purchases? The list of products.

Let me check that the query runs. It runs, no mistakes, so let's review the logic. We have purchases, which is a list of purchases with the added information of whether each one happened on day one, day two, day three, and so on. We take the purchases that happened after day one, get the list of products that this user bought on day one, and exclude those products from our final list. This is a correlated subquery, because it's a subquery that must run for every row and provides different results per row: for the first row we need the list of products user 10 bought on day one and to check that this product is not in it, and when we go to another row we need the list of products user 13 bought on day one and to check that 118 is not among them. The final step is to get the number of these users, so instead of SELECT * I write COUNT(DISTINCT user_id), and if I run this I get 23, which is indeed the right solution. So this is one way to solve the problem, and hopefully it's not too confusing, but if it is, don't worry: it is, after all, an advanced problem. If you click through to the official solution, I do think, however, that mine is a bit clearer than what StrataScratch provides.
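To make the walkthrough concrete, here is a small runnable sketch of the final query, using Python's sqlite3 (SQLite 3.25+ supports the same RANK() window function). The table and column names follow the video, but the rows are invented toy data, so the count here comes out as 1 rather than the 23 on StrataScratch.

```python
import sqlite3

# Toy reconstruction of the StrataScratch marketing_campaign table.
# Schema follows the video; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE marketing_campaign (
    user_id INTEGER, created_at TEXT, product_id INTEGER,
    quantity INTEGER, price INTEGER
);
INSERT INTO marketing_campaign VALUES
  (10, '2019-01-01', 101, 3, 55),  -- day-one purchase
  (10, '2019-01-02', 119, 1, 55),  -- new product after day one -> counts
  (11, '2019-01-01', 101, 1, 55),  -- two purchases, both on day one
  (11, '2019-01-01', 102, 1, 55),  --   -> does not count
  (12, '2019-01-01', 101, 1, 55),
  (12, '2019-01-03', 101, 2, 55);  -- same product as day one -> does not count
""")

result = conn.execute("""
WITH purchases AS (
    SELECT user_id, created_at, product_id,
           RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_day
    FROM marketing_campaign
)
SELECT COUNT(DISTINCT user_id)
FROM purchases AS next_days
WHERE user_day > 1
  AND product_id NOT IN (
        -- correlated subquery: day-one products of *this* user
        SELECT first_day.product_id
        FROM purchases AS first_day
        WHERE first_day.user_day = 1
          AND first_day.user_id = next_days.user_id
  )
""").fetchone()[0]
print(result)  # -> 1 (only user 10 bought a new product after day one)
```

Only user 10 qualifies: user 11 never bought after day one, and user 12 only re-bought a day-one product, which is exactly the behavior the two filters encode.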
StrataScratch's own solution is actually a bit weird, but that's ultimately up to you to decide, and I'm grateful to StrataScratch for providing problems like this one that I can solve for free.

Welcome to PostgreSQL Exercises, the website we will use to practice our SQL skills. I am not the author of this website or these exercises; the author is Alisdair Owens, who has generously created it for anyone to use, for free, with no sign-up needed: you can go there right away and start working. I believe it's a truly awesome website, in fact the best at what it does, and I'm truly grateful to Alisdair for making it available to all. The way the website works is pretty simple: you have a few categories of exercises; you select a section, and you get a list of exercises; you click on an exercise, and in the exercise view you have a question to solve, a representation of your three tables (we'll go into these shortly), and the expected results. In the text box you can write your answer and hit Run to see if it's correct; the results appear in the lower quadrant. If you get stuck you can ask for a hint, there are also a few keyboard shortcuts you can use, and after you submit your answer, or if you're completely stuck, you can view the answers and discussion. That's basically all there is to it.

Now let's have a brief look at the data, which is the same for all exercises. It describes a newly opened country club, and we have three tables. members represents the members of the club: their surname and first name, address, telephone, the date they joined, and so on. bookings records bookings: whenever a member books a facility, that event is stored in this table. Finally, facilities holds information about each facility: some tennis courts, badminton courts, massage rooms, and so on.

As you may know, this is a standard way of representing how data is stored in a SQL system: you have the tables, for each table the columns, and for each column its name and data type. The data type is the type of data allowed into that column; each column has a single data type, and you are not allowed to mix multiple data types within a column. We have a few different data types here, with their PostgreSQL names: an integer is a whole number like 1, 2, 3; a numeric is a floating-point number such as 2.5 or 3.2; character varying is the same as a string, a piece of text, and the number in round brackets, like (200), is the maximum number of characters you can put into it, so you cannot have a surname longer than 200 characters; and a timestamp represents a specific point in time. Those are all the data types we have here.

Finally, you can see that the tables are connected. In the bookings table, every row represents an event where a certain facility ID was booked by a certain member ID at a certain time for a certain number of slots. The facid field matches the facid field in facilities, and the memid field matches the memid field in members, so the bookings table connects to both of these tables. These logical connections will allow us to use joins to build queries that work across all three tables, and we shall see in detail how that works. Finally, we have an interesting arrow here which represents a self-relation, meaning that
the members table has a relation to itself. If you look here, this is very similar to an example I showed in my mental models course: each member can have a recommendedby field, which is the ID of another member, the member who recommended them into the club. This means you can join the members table to itself to get, at the same time, information about a specific member and about the member who recommended them; we shall see that in the exercises.

The exercises run on PostgreSQL, one of the most popular open-source SQL systems out there. PostgreSQL is a specific dialect of SQL with some minor differences from other dialects, such as MySQL or the GoogleSQL used by BigQuery, but it is mostly the same as the others: if you've learned SQL with another dialect, you're going to be just fine. PostgreSQL does have a couple of quirks you should be aware of, but I will address them specifically as we solve the exercises.

Now, if you want to rock these exercises, I recommend keeping in mind the logical order of SQL operations. This is a chart I introduced and explained extensively in my mental models course, where we start with the chart mostly empty and add one element at a time, making sure we understand it in detail. I won't go in depth on it now, but in short, the chart represents the logical order of SQL operations. These are all the components we can assemble to build our SQL queries, like our Lego building blocks for SQL, and when assembled they run in a specific order: the chart goes from top to bottom, so first you have FROM, then WHERE, and then all the others. There are two very important rules: each operation can only use data produced above it, and an operation knows nothing about data produced below it. If you keep this in mind and keep the chart as a reference, it will greatly help you with the exercises, and as I solve them you will see that I put a lot of emphasis on coming back to this order, and on thinking in this order to write effective queries.

Let's now jump in and get started with our basic exercises. The first exercise is "Retrieve everything from a table": how can I get all the information I need from the facilities table? All my data is represented here, so I can check where to find what I need. As I write my query, I aim to always start with the FROM part. Why start with FROM? First of all, it is the first component that runs in the logical order: if I go back to my chart, the FROM component is first, and that makes sense, because before I do any work I need to get my data, so I need to tell SQL where it is. In this case, the data is in the facilities table. Next, I need to retrieve all the information from this table: I'm not going to drop any rows, and I'm going to select all the columns, so I can simply write SELECT *, and if I hit Run I get the result I need; it fits the expected results. The star is a shortcut for saying "give me all the columns of this table": I could have listed each column in turn, but instead I took the shortcut and used the star.

"Retrieve specific columns from a table": I want to print a list of all the facilities and their cost to members. As always, let's start with the FROM part: the data is in the facilities table again. The question is actually not super clear, but luckily I can check the expected results: I need two columns from this table, name and membercost. To get those two columns I can write SELECT name
, membercost. I hit Run and get the result I need. So if I write SELECT * I get all the columns of the table, but if I write the names of specific columns separated by commas, I get only those columns.

"Control which rows are retrieved": we need a list of facilities that charge a fee to members. We know we're working with the facilities table, and now we need to keep certain rows and drop others: only the rows that charge a fee to members. Which component can we use for this? If I go back to my components chart, right after FROM we have the WHERE component, and WHERE is used to drop rows we don't need. So after getting the facilities table, I can say WHERE membercost is bigger than zero, meaning they charge a fee to members, and finally get all the columns.

"Control which rows are retrieved, part two": like before, we want the list of facilities that charge a fee to members, but our filtering condition is now a bit more complex, because we need that fee to be less than 1/50th of the monthly maintenance cost. I copied over the code from the last exercise: we get the data from the facilities table and filter for membercost bigger than zero, and now we add a new condition, that the fee, membercost, is less than 1/50th of the monthly maintenance cost, so I can take monthlymaintenance and divide it by 50, and I have my condition. When I have multiple logical conditions in the WHERE, I need to link them with a logical operator so SQL can figure out how to combine them, because the final result of all my conditions needs to be a single value, either true or false. In my mental models course I introduced the Boolean operators and how they work, so you can go there for more detail, but can you figure out which logical operator we need here to chain these two conditions, as suggested in the question? The operator we need is AND: with AND, both conditions must be true for the whole expression to evaluate to true and for the row to be kept; all other rows are discarded. To complete the exercise I just need to select a few specific columns, because we don't want to return all the columns here. I'll cheat a bit by copying them from the expected results, but normally you would look at the table schema and figure out which columns you need. That completes our exercise.

"Basic string searches": produce a list of all facilities with the word "Tennis" in their name. Where is the data we need? In the cd.facilities table. Next question: do I need all the rows from this table, or do I need to filter some out? I only want facilities with "Tennis" in their name, so clearly I need a filter, therefore the WHERE statement. How can I write it? I need to check the name and keep only facilities that have "Tennis" in it, so I can use LIKE. The % wildcards signify that we don't care what precedes "Tennis" and what follows it; it could be zero or more characters before and after; we just check that "Tennis" appears in the name. Finally, we select all the columns for these facilities, and that's our result. Beware, as I said before, of your use of quotes: what we have here is a string, a piece of text that allows you to do your match, therefore you need single quotes. If, as is likely to happen, you used double quotes, you would get an error telling you that the column "Tennis" does not exist, because double quotes are used to represent column names, not pieces of text, so be careful with that.
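As a quick illustration of the wildcard behavior, here is a sketch using Python's sqlite3 with a tiny made-up facilities table (the course itself runs on PostgreSQL; note that SQLite, unlike Postgres, quietly accepts double-quoted strings, so the double-quote pitfall is noted only in a comment).

```python
import sqlite3

# Toy stand-in for cd.facilities; names invented to resemble the dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facilities (facid INTEGER, name TEXT);
INSERT INTO facilities VALUES
  (0, 'Tennis Court 1'), (1, 'Tennis Court 2'),
  (3, 'Table Tennis'),   (4, 'Massage Room 1');
""")

# '%' matches zero or more characters, so the pattern finds 'Tennis'
# anywhere in the name, not only at the start.
matches = [r[0] for r in conn.execute(
    "SELECT name FROM facilities WHERE name LIKE '%Tennis%'")]
print(matches)  # -> ['Tennis Court 1', 'Tennis Court 2', 'Table Tennis']

# In PostgreSQL, writing LIKE "%Tennis%" (double quotes) would fail:
# double quotes denote identifiers (column names), not string literals.
```

Note that 'Table Tennis' matches too, because the leading % allows text before the word.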
"Matching against multiple possible values": can we get the details of facilities with IDs 1 and 5? Where is my data? In the facilities table, and I need only certain rows: those with IDs 1 and 5, so I need a WHERE statement. What are my conditions? facid equals 1, and facid equals 5. Which operator do I use to chain them? The OR operator, because only one of them needs to be true for the whole expression to evaluate to true, and in fact only one of them can be true: it's impossible for a facility's ID to equal 1 and 5 at the same time, so the AND operator would not work, and what we need is OR. Finally, we need all the data, meaning all the columns, about these facilities, so I use SELECT *. The problem is now solved, but imagine that tomorrow we need this query again and have to include another ID, say 10. We could append OR facid = 10, but this becomes a bit unwieldy: imagine having a list of ten IDs and writing OR every time; it's not very scalable as an approach. As an alternative, we can say facid IN and then list the values, like 1 and 5. If I make this my condition, I again get the solution, but this is a more elegant and more scalable approach, because it's much easier to come back and insert other IDs into the list; so this is the preferred solution here. Logically, what IN is doing is looking at the facid of each row and checking whether that ID is included in the list: if it is, it returns true, therefore it keeps the row; if not, it returns false, therefore it drops the row. We shall see a bit later that the IN notation is also powerful because, while in this case we have a static list of IDs (we know we want IDs 1 and 5), in more advanced use cases we could provide a subquery instead of a static list, a SQL query that dynamically retrieves a list we can then use in our query; we'll see that in later exercises.

"Classify results into buckets": produce a list of facilities and label them "cheap" or "expensive" based on their monthly maintenance. We want to get our facilities; do we need a filter, do we need to drop certain rows? No, we actually don't: we want all facilities, and then we want to label them. We select the name of the facility, and then we need to provide the label. What SQL statement can we use to provide a text label according to the value of a certain column? What we need here is a CASE statement, which implements conditional logic, a branching, similar to the if/else statements in other programming languages: if the monthly maintenance cost is more than 100 then it's "expensive", otherwise it's "cheap". This calls for a CASE statement. I always start with CASE and end with END, and I write both at the beginning so I don't forget them. Then, for each condition, I write WHEN: the condition I'm interested in is monthlymaintenance being above 100, and in that case I output a piece of text that says 'expensive' (remember, single quotes for text). Next I could write the following condition explicitly, but if it's not above 100 then it's 100 or less, so all I need here is an ELSE, and in that case I output the text 'cheap'. Finally, I have a new column and I can give it a label: I can call it cost, and I get my result. Whenever you need to put values into buckets, or label values according to certain rules, that's usually when you need a CASE statement.
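The bucketing logic above can be sketched end to end; this toy version uses Python's sqlite3 with invented maintenance figures (the CASE syntax is the same in PostgreSQL).

```python
import sqlite3

# Toy cd.facilities with made-up monthly maintenance values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facilities (name TEXT, monthlymaintenance INTEGER);
INSERT INTO facilities VALUES
  ('Tennis Court 1', 200),
  ('Table Tennis', 10),
  ('Massage Room 1', 3000);
""")

# CASE runs per row: the first WHEN that matches wins, ELSE is the fallback.
rows = conn.execute("""
    SELECT name,
           CASE WHEN monthlymaintenance > 100 THEN 'expensive'
                ELSE 'cheap'
           END AS cost
    FROM facilities
""").fetchall()
print(rows)
# -> [('Tennis Court 1', 'expensive'), ('Table Tennis', 'cheap'),
#     ('Massage Room 1', 'expensive')]
```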
Working with dates: let's get a list of members who joined after the start of September 2012. Looking at these tables, where is our data? In the members table, so I start writing that. Do I need to filter this table? Yes: I only want to keep members who joined after a certain time. How can I run this condition on the table? I can say WHERE joindate >= '2012-09-01'. Luckily, in SQL and in PostgreSQL, filtering on dates is quite intuitive: even though we have a timestamp that represents a specific moment in time down to the second, we can write "bigger or equal" (equal as well, because we also want to include those who joined on the first day), specify just the date, and SQL will fill in the remaining values, and the filter works. Next we want a few columns for these members, so I copy them into the SELECT, and this solves our query.

"Removing duplicates, and ordering results": we want an ordered list of the first 10 surnames in the members table, and the list must not contain duplicates. Let's start by getting our table, members. If I select the surnames, I see that some surnames are shared by several members, so there are duplicates here. What can we do in SQL to remove duplicates? We have seen in the mental models course that we have the DISTINCT keyword, which removes all duplicate rows based on the columns we have selected: if I run this again, I no longer see any duplicates. Now the list needs to be ordered alphabetically, as I see in the expected results, and we can do that with the ORDER BY statement: when you use ORDER BY on a piece of text, the default behavior is alphabetical order, and if I were to use descending, it would be reverse alphabetical order; that's not what I need, so I keep the default, and now I see that they are ordered. Finally, I want the first 10 surnames. How can I return the first 10 rows of my result? With the LIMIT statement: if I say LIMIT 10, I get the first 10 surnames, and since I have ordered alphabetically, they are the first 10 in alphabetical order; this is my result. Going back to our map: we have FROM, which gets a table; WHERE, which drops rows we don't need from that table; then, all the way down, SELECT, which gets the columns we need; then DISTINCT, which needs to know which columns we selected because it drops duplicates based on those columns (in this example we take a single column, surname, so it drops duplicate surnames); then, when all the processing is done, we can order our results; and once they are ordered, we can apply LIMIT to cap the number of rows we return. I hope this makes sense.

"Combining results from multiple queries": let's get a combined list of all surnames and all facility names. Where are the surnames? In cd.members, and from cd.members I can select surname, which gives me the list of all surnames. Where are the facility names? In cd.facilities, and SELECT name FROM cd.facilities gives me the list of all facilities. Now we have two distinct queries, each producing a column of text values, and we want to combine them: we want to stack them on top of each other. How does that work? If I just hit Run like this, I get an error, because I have two distinct queries here and they're not connected in any way. But when I have two or more queries defining tables and I want to stack them on top of each other, I can use the UNION statement: with UNION, all the surnames are stacked vertically with all the names, and I get a single list containing both columns. As I mentioned in the mental models course, a plain UNION typically means UNION DISTINCT (other systems like BigQuery don't even allow a bare UNION; they want you to write UNION DISTINCT explicitly), and what it does is that, after stacking the two tables together, it removes all duplicate rows. The alternative is UNION ALL, which does not do this: it keeps all the rows, and since we have some duplicate surnames, they show up, and it doesn't fit our expected result; but with a plain UNION you get UNION DISTINCT, and no duplicates. Looking at our map for the logical order of SQL operations: we get the data from a certain table, filter it, do all sorts of operations on it, select the columns we need, and remove the duplicates from that one table; what comes next is that we can combine this table with other tables, telling SQL to stack this table on top of another, and this is where UNION comes into play. Only after we have combined all the tables, after we have stacked them all up on top of each other, can we order the results and limit the results. Also remember, as I showed in detail in the mental models course, that when I combine two or more tables with a UNION, they must have the exact same number of columns, and the corresponding columns must have the same data type. In this case both tables have one column, and that column is text, so the union works; if I were to add another column, an integer, to just one of them, it would not work, because the union queries must have the same number of columns, and I would get an error; but if I were to add an integer column in the second position in both tables, it would work again, because the column counts and the data types match.
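The UNION vs UNION ALL distinction is quick to verify on toy tables; here is a sketch with Python's sqlite3 (SQLite, like PostgreSQL, treats a bare UNION as UNION DISTINCT; the rows are made up).

```python
import sqlite3

# Two tiny tables with one text column each, so they can be stacked.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (surname TEXT);
CREATE TABLE facilities (name TEXT);
INSERT INTO members VALUES ('Smith'), ('Smith'), ('Jones');
INSERT INTO facilities VALUES ('Tennis Court 1');
""")

# Bare UNION stacks the two columns, then removes duplicate rows.
union = conn.execute(
    "SELECT surname FROM members UNION SELECT name FROM facilities").fetchall()
# UNION ALL stacks them and keeps every row, duplicates included.
union_all = conn.execute(
    "SELECT surname FROM members UNION ALL SELECT name FROM facilities").fetchall()

print(len(union), len(union_all))  # -> 3 4
```

The duplicate 'Smith' survives only under UNION ALL, which is why the pgexercises answer uses a plain UNION.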
"Simple aggregation": I need the signup date of my last member. I need to work with the members table, which has a joindate field, and I need the latest value of this date, the time when a member last joined. How can I do that? I can take the joindate field and run an aggregation on top of it. What is the correct aggregation in this case? It is MAX, because when it comes to dates, MAX takes the latest date, whereas MIN takes the earliest. I can label this as latest and get the result I need. Now, how do aggregations work? They are functions: you write the name of the function and provide the arguments in round brackets; the first argument is always the column on which to run the aggregation. What an aggregation does is take a list of values, whether ten, a hundred, a million, it doesn't matter, and compress that list into a single value; in this case, taking all the dates and returning the latest one. To place this on our map: we get the data from the table, we filter it, and sometimes we do a grouping, which we shall see later in the exercises; but whether we do grouping or not, this is where aggregations happen, and if we haven't done any grouping, the aggregation works at the level of all the rows. So, in the absence of grouping, as in this case, the aggregation looks at all the rows in my table (except the rows I filtered away) and compresses them into a single value.

"More aggregation": we need the first and last name of the last member who signed up, not just the date. In the previous exercise we saw that SELECT MAX(joindate) FROM members gives the date when the last member signed up, so given that I want the first and last name, you might think you can just add firstname and surname here. But this doesn't work: it gives an error saying that the column firstname must appear in the GROUP BY clause or be used in an aggregate function. The meaning behind this error, and how to avoid it, is described in detail in the GROUP BY section of the mental models course, but the short version is this: with the aggregation you're compressing joindate to a single value, but you're doing no such compression or aggregation for firstname and surname, so SQL is left with the instruction to return a single value for one column and multiple values for the others, and that does not work, because all columns need to have the same number of values; so it throws an error. What we really need is to take this maximum join date and use it in a WHERE filter, because we only want to keep the row that corresponds to the latest joindate: take the members table, get the row where joindate equals the maximum joindate, and select the name and surname from it. Unfortunately, this also doesn't work: as we saw in the course, you are not allowed to use aggregations inside WHERE, so no MAX inside WHERE. The reason is actually pretty clear: aggregations happen at a later stage in the process, and they need to know whether a GROUP BY has occurred, that is, whether they must run over all the rows of the table or only within the groups defined by the GROUP BY; at the WHERE stage the grouping hasn't happened yet, so we don't know at which level to execute the aggregations, and because of this we are not allowed to use aggregations inside the WHERE statement. So how can we solve the problem? Well, a sort of cheating solution: if we knew the exact
value of join date we could place it here and then our filter would work we’re not using an aggregation and we could put join date in here to display it as well and that would would work however this is a bit cheating right because um the maximum join date is actually a dynamic value it will change with time so we don’t want to hardcode it we want to actually um compute it but because this is not allowed what we actually need is a subquery and the subquery is a SQL query that runs within a query to return a certain result and we can have a subquery by opening round brackets here and write writing a a query and in this query we need to go to the members table and select the maximum join date and this is our actual solution so in this execution you can imagine that SQL will go here inside the subquery run this get the maximum jointed place it in the filter uh keep only the row for the latest member who has joined and then retrieve what we need about this member let us now move to the joints and subqueries exercises the first exercise retrieve the start times of members bookings now we can see that the information we need is spread out into tables because we want the start time for bookings that and that information is in the bookings table but we want to filter to only get members named David farel and the name of the member is contained in the members table so because of that we will need a join so if we briefly look at the map for the order of SQL operations we we can see here that from and join are really the same uh step um and how this works is that in the from statement sometimes uh all my data is in one table and then I just provide the name of that table but sometimes I need to combine two or more different tables in order to get my data and in that case I would use the join but everything in SQL works with tables right so when I when I take two or more tables and combine them together at the end all I get is just another table and this is why from and join 
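The MAX-in-a-subquery solution from the previous exercise can be sketched like this, again with an in-memory SQLite database and made-up member rows standing in for the course's PostgreSQL table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (firstname TEXT, surname TEXT, joindate TEXT)")
conn.executemany("INSERT INTO members VALUES (?, ?, ?)", [
    ("Tracy", "Smith", "2012-07-02"),
    ("Darren", "Smith", "2012-08-10"),
    ("Gerald", "Butters", "2012-09-01"),
])

# Aggregations are not allowed inside WHERE, so MAX runs in a subquery
# whose single returned value is then used by the outer filter.
row = conn.execute("""
    SELECT firstname, surname
    FROM members
    WHERE joindate = (SELECT MAX(joindate) FROM members)
""").fetchone()
print(row)  # ('Gerald', 'Butters')
```

The subquery runs first, collapses the join dates to one value, and the outer query keeps only the matching row, exactly the two-step execution described above.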
are actually the same component, the same step. So as usual, let us start with the FROM part: we need to take the bookings table and join it to the members table, and I can give an alias to each table to make my life easier, so I will call this book and I will call this mem. Then I need to specify the logical condition for joining these tables, and the logical condition is that the member ID column in the bookings table is really the same thing as the member ID column in the members table. Concretely, you can imagine SQL going row by row in the bookings table, looking at the member ID, and then checking whether this member ID is present in the members table; if it's present, it combines the current row from bookings with the matching row from members, does this with all the matching rows, and then drops the rows which don't have a match. We saw that in detail in the mental models course, so I'm not going to go in depth into it. Now that we have our table, which comes from the join of members and bookings, we can properly filter it, and what we want is that the first name column is David in the column which comes from the members table, right? So mem. before first name indicates the parent table and then the column name, and the surname is equal to Farrell, and remember to use single quotes when writing pieces of text. This is a WHERE filter: we have two logical conditions, and we use the operator AND because both of them need to be true. So now we have filtered our data, and finally we need to select the start time, and that's our query. Now remember that when we use JOIN in a query, what's implied is that we are using an INNER JOIN. There are several types of join, but the inner join is the most common, so it's the default one, and what inner join means is that, from the two tables we're joining, it returns only the rows that have a match, and all the rows that don't have a match are dropped. So if there's a row in bookings with a member ID that doesn't exist in the members table, that row will be dropped, and conversely, if there's a row in the members table with a member ID that is not referenced in the bookings table, that row will also be dropped. That's an inner join.

Work out the start times of bookings for tennis courts: we need to get the facilities that are actually tennis courts, and then for each of those facilities we'll have several bookings, and we need to get the start times for those bookings, and they should be on a specific date. We know that we need the data from these two tables, because the name of the facility is here but the data about the bookings is here, so I will go FROM cd.facilities JOIN cd.bookings ON... what are the fields that we can logically join on? Let me first give an alias to these tables: this one I will call facs and this one I will call book, and what I need to say is that the facility ID matches on both sides. Now we can work on our filters. First of all, I only want to look at tennis courts, and if you look at the expected result, it means that in the name of the facility we want to see tennis, so we can filter on string patterns, on text patterns, by
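The David Farrell query built above can be sketched end to end; this is a minimal SQLite stand-in with invented bookings, not the course's real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members  (memid INTEGER, firstname TEXT, surname TEXT);
CREATE TABLE bookings (memid INTEGER, starttime TEXT);
INSERT INTO members  VALUES (1, 'David', 'Farrell'), (2, 'Tracy', 'Smith');
INSERT INTO bookings VALUES (1, '2012-09-18 09:00:00'),
                            (2, '2012-09-18 10:00:00'),
                            (1, '2012-09-18 13:30:00');
""")

# Inner join on the shared member ID, then filter on the member's name.
# Tracy's booking is dropped by the WHERE; unmatched rows would be
# dropped by the inner join itself.
rows = conn.execute("""
    SELECT book.starttime
    FROM bookings AS book
    JOIN members  AS mem ON book.memid = mem.memid
    WHERE mem.firstname = 'David' AND mem.surname = 'Farrell'
    ORDER BY book.starttime
""").fetchall()
print(rows)  # [('2012-09-18 09:00:00',), ('2012-09-18 13:30:00',)]
```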
using the LIKE operator. I can take the facility's name and match it with LIKE against tennis, and the percent signs are wildcards, which means that tennis can be preceded and followed by zero or more characters; we don't care, we just want the strings that have tennis in them. But that's not enough as a condition: we also need the booking to have happened on a specific date, so I will put an AND here, because we're providing two logical conditions and they both need to be true. Then I can take the start time from the bookings table and say that it should be equal to the date provided in the instructions, because I want the booking to have happened on this particular date. However, this will not work, and I can actually complete the query and show you that it will not work, because here we get zero results. Can you figure out why this command did not work? Now I'm going to write a few comments here, and this is how you write them; they are just pieces of text, they are not actually executed as code, and I'll use them to show you what's going on. The value for start time looks like this: it is a timestamp, showing a specific point in time. But the date that we are providing for the comparison looks like this: something less granular, because we're not showing any of the data about hour, minute, and second. Now, in order to compare these two different things, SQL automatically fills in this date over here, and since there's nothing there, it puts zeros in, and once it has made this extension, it actually compares them. When you look at this comparison between these two elements, it is false, because the hour is different. So when we write this filter, SQL looks at every single start time and compares it with this value over here, which is the very first moment of that date, but there's no start time that is exactly like this one, so the condition is always false, and thus we get zero rows in our result. So what is the solution? Before comparing a start time from the data, we can pass it to the date function, and if I take my example here and put it into the date function, it drops the extra information about hour, minute, and second and keeps only the information about the date. So if I pass start time to the date function before comparing it to my reference date, this value becomes this one, and then when I compare it with my reference date, the condition is true. All this to say that before we compare start time with our reference date, we need to reduce its granularity to a date. If I run the query now, I actually get my start times, and after this I just need to add the name, and finally I need to order by time, so I order by book start time. There is still a small error here, and sometimes you just have to look at what you get versus what's expected: if you notice, we are returning data about the table tennis facility, but we're actually only interested in tennis courts. What are we missing? The string filter is not precise enough, and we need to change it to tennis court, and now we get our results.

Produce a list of all members who have recommended another member: if we look at the members table, we have all these data about each member, and then we know if they were recommended by another member; recommended by is the ID of the member who recommended them. Because of this, the members table, like we said, has a relation to itself, because one of its columns references its own ID column. So let's see how to put this into practice. To be clear, I simply want a list of members who appear to have recommended another member,
so if I wanted just the IDs of these people, my task would be much simpler: I would go to the members table, select recommended by, and put a DISTINCT in here to avoid repetitions, and what I would get is the IDs of all members who have recommended another member. However, the problem does not want this, because the problem wants the first name and surname of these people, and in order to get those, I need to plug each ID back into the members table and get the data there. For example, if I went to the members table and selected everything where the member ID is 11, I would get the data for this first member, but now I need to do this for all members. So what I will have to do is take the members table and join it to itself: the first time I take the table, I'm looking at the members quite simply, but the second time I take the members table, I'm looking at data about the recommenders of the members, so I will call this second instance recs. Both of these come from the same table, but they are now two separate instances. And what is the logic for joining these two tables? The members table has this recommended by field, and we take the ID from recommended by and plug it back into the table, into member ID, to get the data about the recommenders. Now we can go into the recommenders instance, which we got by plugging in that ID, and get their first name and surname. I want to avoid repetition, because a member may have recommended multiple members, so I will put a DISTINCT to make sure that I don't get any repeated rows at the end, and then finally I can order by surname and first name, and I get my result. I encourage you to play with this and experiment a bit until it is clear, and in my mental models course I go in depth into the self join and do a visualization in Google Sheets that also makes it much clearer.

Produce a list of all members along with their
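The self-join just described can be sketched like this; the four member rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (memid INTEGER, firstname TEXT, surname TEXT,
                      recommendedby INTEGER);
INSERT INTO members VALUES
  (1, 'Tracy',  'Smith',    NULL),
  (2, 'Darren', 'Smith',    1),
  (3, 'Gerald', 'Butters',  1),
  (4, 'Janice', 'Joplette', 2);
""")

# Self-join: the second instance of members ("recs") carries data about
# the recommenders; DISTINCT removes repeats when one member (here Tracy,
# memid 1) recommended several people.
rows = conn.execute("""
    SELECT DISTINCT recs.firstname, recs.surname
    FROM members AS mems
    JOIN members AS recs ON mems.recommendedby = recs.memid
    ORDER BY recs.surname, recs.firstname
""").fetchall()
print(rows)  # [('Darren', 'Smith'), ('Tracy', 'Smith')]
```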
recommender. Now, if we look at the members table, we have a few columns and then we have the recommended by column, and sometimes we have the ID of another member who recommended this member; it can be repeated, because the same member may have recommended multiple people, and sometimes it is empty, and when it is empty we have a NULL in here, which is the value that SQL uses to represent absence of data. Now let us count the rows in members: you might know that to count the rows of a table we can use a simple aggregation, COUNT(*), and we get 31. Let's just make a note that members has 31 rows, because the result should list all members, so we must ensure that we return 31 rows in our result. Now I'm going to delete this SELECT, and as before I want to go, for each member, check the ID they have here in recommended by, and plug it back into member ID in the table so I can get the data about the recommender as well, and I can do that with a self join. So let me take members and join it on itself: the first time I will call it mems and the second time recs, and the logic for joining is that mems recommended by is connected to recs member ID; this takes the ID in the recommended by field and plugs it back into member ID to get the data about the recommender. Now what do I want from this? I want the first name and surname of the member, and then the first name and surname of the recommender. Great, it's starting to look like the right result, but how many rows do we think we have here? In order to count the rows, I can do SELECT COUNT(*) FROM, and if I simply take this query and enclose it in brackets, it becomes a subquery. Ah, the subquery must have an alias, so I can give it an alias like this, and I get 22. How this works is that first SQL computes the content of the subquery, which is the table that we saw
before, and we need to assign it an alias, otherwise it doesn't work; this varies a bit by system, but in Postgres you need to do this, so we call it simply t1, and then we run a COUNT(*) on this table to get the number of rows, and we see that the result has 22 rows. This is an issue, because we saw before that members has 31 rows, and since we want to return all of the members, our result should also have 31 rows. So can you figure out why we are missing some rows? The issue is that we are using an inner join: remember, when we don't specify the type of join, it's an inner join, and what does an inner join do? It keeps only the rows that have matches. We saw before that in members this field is sometimes empty, it has a NULL value, because maybe the member wasn't recommended by anyone, maybe they just applied themselves. And what happens when we use this in an inner join and it has a NULL value? The row for that member is dropped, because obviously it cannot have a match with member ID; NULL cannot match with any ID, so that row is dropped and we lose it. However, that's not what we want, therefore instead of an inner join we need a left join here. The left join looks at the left table, the table that is to the left of the JOIN command, and makes sure to keep all the rows in that table, even the rows that don't have a match; it will not drop those rows, it will just put a NULL in the values that correspond to the right table. And if I run the count again, I get 31, so now I'm keeping all the members and I have the number of rows that I need. Now I can get rid of all of this, because I know I have the right number of rows, and I can put my selection over here, and it would help if we could make this a bit more ordered and assign aliases to the columns, so I will follow the expected results here and call these
mem firstname, mem surname, rec firstname, and rec surname. Now we have the proper labels, and you can see that we always have the name of the member, but some members weren't recommended by anyone, and therefore for the first and last name of the recommender we simply have NULL values; this is what the left join does. The last step is to order, and we want to order by the last name and the first name of each member, and we finally get our result. So typically you use the inner join, which is the default join, because you're only interested in the rows from both tables that actually have a match, but sometimes you want to keep all the data from one table, and then you put that table on the left side and do a left join, as we did in this case.

Produce a list of all members who have used a tennis court: for this problem we need to combine data from all our tables, because we need to look at the members, we need to look at their bookings, and we need to check the name of the facility for their bookings. As always, let us start with the FROM part and join together all of these tables: cd.facilities on facility ID, and then I also want to join on members, and that is my join. We can always join two or more tables; in this case we're joining three tables, and how this works is that the first join creates a new table, and then this new table is joined with the next one over here, and this is how multiple joins are managed. Now I have my table, which is the join of all of these tables, and we're only interested in members who have used a tennis court: if a member has made no bookings, we're not interested in that member, so it's okay to have a plain JOIN and not a LEFT JOIN; and for each booking we want to see the name of the facility, and if there were a booking without a facility name, we wouldn't be interested in that booking anyway, so this join can also be an inner join and
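The inner-join-versus-left-join row counts from the recommenders exercise can be sketched in miniature (three made-up members instead of the course's 31):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (memid INTEGER, firstname TEXT, surname TEXT,
                      recommendedby INTEGER);
INSERT INTO members VALUES
  (1, 'Tracy',  'Smith',   NULL),
  (2, 'Darren', 'Smith',   1),
  (3, 'Gerald', 'Butters', 1);
""")

# The inner join drops Tracy, whose recommendedby is NULL (NULL matches no ID).
inner = conn.execute("""
    SELECT COUNT(*) FROM members AS mems
    JOIN members AS recs ON mems.recommendedby = recs.memid
""").fetchone()[0]

# The left join keeps every row of the left table, filling the recommender
# columns with NULL where there is no match.
left = conn.execute("""
    SELECT COUNT(*) FROM members AS mems
    LEFT JOIN members AS recs ON mems.recommendedby = recs.memid
""").fetchone()[0]

print(inner, left)  # 2 3
```

This mirrors the 22-versus-31 discrepancy in the exercise: the difference is exactly the members whose recommended by field is NULL.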
doesn’t need to be a left join this is how you can think about whether to have a join or left join now we want the booking to include a tennis court so we can filter on this table and we will look at the name of the facility and uh make sure that it has tennis court in it with the like operator and now that we have filtered we can get the first name and the surname of the member and we can get the facility name so here we have a starting result now in the expected result we have merged the first name and the surname into a single string and um in SQL you can do this with a concatenation operator which is basically taking two strings and putting them together into one string now if I do this here I will get um something like this and so this looks a bit weird and what I want to do here is to add an empty space in between and again concatenate it and now the names will look uh will look fine I also want to label this as member and this other column as facility to match the expected results next I need to ensure that there is no duplicate data so at the end of it all I will want to have distinct in order to remove duplicate rows and then I want to order the final result by member name and facility name so order by member and then facility and this will work because the order bu coming second to last coming at the end of our logical order of SQL operations over here the order by is aware of the alas is aware of the label that I have that I have put on the columns and here I get the results that I needed not a lot happening here to be honest it’s just that we’re joining three tables instead of two but it still works um just like uh any other join and then concatenating the strings filtering according to the facility name and then removing duplicate rows and finally ordering produce a list of costly bookings so we want to see all bookings that occurred in this particular particular day and we want to see how much they cost the member and we want to keep the bookings that 
cost more than $30. Clearly in this case we also need information from all the tables, because if you look at the expected results, we want the name of the member, which is in the members table, the name of the facility, which is in the facilities table, and the cost, for which we need the bookings table. So we start with a join of these three tables, and since we already did that in the last exercise, I have copied the code for that join (if you want more detail on it, go and check the last exercise), as well as the code to get the name of the member by concatenating strings, plus the name of the facility. Now we need to calculate the cost of each booking. How does it work, looking at our data? We have here a list of bookings, and a booking is defined as a number of slots, where a slot is a 30-minute usage of that facility. Then we also have the member ID, which tells us whether the person is a guest or a member: if the member ID is zero, that person is a guest, otherwise that person is a member. I also know the facility that this person booked, and if I go and look at the facility, it has two different prices: one price is for members, the other price is for guests, and the price applies per slot. So we have all of the ingredients we need for the cost in our join, and to convince ourselves of that, let us actually select them here: in bookings I can see facility ID, member ID, and slots, and in facilities I can see the member cost and the guest cost, and that's all I need to calculate the cost. As you can see, after the join I'm in a really good position, because for each row I have all of these values side by side, so now I just have to figure out how to combine them in order to get the cost. The way I can get the cost is to look at the number of slots and multiply it by the right
cost, which is either the member cost or the guest cost. And how do I know which of these to pick? It depends on the member ID: if the member ID is zero, I use the guest cost, otherwise I use the member cost. So let me go back to my code: after this I can say I want to take the slots and multiply them by either member cost or guest cost, but how can I put some logic in here that chooses one of the two based on the ID of this person? Whenever I have such a choice to make, I need a CASE statement. I can start the CASE statement here, and I will already write the END of it so that I don't forget it, and in the CASE statement, what do I need to check? I check whether the member ID is zero, in which case I use the guest cost, and in all other cases I use the member cost. So I'm taking slots and using this CASE WHEN to decide which column to multiply by, and this is actually my cost. Now let's take a look at this... and I get this error: the column reference member ID is ambiguous. Can you figure out why I got this error? What's happening is that I have joined multiple tables, and the member ID column now appears twice in my join, so I cannot refer to it just by name, because SQL doesn't know which column I want; I have to reference the parent of the column every time I use it. Here I will say that it comes from the bookings table, and now I get my result. If I look here, I can see that I have successfully calculated my cost. Let's check the first row: the member ID is not zero, therefore it's a member, and here the member cost is zero, meaning that this facility is free for members, so regardless of the slots, the cost will be zero. And let's look at one who is a guest: this one is clearly a guest, they have taken one slot, and the member cost is zero, so it's free for members, but it costs five per slot
for guests, so the total cost is five. Based on this sanity check, the cost looks good. Now I need to actually filter my table, because we should consider only bookings that occurred on a certain day, so after creating my new table with the join, I can write a WHERE filter to drop the rows I don't need, and I can say that the start time, the time column I'm interested in, needs to be equal to this date over here. We have seen before that this will not work, because start time is a timestamp that also shows hour, minute, and seconds, whereas this is just a date, so the comparison would fail; before I do the comparison, I need to reduce the start time to a date, so that I'm comparing apples to apples on the time. A quick check that this didn't break anything: we should now have significantly fewer rows. Now what we need to do is keep only rows with a cost higher than 30. Can I go here and say AND cost greater than 30? No, I cannot: column cost does not exist. A typical mistake. If you look at the logical order of SQL operations, first you have the sourcing of the data, then you have the WHERE filter, and only after that does all of the logic by which we calculate the cost happen, and the label cost happens there as well. So we cannot filter on the column cost, because the WHERE component has no idea about it. What we can do instead is take all of the logic we've written until now, wrap it in round brackets, and introduce a common table expression called t1: I say WITH t1 AS, and then I can go FROM t1, and now I can use my filter, cost greater than 30, and SELECT * from this table. I'm starting to get somewhere, because the cost has been successfully filtered. Now I have a lot of columns that I don't want in my final result, which I only used to help me reason about the cost, so I want to keep member and facility, but I don't want to keep any of these.
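The CASE WHEN cost calculation and the CTE trick for filtering on it can be sketched together; the single facility and three bookings are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facilities (facid INTEGER, name TEXT,
                         membercost REAL, guestcost REAL);
CREATE TABLE bookings (facid INTEGER, memid INTEGER, slots INTEGER);
INSERT INTO facilities VALUES (0, 'Tennis Court 1', 5, 25);
INSERT INTO bookings VALUES (0, 0, 2),   -- guest:  2 * 25 = 50
                            (0, 4, 3),   -- member: 3 * 5  = 15
                            (0, 0, 1);   -- guest:  1 * 25 = 25
""")

# CASE WHEN picks the right per-slot price for each row; wrapping the query
# in a CTE (t1) makes the computed "cost" label visible to a later WHERE,
# which the inner query's own WHERE could not see.
rows = conn.execute("""
    WITH t1 AS (
        SELECT facs.name,
               book.slots * CASE WHEN book.memid = 0
                                 THEN facs.guestcost
                                 ELSE facs.membercost END AS cost
        FROM bookings AS book
        JOIN facilities AS facs ON book.facid = facs.facid
    )
    SELECT name, cost FROM t1 WHERE cost > 30
""").fetchall()
print(rows)  # [('Tennis Court 1', 50.0)]
```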
Great. Now as a final step I need to order by cost descending, and there's actually an issue: because I copy-pasted code from the previous exercise, I kept a DISTINCT, and you have to be very careful with this, especially when you copy-paste code (for learning, it would be best to always write it from scratch). The DISTINCT will remove duplicate rows, and that can actually cause a problem here. I remove the DISTINCT and I get the solution that I want, and if you look at the last two rows, you can see that they are absolutely identical, so the DISTINCT would have removed one of them, but these are two bookings that happen to be exactly the same in our data, and we want to keep them, not delete them; having DISTINCT was a mistake in this case. To summarize what we did here: first we joined all the tables so we could have all the columns we needed side by side, then we filtered on the date, pretty straightforward, then we took the first name and surname and concatenated them together, as well as the facility name, and then we computed the cost. To compute the cost, we took the number of slots and used a CASE WHEN to multiply it by either the guest cost or the member cost, according to the member's ID, and at the end we wrapped everything in a common table expression so that we could filter on this newly computed value of cost and keep only those bookings with a cost higher than 30. Now, I am aware that the question said not to use any subqueries; technically I didn't, because this is a common table expression. If you look at the author's solution, it is slightly different from ours: they did basically the same thing we did to compute the cost, except that in the CASE WHEN they inserted the whole expression, which is fine and works just the same. The difference is that they added a lot of logic in the WHERE filter, so that they could use a WHERE filter in the first query, and clearly
they didn’t use any columns that were added at the stage of the select they didn’t use cost for example because like we said that wouldn’t be possible so what they did is that they added the date filter over here and then in this case they added a um logical expression and in this logical expression either one of these two needed to be true for us to keep the row either the M ID is zero meaning that it’s a it’s a guest and so the calculation based on Guess cost ends up being bigger than 30 or the M ID is not zero which means it’s a member and then this calculation based on the member cost ends up being bigger than 30 so this works I personally think that there’s quite some repetition of the cost calculation both by putting it in the we filter and by uh putting it inside the case when and so I think that uh the solution we have here is a bit cleaner because we’re only calculating cost once uh in this case and then we’re simply referencing it thanks to the Common Table expression so if you look at the mental models course you will see that I warmly recommend not repeating logic in the code and using Common Table Expressions as often as possible because I think that they made the code uh clearer and um simpler to to understand produce a list of all members and the recommender without any joins now we have already Sol solved this problem and we have solved it with a self join as you remember we take the members table and join it on itself so that we can get this uh recommend by ID and plug it into members ID and then see the names of both the member and the recommender side by side but here we are challenged to do it without a join so let us go to the members table and let us select the first name and the surname now we actually want want to concatenate these two into a single string and call this member now how can we get data about the recommender without a self-join typically when you have to combine data you always have a choice between a join in a subquery right 
so what we can do is have a subquery here which takes the recommended by ID from this table, goes back to the members table, and gets the data we need. Let's see how that would look. Let us give an alias to this table and call it mems, and now we need to go back to this same table inside the subquery, where we can call it recs, and we want to select again the first name and surname, like we're doing here. And how are we able to identify the right row inside the subquery? We can use a WHERE filter: we want the recs member ID to be equal to the mems recommended by value. Once we get this value, we can call it recommender, and now we want to avoid duplicates, so after our outer SELECT we can say DISTINCT, which removes any duplicates from the result, and then we want to sort, I guess by member and recommender, and here we get our result. So we replaced a join with a subquery: we go row by row in members, take the recommended by ID, query the members table again inside the subquery, and use the WHERE filter to plug in that recommended by value and find the row whose member ID is equal to it; then, getting first name and surname, we get the data about the recommender, and that's how we can do it. In the mental models course we discuss subqueries, and for this particular case we talk about a correlated subquery. Why is this a correlated subquery? Because you can imagine that the query in here runs again for every row: for every row I have a different recommended by value, and I need to plug this value into the members table to get the data about the recommender. So this is a correlated subquery, because it runs every time and it is different for every row of the members table.

Produce a list of costly bookings, using a subquery: this is the exact exercise that we did before, and as you will remember, we actually ignored its instructions a bit, and we used not a subquery but a common table expression,
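The correlated-subquery version can be sketched as follows, again on a tiny invented members table; note how the inner query references the outer row's recommendedby value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (memid INTEGER, firstname TEXT, surname TEXT,
                      recommendedby INTEGER);
INSERT INTO members VALUES
  (1, 'Tracy',  'Smith',   NULL),
  (2, 'Darren', 'Smith',   1),
  (3, 'Gerald', 'Butters', 2);
""")

# Correlated subquery: for each outer row, the inner query re-runs with that
# row's recommendedby plugged into its WHERE filter. Where recommendedby is
# NULL, the subquery finds no row and returns NULL for the recommender.
rows = conn.execute("""
    SELECT DISTINCT
           mems.firstname || ' ' || mems.surname AS member,
           (SELECT recs.firstname || ' ' || recs.surname
            FROM members AS recs
            WHERE recs.memid = mems.recommendedby) AS recommender
    FROM members AS mems
    ORDER BY member, recommender
""").fetchall()
print(rows)
```

Running this yields one row per member, with the recommender's concatenated name or None, matching the left-join-like behavior described above.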
and for reference this is the code that we used; this code works for this exercise as well, and we get the result, so you can go back to that exercise to see the logic behind this code and why it works. If we look at the author's solution, they are actually using a subquery instead of a common table expression: they have an outer query, which is SELECT member, facility, cost FROM, and then instead of the name of a table after the FROM, they have all of this logic in a subquery, which they call bookings, and finally they add a filter and an order. Now, this is technically correct, it works, but I'm not a fan of writing queries like this; I prefer writing them as a common table expression, and I explain this in detail in my mental models course. The reason I prefer it is that it doesn't break queries apart: in my case, this is one query and this is another query, and it's pretty easy and simple to read, whereas in this case you start reading this query and then it is broken in two by another query, and when people do this, sometimes they go even further, and here, where you have the FROM, instead of a table you get yet another subquery, and it gets really complicated. Because these two approaches are equivalent, I definitely recommend going for a common table expression every time and avoiding subqueries unless they are really compact and you can fit them in one row.

Let us now get started with the aggregation exercises, and the first problem: count the number of facilities. I can go to the facilities table, and when I want to count the number of rows in a table, and here every row is a facility, I can use the COUNT(*) aggregation, and we get the count of facilities. What we see here is a global aggregation: when you run an aggregation without having done any grouping, it runs on the whole table, so it will take all the rows of this table, no matter how many, and compress them into one number, which is
determined by the aggregation function; in this case we have a count, and it returns a total of nine rows. In our map, aggregation happens right here: we source the table, we filter it if needed, and then we might do a grouping, which we didn't do in this case, but whether we do it or not, aggregations happen here, and if grouping didn't happen, the aggregation is at the level of the whole table. Count the number of expensive facilities: this is similar to the previous exercise. We can go to the facilities table, but here we can add a filter, because we're only interested in facilities that have a guest cost greater than or equal to 10, and once again I can use the COUNT(*) aggregation to count the number of rows of this resulting table. Looking again at our map, why does this work? Because with the FROM we're sourcing the table, immediately after, the WHERE runs and drops unneeded rows, then we can decide whether to GROUP BY or not, which in this case we're not doing, and then the aggregations run. So by the time the aggregations run, I've already dropped the rows in the WHERE, and this is why, after dropping some rows, the aggregation only sees six rows, which is what we want. Count the number of recommendations each member makes: in the members table we have a field, recommendedby, which holds the ID of the member who recommended the member this row is about. Now we want to get all these recommendedby values and count how many times they appear. I can go to my members table, and what I need to do here is GROUP BY recommendedby. This will take all the unique values of the recommendedby column and then allow me to run an aggregation on all of the rows in which each value occurs. Now I can go to the SELECT and call this column again, and if I run this query I get all the unique values of recommendedby without any repetitions, and now I can run an aggregation like COUNT(*).
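The two counting exercises above can be condensed into one runnable sketch. Again I use SQLite through Python's sqlite3 (a stand-in for the Postgres session in the video), with a cut-down facilities table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facilities (facid INTEGER, name TEXT, guestcost REAL)")
conn.executemany("INSERT INTO facilities VALUES (?,?,?)", [
    (0, "Tennis Court 1", 25), (1, "Tennis Court 2", 25),
    (2, "Badminton Court", 15.5), (3, "Table Tennis", 5),
    (4, "Massage Room 1", 80),
])

# COUNT(*) with no GROUP BY is a global aggregation: the whole table
# collapses into a single number.
total = conn.execute("SELECT COUNT(*) FROM facilities").fetchone()[0]

# With a WHERE clause, rows are dropped *before* the aggregation runs,
# so the count only sees the surviving rows.
expensive = conn.execute(
    "SELECT COUNT(*) FROM facilities WHERE guestcost >= 10"
).fetchone()[0]

print(total, expensive)
```

The second count is smaller precisely because WHERE runs first in the logical order of operations.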
What this will do is that for the recommendedby value 11, it will run this aggregation on all the rows in which recommendedby is 11, and the aggregation in this case is COUNT(*), which means it will return the number of rows in which 11 appears, which in the result happens to be one, and so on for all the values. What I also want to do is ORDER BY recommendedby to match the expected results. Now, what we get here is almost correct: we see all the unique values of this column and the number of times each appears in our data, but there's one discrepancy, which is this last row over here. In this last row you cannot see anything, which means it's a NULL value, a value that represents the absence of data. Why does this occur? If you look at the original recommendedby column, there is a bunch of NULL values in it, because there's a bunch of members that have NULL in recommendedby: maybe we don't know who recommended them, or maybe they weren't recommended and just applied independently. When you GROUP BY, you take all the unique values of the recommendedby column, and that includes the NULL value; the NULL value defines a group of its own, and the count works as expected, because we can see that there are nine members for whom we don't have the recommendedby value. But the solution does not want to see this, because we only want the number of recommendations each member has made, so we actually need to drop this row. How can I drop it? Well, it's as simple as going after the FROM and putting a simple filter saying recommendedby IS NOT NULL, and this will drop all of the rows in which that value is NULL, so they won't appear in the grouping, and now our results are correct. Remember, when you're checking whether a value is NULL or not, you need to use IS NULL or IS NOT NULL; you cannot use equal or not equal, because NULL is not an actual value, it's just a notation for the
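Here is a compact sketch of the NULL-group behaviour just described, again with SQLite via sqlite3 and toy data. One caveat: SQLite sorts NULLs first by default, whereas Postgres sorts them last, so the NULL group appears at the top here rather than the bottom.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (memid INTEGER, recommendedby INTEGER)")
conn.executemany("INSERT INTO members VALUES (?,?)", [
    (1, None), (2, None), (3, 1), (4, 1), (5, 2),
])

# Without a filter, NULL forms a group of its own
# (here: 2 members with no recommender).
with_null = conn.execute("""
    SELECT recommendedby, COUNT(*) FROM members
    GROUP BY recommendedby ORDER BY recommendedby
""").fetchall()

# IS NOT NULL drops those rows before grouping. '= NULL' would not work,
# because NULL is never equal (or unequal) to anything.
without_null = conn.execute("""
    SELECT recommendedby, COUNT(*) FROM members
    WHERE recommendedby IS NOT NULL
    GROUP BY recommendedby ORDER BY recommendedby
""").fetchall()

print(with_null)
print(without_null)
```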
absence of a value; so you cannot say that something is equal or not equal to NULL, you have to say that it is null or is not null. Let's list the total slots booked per facility. First question: where is the information that I need? The number of slots booked is in cd.bookings, and there I also have the facility ID, so I can work with that table. Now, how can I get the total slots for each facility? I can GROUP BY facility ID, then I can select that facility ID, and within each unique facility ID, what type of aggregation might I want to do? Every booking has a certain number of slots, right? So we want to find all the bookings for a certain facility ID and then sum all the slots that are being booked. I can write SUM(slots) over here, and then I want to name this column "Total Slots", looking at the expected results. But this will actually not work as-is, because it's two separate words, so I need to use quotes, and remember, I have to use double quotes because it's a column name: it's always double quotes for column names and single quotes for pieces of text. Finally, I need to ORDER BY facility ID, and I get the results. So for facility ID zero, we looked at all the rows where the facility ID was zero and squished all of them down to a single value, which is the unique facility ID, and then we looked at all the slots occurring in those rows and compressed them, squished them, to a single value as well, using the SUM aggregation, summing them all up, and we get the sum of the total slots. List the total slots booked per facility in a given month: this is similar to the previous problem, except that we are now isolating a specific time period. So let us think about how we can select bookings that happened in the month of September 2012. We can go to the bookings table and select the start time column, and to help our exercise I will ORDER BY start time descending and limit our results
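The per-facility total just described fits in a few lines. A minimal sketch, again SQLite via sqlite3 with stand-in bookings (note the double-quoted alias for the two-word column name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (facid INTEGER, slots INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?,?)", [
    (0, 2), (0, 3), (1, 2), (1, 2), (1, 4),
])

# GROUP BY squishes each facid down to one row; SUM(slots) squishes the
# matching slots values down to one number. Double quotes delimit the
# identifier "Total Slots"; single quotes would make it a string literal.
rows = conn.execute("""
    SELECT facid, SUM(slots) AS "Total Slots"
    FROM bookings
    GROUP BY facid
    ORDER BY facid
""").fetchall()

print(rows)
```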
to 20. You can see here that start time is a timestamp column, and it goes down to the second, because we have year, month, day, hour, minute, second. So how can we check whether any of these dates corresponds to September 2012? We could add a logical check here: we could say that start time needs to be greater than or equal to September 1st, 2012, and strictly smaller than October 1st, 2012, and this will actually work. As an alternative, there is a nice function that we could use, which is the following: date_trunc('month', starttime). Let's see what that looks like. What do you think this function does? As the name suggests, it truncates the date to a specific granularity that we choose here, so all of the timestamps are reduced to the very first moment of the month in which they occur. It is sort of cutting that date, removing some information and reducing the granularity. I could of course use other values here, such as 'day', and then every timestamp would be reduced to its day, but I actually want to use 'month'. Now that I have this, I can set an equality and say that I want this to be equal to September 1st, 2012, and this will actually work, and I also think it's nicer than the range we showed before. Now, I've taken the code from the previous exercise and copied it here, because it's actually pretty similar, except that now, after we get the bookings, we need to insert a filter to isolate our time range, and we can actually use this logical condition directly, so I'll delete all the rest. Now what I need to do is change the ordering: I actually need to order by the total slots here, and I get my result. To summarize: I get the bookings table, I take the start time timestamp and truncate it, because I'm only interested in the month of that time, I make sure the month is the one I actually need, then I'm grouping by facility ID, selecting the facility ID, and within each of those groups I'm summing
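A runnable sketch of the month filter follows. One assumption to flag: SQLite (used here via sqlite3) has no date_trunc, so I substitute strftime to the same effect; the Postgres form from the video is shown in the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (facid INTEGER, slots INTEGER, starttime TEXT)"
)
conn.executemany("INSERT INTO bookings VALUES (?,?,?)", [
    (0, 2, "2012-08-31 10:00:00"),  # August: should be excluded
    (0, 3, "2012-09-01 08:00:00"),
    (1, 4, "2012-09-21 18:30:00"),
    (1, 2, "2012-10-01 08:00:00"),  # October: should be excluded
])

# Postgres: WHERE date_trunc('month', starttime) = '2012-09-01'
# SQLite substitute: compare the truncated year-month string instead.
rows = conn.execute("""
    SELECT facid, SUM(slots) AS total
    FROM bookings
    WHERE strftime('%Y-%m', starttime) = '2012-09'
    GROUP BY facid
    ORDER BY total
""").fetchall()

print(rows)
```

The half-open range check (>= September 1st AND < October 1st) would select exactly the same rows.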
all the slots, and finally I'm ordering by this column. List the total slots booked per facility per month in the year 2012: again our data is in bookings, and now we want to see how we can isolate the time period of the year 2012 for this table. Once again I am looking at the start time column from bookings to see how we can extract the year. In the previous exercise we saw the date_trunc function, and we could apply it here as well: we could say date_trunc('year', starttime), because we want to see it at the year resolution, and then we would get something like this, and we could check that it is equal to 2012-01-01, and this would actually work. But there's actually a better way to do it: we could say EXTRACT(YEAR FROM starttime), and when we look at it here, we get an integer that actually represents the year, and it is easy now to just say equal to 2012 and make that test. If we look at what happened here, extract is a different function from date_trunc, because extract is getting the year and outputting it as an integer, whereas date_trunc still outputs a timestamp or a date, just with lower granularity, so you have to use one or the other according to your needs. Now, to proceed with our query, we can get cd.bookings, add a filter here, insert this expression in the filter, and we want the year to be 2012, so this will take care of isolating our desired time period. Next we want to get the total slots within groups defined by facility ID and month: we want a total for each facility for each month, as you can see here in the expected results, such that we can say that for facility ID zero, in the month of July in the year 2012, we booked 170 slots. So let's see how we can do that. This basically means that we have to group by multiple values, right? Facility ID is easy, we have it; however, we do not have the month, so how can we extract the month from the start time over here?
Well, we can use the extract function, which we just saw. If we write it like this and we put MONTH here, this function will look at the month and output it as an actual integer. And the thing is that I can group by the names of columns, but I can also group by transformations on columns; it works just as well: SQL will compute this expression over here, get the value, and then group by that value. Now, when it comes to getting the columns, what I usually do when I group by is to select the columns by which I grouped, so I just copy what I had here and add it to my query. Then, what aggregation do I want to do within the groups defined by these two columns? As we've seen in the previous exercise, I want to sum over the slots and get the total slots. I also want to take this column over here and rename it as month, and now I have to order by ID and month, and we get the data that we needed. So what did we learn with this exercise? We learned to use the extract function to get a number out of a date, and we have used grouping by multiple columns, which simply defines a group as the combination of the unique values of two or more columns; that's what multiple grouping does. We have also seen that not only can you group by a column name, you can also group by an operation over a column, and you should then reference that same operation in the SELECT statement so that you can get the value that was computed. Find the count of members who have made at least one booking: where is the data that we need? It's in the bookings table, and for every booking we have the ID of the member who made the booking, so I can select this column, and clearly I can run a COUNT on this column, and the count will return the number of non-null values. However, this count, as you can see, is quite inflated. What's happening here is that a single member can make any number of bookings, and now we're
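The two-level grouping above can be sketched as follows. As before, SQLite via sqlite3 stands in for Postgres, and since SQLite lacks EXTRACT, I approximate it with CAST(strftime(...) AS INTEGER); grouping by the expression works the same way in both systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (facid INTEGER, slots INTEGER, starttime TEXT)"
)
conn.executemany("INSERT INTO bookings VALUES (?,?,?)", [
    (0, 2, "2012-07-03 10:00:00"),
    (0, 3, "2012-07-05 10:00:00"),
    (0, 4, "2012-08-01 10:00:00"),
    (1, 2, "2012-08-02 10:00:00"),
    (1, 1, "2013-01-01 10:00:00"),  # wrong year: filtered out
])

# Postgres: EXTRACT(YEAR FROM starttime) / EXTRACT(MONTH FROM starttime)
# return integers; the CASTs below are the SQLite equivalent.
rows = conn.execute("""
    SELECT facid,
           CAST(strftime('%m', starttime) AS INTEGER) AS month,
           SUM(slots) AS slots
    FROM bookings
    WHERE CAST(strftime('%Y', starttime) AS INTEGER) = 2012
    GROUP BY facid, month
    ORDER BY facid, month
""").fetchall()

print(rows)
```

Each group is a unique (facid, month) combination, which is exactly what grouping by multiple columns means.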
basically counting all the bookings in here. But if I put DISTINCT in here, then I'm only going to count the unique values of memid in my bookings table, and this gives me the total number of members who have made at least one booking. So COUNT will get you the count of non-null values, and COUNT DISTINCT will get you the count of unique non-null values. List the facilities with more than 1,000 slots booked: what do we need to do here? We need to look at each facility and how many slots each booked. Where is the data for this? As you can see, again the data is in the bookings table. Now, I don't need any filter, so I don't need the WHERE statement, but I need to count the total slots within each facility, so I need a GROUP BY, and I can group by the facility ID. Once I do that, I can select the facility ID, and to get the total slots I can simply do SUM(slots) and call this "Total Slots" (double quotes for a column name). Now I need to add the filter: I want to keep those that have a sum of slots bigger than 1,000, and I cannot do it in a WHERE statement, right? If I were to write this in a WHERE statement, I would get an error that aggregate functions are not allowed in WHERE, and if I look at my map, we have been through this: the WHERE runs first, right after we source the data, whereas aggregations happen later, so the WHERE cannot be aware of any aggregations that I've done. For this purpose we actually have the HAVING component. The HAVING component works just like WHERE: it's a filter, it drops rows based on logical conditions. The difference is that HAVING runs after the aggregations, and it works on the aggregations. So I get the data, do my first filtering, then do the grouping, compute an aggregation, and then I can filter again based on the result of the aggregation. I can now go to my query, take this, put HAVING instead of WHERE, place it after the GROUP BY, and we get our result, and all we need to do is order by facility ID, and we
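Both ideas from this passage, COUNT DISTINCT and HAVING, fit in one small sketch (SQLite via sqlite3, toy bookings; both constructs behave the same in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (facid INTEGER, memid INTEGER, slots INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?,?,?)", [
    (0, 1, 500), (0, 2, 600), (1, 1, 300), (1, 3, 400),
])

# COUNT counts non-null values (one per booking, so it is inflated);
# COUNT(DISTINCT ...) counts unique non-null values (one per member).
counts = conn.execute(
    "SELECT COUNT(memid), COUNT(DISTINCT memid) FROM bookings"
).fetchone()

# HAVING runs *after* the aggregation, so it can filter on SUM(slots);
# putting this condition in WHERE would be an error.
busy = conn.execute("""
    SELECT facid, SUM(slots) AS total
    FROM bookings
    GROUP BY facid
    HAVING SUM(slots) > 1000
    ORDER BY facid
""").fetchall()

print(counts, busy)
```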
get our result. Find the total revenue of each facility: we want a list of facilities by name along with their total revenue. First question, as always: where is my data? If I want the facility's name, it's in the facilities table, but to calculate the revenue I need to know about the bookings, so I'll actually need to join both of these tables. I will write FROM cd.bookings book JOIN cd.facilities fac ON the facility ID. Next, I will want the total revenue of the facilities, but I don't even have the revenue yet, so my first priority should be to compute it. Let us first select the facility name, and here I will now need to add the revenue. To do that, I will need something like cost times slots, and that determines the revenue of each booking. However, I don't have a single value for cost: I have two values, member cost and guest cost, and as you remember from previous exercises, I need to choose each time which of them to apply. The way I can choose is by looking at the member ID: if it's zero, then I need to use the guest cost; otherwise I need to use the member cost. So what can we use in order to choose between these two variants for each booking? We can use the CASE statement. I will say CASE and then immediately close it with END, and I'll say WHEN book.memid equals zero THEN fac.guestcost (I always reference the parent table after a join, to avoid confusion) ELSE fac.membercost. This will allow me to get the cost dynamically: it lets me choose between two columns, and I can multiply this by slots and get the revenue. Now, if I run this, I get this result, which is the name of the facility and the revenue, but I need to ask myself: at what level am I working here? In other words, what does each row represent? Well, I haven't grouped yet, so each row here represents a single booking. Having joined bookings and facilities, and not having grouped anything, we are still at the level of this table, where every row represents a single
booking. So to find the total revenue for each facility, I now need to do an aggregation: I need to group by facility name and then sum all the revenue. I can actually do this within the same query by saying GROUP BY facility name, and if I run this, I will now get an error. Can you figure out why? I have grouped by facility name, and then I'm selecting the facility name, and that works well, because this column has been squished, compressed, to show only the unique names for each facility. However, I am then adding another column, revenue, which I have not compressed in any way, so this column has a different number of rows than this column. The general rule of grouping is that after I group by one or more columns, I can select the columns that are in the grouping, plus aggregations; nothing else is allowed. So the facility name is fine, because it's in the grouping, while revenue is not, because it's neither in the grouping nor an aggregation. To solve this, I can simply turn it into an aggregation by adding SUM over here, and when I run this, it actually works. Now all I need to do is sort by revenue, so if I say ORDER BY revenue, I get the result that I need. There are a few things going on here, but I can understand it by looking at my map. What I'm doing is first sourcing the data, actually joining two tables in order to create a new table where my data is; then I'm grouping by a column, the facility name, which compresses the column to the unique facility names; and next I run the aggregation. The aggregation can be a sum over an existing column, but as we saw in the mental models course, the aggregation can also be a sum over a calculation: I can actually run logic in there, it's very flexible. If I had a revenue column here, I would just say SUM(revenue) AS revenue, and it would be simpler, but I need to put some logic in there, and this logic
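Here is the join-plus-CASE-inside-SUM pattern as one self-contained sketch, with SQLite via sqlite3 and invented toy costs standing in for the real facilities data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facilities (facid INTEGER, name TEXT, membercost REAL, guestcost REAL)"
)
conn.execute("CREATE TABLE bookings (facid INTEGER, memid INTEGER, slots INTEGER)")
conn.executemany("INSERT INTO facilities VALUES (?,?,?,?)", [
    (0, "Tennis Court 1", 5, 25),
    (1, "Massage Room", 35, 80),
])
conn.executemany("INSERT INTO bookings VALUES (?,?,?)", [
    (0, 0, 2),  # memid 0 = guest booking
    (0, 3, 2),  # member booking
    (1, 0, 1),  # guest booking
])

# The CASE chooses the right cost per booking row; SUM then collapses
# the per-booking revenues into one number per facility name.
rows = conn.execute("""
    SELECT f.name,
           SUM(CASE WHEN b.memid = 0 THEN f.guestcost
                    ELSE f.membercost END * b.slots) AS revenue
    FROM bookings b
    JOIN facilities f ON b.facid = f.facid
    GROUP BY f.name
    ORDER BY revenue
""").fetchall()

print(rows)
```

Tennis Court 1 gets one guest booking (25 × 2) plus one member booking (5 × 2), for a total of 60.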
involves choosing whether to take the guest cost or the member cost, but I'm perfectly able to put that logic inside the SUM, so SQL will first evaluate this logic for each row and then sum up all the results, and it will give me revenue. Finally, after computing that aggregation, I select the columns that I need, and then I do an ORDER BY at the end. Find facilities with a total revenue of less than 1,000: the question is pretty clear, but wait a second, we calculated the total revenue by facility in the previous exercise, so we can probably just adapt that code. Here's the code from the previous exercise; check that out if you want to know how I wrote it. If I run this code, I do indeed get the total revenue for each facility, and now I just need to keep those with a revenue less than 1,000. How can I do that? It's a filter, right? I need to filter on this revenue column. I cannot use a WHERE filter, because this revenue is an aggregation, and it was computed after the GROUP BY, after the WHERE, so the WHERE wouldn't be aware of that column. But as we have seen, there is a statement called HAVING which does the same job as WHERE, filtering based on logical conditions, except that it works on aggregations. So I could say HAVING revenue smaller than 1,000. Unfortunately, this doesn't work; can you figure out why? In our query we do a grouping, then we compute an aggregation, then we give it a label, and then we try to run a HAVING filter on this label. If you look now at our map of the logical order of SQL operations, this is where the GROUP BY happens, this is where we compute our aggregation, and this is where HAVING runs. HAVING is trying to use the alias that is assigned at a later step, but according to our rules, HAVING does not know of that alias, because the step hasn't happened yet. Now, as the discussion for this exercise says, there are in fact database systems that try to make your life
easier by allowing you to use labels in HAVING, but that's not the case with Postgres, so we need a slightly different solution here. Note that if I repeated all of my logic in there instead of using the label, it would work: if I do this, I get my result, I just need to order by revenue, and you see that I get the correct result. Why does it work when I put the whole logic in there instead of using the label? Once again, the logic happens here, so HAVING is aware of this logic having run; it is just not aware of the alias. However, I do not recommend repeating logic like this in your queries, because it increases the chances of errors and also makes them less elegant, less readable. The simpler solution is to take this original query, put it in round brackets, create a virtual table using a common table expression, and call all of this t1; then we can treat t1 like any other table. I can say FROM t1 SELECT everything WHERE revenue is smaller than 1,000, then ORDER BY revenue, remove all the rest, and we get the correct answer. To summarize: you can use HAVING to filter on the result of aggregations; unfortunately, in Postgres you cannot use the labels that you assign to aggregations in HAVING. So if it's a really small aggregation, like SELECT SUM(revenue) and then all the rest, then it's fine to say SUM(revenue) smaller than 1,000; there's a small repetition, but it's not an issue. However, if your aggregation is more complex, as in this case, you don't really want to repeat it, and then you're forced to add an extra step to your query, which you can do with a common table expression. Output the facility ID that has the highest number of slots booked: first of all, we need to get the number of slots booked by facility, and we've actually done it before, but let's do it again. Where is our data? The data is in the bookings table, and we don't need to filter this table, but we do need to group by the
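The CTE workaround can be sketched as below. For brevity the aggregation is a simple SUM(slots) standing in for the longer revenue expression, but the shape is the same: compute and label the aggregate in the CTE, then filter it with a plain WHERE. (SQLite, used here via sqlite3, is one of the systems that would also accept the alias directly in HAVING; the CTE form works everywhere.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (facid INTEGER, slots INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?,?)", [
    (0, 2), (0, 3), (1, 20),
])

# The CTE t1 materialises the labelled aggregate, so the outer query can
# filter on 'total' with WHERE instead of repeating the SUM in HAVING.
rows = conn.execute("""
    WITH t1 AS (
        SELECT facid, SUM(slots) AS total
        FROM bookings
        GROUP BY facid
    )
    SELECT * FROM t1
    WHERE total < 10
    ORDER BY total
""").fetchall()

print(rows)
```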
facility ID, and once we do this, we can select the facility ID. This will isolate all the unique values of this column, and within each unique value we can sum the number of slots and call this total slots; if we do this, we get the total slots for each facility. Now, to get the top one, the quickest solution really would be to order by total slots and then limit the result to one. However, this would give me the one with the smallest number of slots, because ordering is ascending by default, so I need to turn this into descending, and here I get my solution. But given that this is a simple solution and it solved our exercise, can you imagine a situation in which this query would not achieve what we wanted? Let us say that there were multiple facilities that shared the top number of total slots. The top number of slots in our data set is 1404; that's all good, but suppose there were two facilities that had this top number and we wanted to see both of them for our business purposes. What would happen here is that everything else would work correctly, and the ordering would work correctly, but inevitably one of them would get the first spot and the other would get the second spot, and LIMIT 1 always cuts the output to a single row. Therefore, in this query we would only ever see one facility ID, even if there were more that shared the same top number of slots. So how can we solve this? Clearly, instead of combining ORDER BY and LIMIT, we need to figure out a filter: we need to filter our table such that only the facilities with the top number of slots are returned. But we cannot really get the maximum of the sum of slots in this query, because if I tried to do HAVING SUM(slots) equals MAX(SUM(slots)), I would be told that aggregate function calls cannot be nested, and if I go back to my map, I can see that HAVING can only run after all the aggregations have completed, but what we're trying to do here is add a new aggregation
inside HAVING, and that basically doesn't work. So the simplest solution here is to just wrap all of this in a common table expression, then take the table that we've just defined and SELECT * WHERE the total slots is equal to the maximum number of slots, which we know to be 1404. However, we cannot hardcode the maximum number of slots, because for one, we might not know what it is, and second, it will change with time, so this won't work when the data changes. What's the alternative to hardcoding? We actually need some logic here to get the maximum value, and we can put that logic inside a subquery: the subquery will go back to my table t1 and find the maximum of total slots from t1. First this subquery will run and get the maximum, then the filter will check for that maximum, and then I will get the required result. And this won't break if there are many facilities that share the same top spot: because we're using a filter, all of them will be returned, so this is a perfectly good solution. For your information, you can also solve this with a window function, which is a sort of row-level aggregation that doesn't change the structure of the data; we've seen it in detail in the mental models course. What I can do here is use a window function to get the maximum value over the sum of slots. I will say OVER to make it clear that this is a window function, but I won't put anything in the window definition, because I just want to look at my whole data set here, and I can label this max slots. If I look at the data, you can see that I get the maximum on every row, and then, to get the correct result, I can add a simple filter saying that total slots should be equal to max slots, and I will only return the facility ID and total slots. So this also solves the problem. What's interesting to note here, for the sake of understanding window functions more deeply, is that the aggregation
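The CTE-plus-subquery solution handles ties naturally, which the data below is rigged to show (two facilities share the top total). A minimal sketch, SQLite via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (facid INTEGER, slots INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?,?)", [
    (0, 6), (0, 4),   # facility 0: total 10
    (1, 10),          # facility 1: total 10 (tied for the top spot)
    (2, 3),           # facility 2: total 3
])

# The scalar subquery computes the maximum once; the WHERE filter then
# keeps every facility that matches it, so ties are all returned
# (ORDER BY ... DESC LIMIT 1 would have shown only one of them).
rows = conn.execute("""
    WITH t1 AS (
        SELECT facid, SUM(slots) AS total
        FROM bookings
        GROUP BY facid
    )
    SELECT facid, total FROM t1
    WHERE total = (SELECT MAX(total) FROM t1)
    ORDER BY facid
""").fetchall()

print(rows)
```

The window-function variant mentioned in the transcript (MAX(SUM(slots)) OVER ()) reaches the same result without the second scan of t1.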
function for this window clause works over an aggregation as well: here we sum the total slots over each facility, and then the window function gets the maximum of all of those values, which is quite a powerful feature. If I look at my map over here, I can see that it makes perfect sense, because here is where we group by facility ID, here is where we compute the aggregation, and the window comes later, so the window is aware of the aggregation and can work on it. So, a few different solutions here, and overall a really interesting exercise. List the total slots booked per facility per month, part two: this is a bit of a complex query, but the easiest way to grasp it is to look at the expected results. What we see here is a facility ID, and then within each month of the year 2012 we get the total number of slots; at the end of it we have a null value here, and for facility zero what we get is the sum of all slots booked in 2012; and then the same pattern repeats with every facility: we have the total within each month and then finally the total for that facility in the year. So there are two levels of aggregation here, and then if I go to the end, there's a third level, which is the total for all facilities within that year. So there are three levels of aggregation here, by increasing granularity: first the total over the year, then the total by facility over the year, and finally the total by facility by month within that year. This breaks the mold of what SQL usually does a bit, in the sense that SQL is not designed to return a single result with multiple levels of aggregation, so we will need to be a bit creative. But let us start now with the lowest level of granularity, facility ID and month; let's get this part right and then we'll build on top of it. The table that I need is the bookings table, and first question: do I need to filter this table? Yes, because I'm only
interested in the year 2012. We have seen that we can use the extract function to get the year out of a timestamp, which would be start time, and we can use this function in a WHERE filter: it will go to that timestamp and get an integer out of it, a number, and then we can check that this is the year we're interested in. Let's do a quick sanity check to make sure this worked: I will get some bookings here, and they are all in the year 2012. Next I need to define my grouping: I will need to group by facility ID, but then I will also need to group by month. However, I don't actually have a column named month in this table, so I need to calculate it, once again with the extract function: I can say EXTRACT(MONTH FROM starttime), and once again this will go to the start time and spit out an integer, which for this first row would be seven. As you know, in the GROUP BY I can use a column, but I can also use an operation over a column, which works just as well. Now, after grouping, I cannot do SELECT * anymore, but I want to see the columns that I have grouped by, so let us do a quick sanity check on that: it looks pretty good, I get the facility ID and the month, and I can actually label this month. Next, I simply need to take the sum over the slots within each facility and each month, and when I look at this, I have my first level of granularity, and you can see that the first row corresponds to the expected result. Now I need to add the next level of granularity, which is the total within each facility; can you think of how I can add that to my results? The key insight is to look at this expected results table and see it as multiple tables stacked on top of each other. One table is the one we have here, the total by facility and month; a second table that we will need is the total by facility; and the
third table that we will need is the overall total, which you can see here at the bottom. And how can we stack multiple tables on top of each other? With a UNION statement: UNION will stack all the rows from my tables on top of each other. So now let us compute the table which has the total by facility. I will actually copy-paste what I have here, and I just need to remove a level of grouping: if I do this, I will not group by month anymore, in either place, and once I do this, I get an error: each UNION query must have the same number of columns. Do you understand this error? I will write a bit to show you what's happening. How does it work when we union two tables? Let's say the first table in our case is facility ID, month, and then slots, and the second table, if you look here, is facility ID and then slots. When you union these two tables, SQL assumes that you have the same number of columns and that the ordering is also identical. Here we are failing because the first table has three columns and the second table has only two, and not only are we failing because of the column-count mismatch, we are also mixing the values of month and slots. This might run, because they're both integers, so SQL won't necessarily complain, but it is logically wrong. What we need to do is make sure that when we're unioning these two tables, we have the same number of columns and the same ordering as well. But how can we do this, given that the second table does indeed have one column less, that it carries less information? What I can do is put NULL over here: if I do SELECT NULL, this will create a column of a constant value, a column of all NULLs, and then the structure will become like this. Now, when I union, first of all I'm going to have the same number of columns, so I'm not going to see this error again, and second, the facility ID is going to be mixed with the
facility ID, slots is going to be mixed with slots, which is all good, and month is going to be mixed with NULL, which is what we want, because in some cases we will have the actual month and in some cases we won't have anything. So I have added NULL over here and I am unioning the tables, and if I run the query, I don't get any error anymore, and this is what I want. I can tell that this row is coming from the second table, because it has NULL in the value of month, so it's showing the total slots for facility zero across every month, whereas this row came from the upper table, because it's showing the sum of slots for a facility within a certain month. This achieves the desired result. Next we want to compute the last level of granularity, which is the total. Once again I will copy my query over here, and I don't even need to group by anymore, because it's the total number of slots over the whole year, so I can simply say SUM(slots) AS slots and remove the grouping. Next I can add the UNION as well, so that I can keep stacking these tables, and if I run this, I get the same error as before. Going back to our little sketch over here, we are now adding a third table, and this table only has slots, and of course this doesn't work, because there's a mismatch in the number of columns. The solution here is to also add a null column here and a null column here, so that I have the same number of columns: slots gets combined with slots and everything else gets filled with null values. I can do it here, making sure that the ordering is correct, so I will select NULL, NULL, and then SUM(slots), and if I run this query, I can see that the result works. The final step is to add the ordering, sorted by ID and month, so at the end of all of these unions I can say ORDER BY facility ID and month, and I finally get my result. So this is now the combination of three different tables stacked on top of each other, showing different levels of
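Here is the three-level UNION assembled end to end as a minimal sketch (SQLite via sqlite3, strftime standing in for EXTRACT). One behavioural difference worth flagging: SQLite's ORDER BY puts NULLs first by default, whereas Postgres puts them last, so the subtotal rows land at the top of each group here rather than the bottom.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (facid INTEGER, slots INTEGER, starttime TEXT)"
)
conn.executemany("INSERT INTO bookings VALUES (?,?,?)", [
    (0, 2, "2012-07-01 10:00:00"),
    (0, 3, "2012-08-01 10:00:00"),
    (1, 4, "2012-08-02 10:00:00"),
])

# Three SELECTs, one per level of granularity, stacked with UNION.
# The NULL constant columns pad the coarser tables so the column
# counts and positions line up across all three.
rows = conn.execute("""
    SELECT facid,
           CAST(strftime('%m', starttime) AS INTEGER) AS month,
           SUM(slots) AS slots
    FROM bookings
    WHERE strftime('%Y', starttime) = '2012'
    GROUP BY facid, month
    UNION
    SELECT facid, NULL, SUM(slots)
    FROM bookings
    WHERE strftime('%Y', starttime) = '2012'
    GROUP BY facid
    UNION
    SELECT NULL, NULL, SUM(slots)
    FROM bookings
    WHERE strftime('%Y', starttime) = '2012'
    ORDER BY facid, month
""").fetchall()

for row in rows:
    print(row)
```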
As you can see here in the schema, we added NULL columns to two of these tables just to make sure that they have the same number of columns and can stack up correctly. If we look again at the whole query, we can see that there are actually three SELECT statements in it, meaning three tables which are calculated and then finally stacked with UNION, and all of them do some pretty straightforward aggregation. The first one aggregates by facility ID and month, after extracting the month; the second one simply aggregates by facility ID; and the third one gets the sum of slots over the whole data, without any grouping. Then we add the constant NULL columns to make the column counts match. It's also worth seeing this in our map of the SQL operations: this order actually repeats for every table. For each of our three tables we get our data, we run a filter to keep the year 2012, we do a grouping, compute an aggregation, and select the columns that we need, adding NULL columns when necessary, and then it repeats all over: for the second table the same process, and for the third table the same process, except that in the third table we don't group by. When all three tables are done, the UNION runs and stacks them all up together, so instead of three tables I only have one table, and after the UNION has run I can finally order my table and return the result.
List the total hours booked per named facility. We want to get the facility ID, the facility name, and the total hours that they've been booked, keeping in mind that what we have here is the number of slots for each booking, and a slot represents 30 minutes of booking. To get my data I will need both the bookings table and the facilities table, because I need both the information on the bookings and the facility name, so I will get the bookings table and the facilities table and join them together. Next, I don't really need to filter on anything, but I need to group by facility, so I will group by facility ID, and I also need to group by facility name, otherwise I won't be able to use it in the SELECT part. Now I can select these two columns, and to get the total hours I will need the sum of the slots, so I can get the total number of slots within each facility, and I will need to divide this by two. So let's see what that looks like. Superficially this looks correct, but there's actually a pitfall in here, and to see the pitfall I will select SUM of slots as well, before dividing it by two. You can see it already in the first row: 911 divided by 2 is not quite 455. So what is happening here? The thing is that in Postgres, when you take an integer number, such as the sum of the slots, and you divide it by another integer, Postgres assumes that you are doing integer division, and since you are dividing two integers it returns an integer as well. That means the result is not exact if you are thinking in floating-point numbers. The solution is that at least one of the two numbers needs to be a floating-point number, so we can turn 2 into 2.0, and if I run this I now get the correct result. So it's important to be careful with integer division in Postgres; it is a potential pitfall. Now what I need to do is reduce the number of digits after the decimal point, so I need some sort of rounding, and for this I can use the ROUND function, which looks like this. This is a typical function in SQL, and it takes two arguments: the first argument is a column (and here that's this whole operation), and the second argument is how many figures you want to see after the decimal point. Now I can clean this up a bit, label this as total hours, and then I will need to order by facility ID.
And I get my result. So nothing crazy here, really: we source our data from a join, which is this part over here; then we group by two columns; we select those columns; and then we sum over the slots and divide, making sure not to have integer division, so one of the numbers becomes a floating-point number, and we round the result of this column.
List each member's first booking after September 1st, 2012. So, in order to get our data: where does our data live? We need the data about the members, and we also need the data about their bookings, so the data is actually in the members and bookings tables, and I will quickly join on these tables. We now have our data; do we need a filter on it? Yes, because we only want to look after September 1st, 2012, so we can say WHERE start time is greater than that, and it should be enough to just provide the date like this. Now, in the result we need the member's surname and first name and their member ID, and then we need to see the first booking in our data, meaning the earliest time, so again we have an aggregation here. In order to implement this aggregation, I need to group by all of the columns that I want to select (surname, first name, and member ID), and now that I have grouped by these columns I can select them. So I have grouped by each member, and I have all the dates for all their bookings after September 1st, 2012. Now, how can I look at all these dates and get the earliest one? What type of aggregation do I need to use? I can use the MIN aggregation, which will look at all of the dates and compress them into a single date, the smallest one, and I can call this start time. Finally I need to order by member ID, and I get the result that I needed. So this is actually quite straightforward: I get my data by joining two tables, I make sure I only have the data that I need by filtering on the time period, and then I group by all the information that I want to see for each member.
Then, within each member, I use MIN to get the smallest date, meaning the earliest date. Now, I wanted to give you a bit of insight into the subtleties of how SQL compares timestamps and dates, because the results here can be a bit surprising. So I wrote three logical expressions for you, and your job is to try to guess whether each of these three expressions will be true or false. Take a look at them and try to answer. As you can see, what we have here is a timestamp that indicates the 1st of September at 8:00, whereas here we simply have the date, the 1st of September, and the values are the same in all three; but my question is: are they equal, is this one greater, or is this one smaller? What do you think? I think the intuitive answer is to say that in the first case we have September 1st on one side and September 1st on the other; they are the same day, so this ought to be true. Here we again have the same day on both sides, so this one is not strictly bigger than the other, so this should be false, and it is also not strictly smaller, so this would be false as well. Now let's run the query and see what actually happens. What we see is that we thought this would be true, but it's actually false; we thought this would be false, but it's actually true; and this one is indeed false. So are you surprised by this result, or is it what you expected? If you are surprised, can you figure out what's going on? What is happening here is that you are running a comparison between two expressions which have a different level of granularity: the one on the left is showing you day, hour, minutes, and seconds, and the one on the right is showing you the date only. In other words, the value on the left is a timestamp, whereas the value on the right is a date: different levels of precision. To make the comparison work, SQL needs to convert one into the other; it needs to do something that is known technically as implicit type coercion.
What does that mean? The type is the data type, right, so either timestamp or date; type coercion is when you take a value and convert it to a different type; and it's implicit because we haven't asked for it, so SQL has to do it on its own, behind the scenes. So how does SQL choose which one to convert to the other? The choice is based on: keep the one with the highest precision and convert the other. We have the timestamp, with the higher precision, on the left, so SQL needs to convert the date into a timestamp; this is how SQL handles the situation, favoring the one with the highest precision. Now, in order to convert a date to a timestamp, what SQL will do is add all zeros here, so this will basically represent the very first second of the day of September 1st, 2012. We can verify what I just showed you: I'm going to comment this line and add another logical expression, which takes the timestamp with all zeros and sets it equal to the date. What do we expect to happen? We have two different types, there will be a type coercion, and then SQL will take this value on the right and turn it into exactly this value on the left; therefore, when I check whether they're equal, I should get true. It turns out that this is true, but I need to add another step, which is to cast this to a timestamp, and after I do that I get what I expected, which is that this comparison is true. What this ::timestamp notation does in Postgres is perform the type coercion, taking this date and forcing it into a timestamp. And I'll be honest with you, I don't understand exactly why I need to do this here; I thought that this would work simply with this part over here, but I also need to somehow explicitly tell SQL that I want this to be a timestamp. Nonetheless, this is the insight that we needed, and it allows us to understand why this comparison is actually false: because we are comparing a timestamp for the very first second of September 1st with a timestamp for the first second of the eighth hour of September 1st, and so it fails. We can also see why, on this line, the left side is bigger than the right-hand side; and this last one did not actually fool us, so we're good with that. Long story short: if you're just getting started, you might not know that SQL does this implicit type coercion in the background, and this date comparison might leave you quite confused. Now I've cleaned the code up a bit, and the question is: what do we need to do with the code in order to match our initial intuition? What do we need to do such that this line is true, the second line is false, and this one is still false, so we don't have to worry about it? Well, since the implicit coercion turns the date into a timestamp, we actually want to do the opposite: we want to turn the timestamp into a date. So it will be enough to do the type coercion ourselves and cast this into a date, like this, and when I run this new query I get exactly what I expected. Now I'm comparing at the level of precision, or granularity, that I wanted: I'm only looking at the date. So I hope this wasn't too confusing; I hope it was a bit insightful, and that you have a new appreciation for the complexities that can arise when you work with dates and timestamps in SQL.
Produce a list of member names with each row containing the total member count. Let's look at the expected results: we have the first name and the surname for each member, and then every single row shows the total count of members; there are 31 members in our table. Now, if I want to get the total count of members, I can take the members table and select COUNT(*), and this will give me 31, but I cannot add first name and surname to this; I will get an error, because COUNT(*) is an aggregation: it takes all 31 rows and produces a single number, which is 31, while I'm not aggregating first name and surname. So the standard aggregation doesn't work here.
I need an aggregation that doesn't change the structure of my table, one that works at the level of the row, and to have an aggregation that works at the level of the row I can use a window function. A window function looks like an aggregation followed by the keyword OVER and then the definition of the window. If I do this, I get the count at the level of the row, and to match the expected results I need to change the column order a bit, and I get the result that I wanted. So a window function has these two main components: an aggregation and a window definition. In this case the aggregation counts the number of rows, and the window definition is empty, meaning that our window is the entire table, so this aggregation will be computed over the entire table and then added to each row. There are far more details about window functions and how they work in my mental models course.
Produce a numbered list of members ordered by their date of joining. So I will take the members table and select the first name and surname, and to produce a numbered list I can use a window function with the ROW_NUMBER aggregation, so I'll say ROW_NUMBER() OVER. ROW_NUMBER is a special aggregation that works only in window functions, and what it does is number the rows monotonically, giving a number to each row starting from one and counting upward, and it never assigns the same number to two rows. In the window you need to define the ordering for the numbering. What is the ordering in this case? It's defined by the join date, and by default the ordering is ascending, so that's good. We can call this row number, and we get the results we wanted; again, you can find a longer explanation of window functions and ROW_NUMBER, with much more detail, in the mental models course.
Output the facility ID that has the highest number of slots booked. We've already solved this problem in a few different ways; let's see a new way to solve it.
We can go to our bookings table and group by facility ID, then get the facility ID in our SELECT, and then sum over slots to get the total slots booked for each facility. And since we're dealing with window functions, we can also rank facilities based on the total slots that they have booked, which looks like RANK() OVER (ORDER BY SUM(slots) DESC), and we can call this rk for rank. If I order by the sum of slots descending, I can see that my rank works as intended. We've seen this in the mental models course: you can think of RANK as deciding the outcome of a race, so the one that did the most in this case gets rank 1, and then everyone else gets ranks 2, 3, 4; but if two candidates got the same highest score, they would both get rank 1, because they would both have won the race, so to speak. And the rank here is defined over the window of the sum of slots, descending, which is what we need. Next, to get all the facilities that have the highest score, we can wrap this into a common table expression, then take that table and select the facility ID, label the sum column as total, get total, and filter for WHERE the ranking is equal to 1, and we get our result. Aside from how RANK works, the other thing to note in this exercise is that we can define the window based on an aggregation: in this case we are ordering the elements of our window based on the sum of slots. And if we look at our map over here, we can see that we get the data, we have our group by, we have the aggregation, and then we have the window. The window follows the aggregation, and so, based on our rules, the window has access to the aggregation and is able to use it.
Rank members by rounded hours used. The expected results are quite straightforward: we have the first name and the surname of each member, we have the total hours that they have used, and then we are ranking them based on that.
So where is the information for this result? We can see that it's in the members and bookings tables, so we will need to join these two tables, members joined to bookings on the member ID, and that's our join. Now we need to get the total hours, so we can group by the first name, and we also need to group by the surname, because we will want to display it, and now we can select these two columns. Next we need to compute the total hours, so how can we get that? For each member we know the slots they got at every booking, so we need to take all those slots and sum them up, and every slot represents a 30-minute interval, so to get the hours we need to divide this value by 2. And remember: if I take an integer like SUM of slots and divide it by 2, which is also an integer, I'm going to have integer division, so I won't have the decimal part in the result of the division, and that's not what I want; so instead of dividing by 2, I will divide by 2.0. So let's check how the data looks. This is looking good now, but if we read the question, we want to round to the nearest ten hours, so 19 should become 20 and 115 should become 120 (I believe we round up when we land exactly on a 5), and so on, as you can see here in the expected result. So how can we do this rounding? Well, we have a nifty ROUND function which takes as its first argument the column with all the values, and in the second argument we can specify how we want the rounding; to round to the nearest ten, you can put -1 here. Let's keep displaying the total hours as well as the rounded value, to make sure that we're doing it correctly, and as you can see, we are indeed rounding to the nearest ten, so this is looking good. To explain why I used -1 here and how the rounding function works, I will have a small section about it when we're done with this exercise, but meanwhile let's finish this exercise. Now I want to rank all of my rows based on this value that I have computed.
Since this is an aggregation, it will already be available to a window function, because in the logical order of operations, aggregations happen here and windows happen afterward, with access to the data from the aggregations. So it should be possible to transform this into a window function; think for a moment about how we could do that. The window function has its own aggregation, which in this case is a simple RANK, and then we have the OVER part, which defines the window. And what do we want to put in our window? In this case we want to order by, let's say, our rounded hours, and we want to order descending, because we want the member with the most hours to have the best rank. But clearly we don't have a column called rounded hours; what we have here is this logic over here, so I will substitute that name with my actual logic, and I will get my actual rank. Now I can delete the column that I was just looking at, and I can sort by rank, surname, first name. Small error here: I actually do need to show the hours as well, so I take the same logic again and call it hours, and I finally get my result. To summarize what we are doing in this exercise: we get our data by joining these two tables; then we group by the first name and the surname of the member; then we sum over the slots for each member, dividing by 2.0 to make sure we have an exact division, and use the rounding function to round to the nearest ten hours, so we get the hours; and we use the same logic inside a window function to get a ranking, such that the member with the most hours gets rank 1, the one with the second most hours gets rank 2, and so on, as you can see in the result. And I am perfectly able to use this logic to define the ordering of my window, because window functions can use aggregations, as seen in the logical order of SQL operations.
Window functions occur after aggregations, and that's it: then we just order by the required values and get our results.
Now, here's a brief overview of how rounding works in SQL. Rounding is a function that takes a certain number and returns an approximation of that number, which is usually easier to parse and easier to read, and you have the ROUND function, which works like this. The first argument is a value: it can be a constant, as in this case, where we just have a number, or it can be a column, in which case the ROUND function is applied to every element of the column. The second argument specifies how we want the rounding to occur. Here you can see the number from which we start, 48,292.7982, and the first rounding we apply has an argument of 2, which means that we really just want to see two digits after the decimal point, and this is what the first rounding does. We round down or up based on the digit that follows: if it's equal to or greater than 5 we round up, and if it's smaller than 5 we round down. So in this first example, the trailing 2 is less than 5, so we just get rid of it, and then we have an 8; 8 is greater than 5, so we have to round up, and when we round up, the 79 becomes an 80, and this is how we get to this first rounded value over here. Then we have ROUND with an argument of 1, which leaves one digit after the decimal point, which is this result over here; and then we have ROUND without any argument, which is actually the same as providing an argument of 0, meaning that we really just want to see the whole number. What's interesting to note is that the rounding function can be generalized to continue even after we've gotten rid of the entire decimal part, by providing negative arguments: ROUND with an argument of -1 really means that I want to round this number to the nearest 10, so you can see that from 48,292 we end up at 48,290, going to the nearest 10, and rounding with a value of
-2 means going to the nearest 100: the 290 part is closer to 300, so we have to round up, and we get 48,300. An argument of -3 means rounding to the nearest thousand: we have about 48,300, and the nearest thousand to that is 48,000. An argument of -4 means the nearest 10,000: given that we have about 48,000, the nearest 10,000 is 50,000. And finally, ROUND with -5 means rounding to the nearest 100,000, and given that we have about 48,000, the nearest 100,000 is actually 0; from here on, as we keep going more negative, we will always get 0 for this number. So this is how rounding works, in brief. It's a pretty useful function, and not everyone knows that you can provide it negative arguments; actually, I didn't know, and when I did the first version of this course a commenter pointed it out, so shout out to him (I don't know if he wants me to say his name), but hopefully now you understand how rounding works and you can use it in your problems.
Find the top three revenue-generating facilities. We want a list of the facilities that have the top three revenues, including ties; this is important. If you look at the expected results, we simply have the facility name and, a bit of a giveaway of what we will need to use, the rank of these facilities. Now, there's another exercise that we did a while back, "find the total revenue of each facility", and from that exercise I have taken the code that gets us to the point where we see the name of the facility and the total revenue for that facility; you can go back to that exercise to see in detail how this code works. In brief, we are joining the bookings and facilities tables, we are grouping by facility name, and then we are getting that facility name; and within each booking we compute the revenue by taking the slots and using a CASE WHEN to choose whether to use the guest cost or the member cost, and this is how we get the revenue for each single booking.
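The rounding behaviour just described can be sketched with constants (using the example value 48,292.7982 as discussed above):

```sql
-- ROUND's second argument counts digits after the decimal point;
-- negative values keep rounding into the tens, hundreds, thousands...
SELECT ROUND(48292.7982, 2),   -- 48292.80
       ROUND(48292.7982, 1),   -- 48292.8
       ROUND(48292.7982),      -- 48293
       ROUND(48292.7982, -1),  -- 48290
       ROUND(48292.7982, -2),  -- 48300
       ROUND(48292.7982, -3),  -- 48000
       ROUND(48292.7982, -4),  -- 50000
       ROUND(48292.7982, -5);  -- 0
```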
Now, given that we grouped by facility, we can sum all of these revenues to get the total revenue of each facility, and this is how we get to this point. Given this partial result, all that's left to do is to rank these facilities based on their revenue, so what I need here is a window function that will allow me to implement this ranking, and it would look something like this: I have a RANK. Why is RANK the right function, even though they sort of gave it away? Because if you want the facilities that have the top revenues, including ties, you can think of it as a race: all facilities are racing to get to the top revenue, and if two or three or four facilities get that top revenue, if there are more in the top position, you can't arbitrarily say "you are first and they are second"; you have to give them all rank 1, you have to recognize that they are all first. These types of problems call for a ranking solution. So our window function uses RANK as the aggregation, and then we need to define our window, and how do we define it? We define the ordering for the ranking here, so we can say ORDER BY revenue DESC, such that the highest revenue will get rank 1, the next highest will get rank 2, and so on. Now, this will not work, because I don't have a revenue column; I do have something here that is labeled as revenue, but the ranking part will not be aware of this label. However, I do have the logic to compute the revenue, so I can take the logic right here and paste it over here, and I will add a comma. This is not the most elegant-looking code, but let's see if it works: we need to order by revenue descending to see it in action, and if I do that, you can in fact see that the facility with the highest revenue gets rank 1, and then it goes down from there. So now I just need to clean this up a bit: first I will remove the revenue column, and then I will remove the ordering, and what I need here for the result is to
keep only the facilities that have a rank of 3 or smaller, so ranks 1, 2, 3. There's actually no way to do it in this query, so I have to wrap this query into a common table expression, then take that table and say SELECT * FROM t1 WHERE rank is smaller than or equal to 3, and I will need to order by rank ascending here, and I get the result I needed. So what happened here? We built upon the logic of getting the total revenue for each facility, which we saw in the previous exercise, and then we added a RANK window function, and within this rank we order by the total revenue. This might look a bit complex, but you have to remember that when we have many nested operations, you always start with the innermost operation and work your way up from there. The innermost operation is a CASE WHEN which chooses between guest cost and member cost and then multiplies it by slots; this inner operation over here is calculating the revenue for each single booking. The next operation is an aggregation that takes the revenue for each single booking and sums these revenues up to get the total revenue for each facility. And finally, the outermost operation takes the total revenue for each facility and orders the facilities in descending order, in order to figure out the ranking. The reason all of this works, we can see by going back to our map of SQL operations: after getting the table, the first thing that happens is the GROUP BY, then the aggregations, which is where we sum up the revenue, and after the aggregation is completed we have the window function, so the window function has access to the aggregations and can use them when defining the window. And finally, after we get the ranking, we have no way of isolating only the first three ranks within this query, so we need to do it with a common table expression, and if you look back at our map, this makes sense.
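Put together, the top-three query might look like this sketch (assuming, as in the earlier revenue exercise, that guest bookings are recorded under member ID 0):

```sql
-- Revenue per facility, ranked descending, then the top three ranks
-- (ties included) kept via a common table expression.
WITH t1 AS (
    SELECT f.name,
           RANK() OVER (
               ORDER BY SUM(CASE WHEN b.memid = 0
                                 THEN b.slots * f.guestcost
                                 ELSE b.slots * f.membercost
                            END) DESC) AS rank
    FROM bookings b
    JOIN facilities f ON b.facid = f.facid
    GROUP BY f.name
)
SELECT name, rank
FROM t1
WHERE rank <= 3
ORDER BY rank;
```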
What components do we have in order to filter our table, to keep only certain rows? We have the WHERE, which happens here, very early, and we have the HAVING, and they both happen before the window; after the window function you actually don't have another filter, so you need to use a common table expression.
Classify facilities by value. We want to classify facilities into equally sized groups of high, average, and low, based on their revenue, and you can see the result here: each facility is classified as high, average, or low. The point is that we decided at the beginning that we want three groups, and this is arbitrary; we could have said we want two groups, or five, or six, or seven. But we have three, and then all the facilities that we have are distributed equally within these groups: because we have nine facilities, we get three facilities within each group. And I can already tell you that there is a special function that will do this for us, so we will not go through the trouble of implementing it manually, which could be pretty complex. So I have copied here the code that allows me to get the total revenue for each facility, and we have seen this code more than once in past exercises, so if you're still not clear about how we get to this point, check out the previous exercises. What we did in the previous exercise was rank the facilities based on the revenue, and we did that by taking the RANK window function and defining our window as ORDER BY revenue DESC; except that we don't have a revenue column here, but we do have the logic to compute the revenue, so we can just take that logic and paste it in. And when I run this, I will get a rank for each of my facilities, where the biggest revenue gets rank 1 and it goes down from there. Now, the whole trick to solving this exercise is to replace the RANK aggregation with an NTILE aggregation and provide here the number of groups into which we
want to divide our facilities. If I run this, you see that I get what I need: the facilities have been equally distributed into three groups, where group number 1 has the facilities with the highest revenue, then we have group number 2, and finally group number 3, which has the facilities with the lowest revenue. To see how this function works, I will simply go to Google and search for "postgres ntile"; one of the first links is the Postgres documentation, and this is the page for window functions. If I scroll down, I can see all of the functions that I can use as window functions, and you will recognize some of our old friends here: ROW_NUMBER, RANK, DENSE_RANK, and here we have NTILE. What we see is that NTILE returns an integer ranging from 1 to the argument value, and the argument value is what we have here, which is the number of buckets, dividing the partition as equally as possible. So we call the NTILE function and provide how many buckets we want to divide our data into, and then the function divides the data as equally as possible into our buckets. How will this division take place? That depends on the definition of the window; in this case we are ordering by revenue descending, and this is how the NTILE function works. So we just need to clean this up a bit: I will remove the revenue part, because that's not required from us, and I will call this column, quite simply, ntile. Now I need to add a label on top of this ntile value, as you can see in the results, and to do that I will wrap this into a common table expression; once I have a common table expression, I don't need the ordering anymore, and then I can select from the table that I have just defined. What do I want to get from this table? I want the name of the facility, and then I want the ntile value with a label on top of it, so I will use a CASE WHEN statement to assign this label: CASE WHEN ntile equals 1, then 'high'; when ntile equals 2, then 'average';
else 'low'; end the case; and call this column revenue. Finally, I want to order by the ntile value, so the results first show high, then average, then low, and also by facility name, and I get the result that I wanted. So, to summarize: this is just like the previous exercise, except that we use a different window function, NTILE instead of RANK, so that we can bucket our data. And in the window, like we said in the previous exercise, there are a few nested operations, and you can figure them out by going to the deepest one and moving upwards: the first one picks the guest cost or the member cost and multiplies it by slots, getting the revenue for each single booking; the next one aggregates on top of this within each facility, so we get the total revenue by facility; and then we order by revenue descending, which defines our window, and this is what the bucketing system uses to distribute the facilities into the buckets based on their revenue. And then, finally, we need to add another layer of logic: here we need to use a common table expression, so that we can label our ntile values with the required text labels.
Calculate the payback time for each facility. This requires some understanding of the business reality that this data represents. If we look at the facilities table, we have an initial outlay, which represents the initial investment that was put into getting the facility, and we also have a monthly maintenance value, which is what we pay each month to keep the facility running; and of course we will also have a monthly revenue for each facility. So how can we calculate the amount of time that each facility will take to repay its cost of ownership? Let's actually write it down, so we don't lose track of it. We can get the monthly revenue of each facility, but what we're actually interested in is the monthly profit, and to get the profit we can subtract the monthly maintenance for each facility, so: revenue minus expenses equals profit.
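For reference, the classification exercise we just finished might be sketched like this (same assumptions as before: the exercise schema, with guests booked under member ID 0):

```sql
-- NTILE(3) splits the facilities into three equal buckets by revenue;
-- a CASE in the outer query turns bucket numbers into labels.
WITH t1 AS (
    SELECT f.name,
           NTILE(3) OVER (
               ORDER BY SUM(CASE WHEN b.memid = 0
                                 THEN b.slots * f.guestcost
                                 ELSE b.slots * f.membercost
                            END) DESC) AS bucket
    FROM bookings b
    JOIN facilities f ON b.facid = f.facid
    GROUP BY f.name
)
SELECT name,
       CASE WHEN bucket = 1 THEN 'high'
            WHEN bucket = 2 THEN 'average'
            ELSE 'low' END AS revenue
FROM t1
ORDER BY bucket, name;
```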
equals profit. When we know how much profit each facility makes per month, we can take the initial investment and divide it by the monthly profit, and that tells us how many months it will take to repay the initial investment. So let us do that now.

What I have done here, once again, is copy the code that calculates the total revenue for each facility; we have seen this in the previous exercises, so check those out if you still have questions about it. Now that we have the total revenue for each facility, and we know that we have three complete months of data so far, how do we get to the monthly revenue? It's as simple as dividing all of this by three. I will write 3.0 so we get proper division rather than integer division, and I will call this monthly revenue. The revenue column does not exist anymore, so I remove the ORDER BY, and here I can see the monthly revenue for each facility.

From the monthly revenue I can now subtract the monthly maintenance, which gives me the monthly profit. But now we get this error; can you figure out what it is about? Monthly maintenance does not appear in the GROUP BY clause. What we did here is group by facility name and select it, which is fine, and all the rest was aggregation. Remember the rule: when you GROUP BY, you can only select the columns that you have grouped by, plus aggregations, and monthly maintenance is not an aggregation. To make it work, we need to add it to the GROUP BY statement, and now I get the monthly profit.

Finally, the last step is to take the initial outlay and divide it by everything we have computed until now. We can call this column months, because it gives us the number of months needed to repay the initial investment. Again we get the same issue: initial outlay is not an aggregation and does not appear in the GROUP BY clause, and the easy solution is to just add it to the GROUP BY clause.

But something is pretty wrong here: the values look weird. Looking at the whole calculation we have done until now, can you figure out why? The issue is the order of operations. Because we have no round brackets, the order of operations is the following: initial outlay is divided by the total revenue, then that is divided by 3.0, and then the monthly maintenance is subtracted from all of it. That's not what we want. What we want is to take the initial outlay and divide it by everything else, which is the profit. So I add round brackets here and here, and now we get something that makes much more sense, because first we execute everything within the round brackets, which gives the monthly profit, and then we divide the initial outlay by it. Then we order by facility name, and we get the result.

Quite a representative business problem: calculating revenue, profit, and time to repay an initial investment. Overall it is just a chain of calculations: starting from the expression that gives us the revenue for each booking, we sum those revenues to get the total revenue for each facility, divide by three to get the monthly revenue, subtract the monthly expenses to get the monthly profit, and then divide the initial investment by the monthly profit, which gives the number of months it will take to repay the facility.

Next exercise: calculate a rolling average of total revenue. For each day in August 2012, we want to see a rolling average of total revenue over the previous 15 days. Rolling averages are quite common in business analytics, and the way it works is this: if you look at August 1st, the value here is the average of daily revenue for all facilities over the past 15 days, including August 1st itself. Then this average rolls, or slides, by
one day at a time, so the next average is the same one except shifted by one day, because now it includes the 2nd of August. So let's see how to calculate this.

Here I have basic code that calculates the revenue for each booking, taken from previous exercises, so if you have any questions, check those out. What we have is the name of each facility and the revenue for each booking; each row here represents a single booking. But if you think about it, we're not actually interested in seeing the name of the facility, because we're going to be summing over all facilities. What we are interested in is the date on which each booking occurs, because we want to aggregate within the date. To get the date, I can take the starttime field from bookings, and because this is a timestamp (it shows hours, minutes, seconds), I need to reduce it to a date. What I get is that each row is still a booking, and for each booking I know the date on which it occurred and the revenue it generated.

For the next step, I need the total revenue across all facilities within each date, and this is a simple grouping: if I group by the expression that gives me the date, I have compressed all the different occurrences of dates into unique values, one row per date. Now I need to compress all the different revenues for each date into a single value as well, and for that I put the revenue logic inside a SUM aggregation, as we have done before. This gives me the total revenue across all facilities for each day, and we have it here.

For the next step, my question for you is: how can I see the global average over all these revenues on each of these rows? That is a row-level aggregation that doesn't change the structure of the table, and that's a window function, right?
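The daily-total step described above (reduce the booking timestamp to a date, group by that expression, and sum the per-booking revenue) can be sketched as a self-contained demo. This sketch uses SQLite through Python's sqlite3 module rather than Postgres, and the table and values are made up for illustration; the real exercise derives the per-booking revenue by joining the pgexercises bookings and facilities tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per booking; revenue here is already slots * cost,
-- i.e. the innermost calculation from the earlier exercises.
CREATE TABLE booking_revenue (starttime TEXT, revenue REAL);
INSERT INTO booking_revenue VALUES
  ('2012-08-01 09:00:00', 25.0),
  ('2012-08-01 17:30:00', 75.0),
  ('2012-08-02 10:00:00', 40.0);
""")

# date(starttime) reduces the timestamp to a date; grouping by that
# same expression and summing compresses the bookings into one row
# per day, with the total revenue for that day.
rows = conn.execute("""
SELECT date(starttime) AS day,
       SUM(revenue)    AS revenue
FROM booking_revenue
GROUP BY date(starttime)
ORDER BY day
""").fetchall()

for day, revenue in rows:
    print(day, revenue)
```

Note that the GROUP BY repeats the date(starttime) expression rather than the alias, mirroring the point made above that aliases are not yet visible at that stage of the query.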
So I can add a window function here that takes the average of revenue OVER a window definition that I will leave open for now, because I want to look at the whole table. However, writing "revenue" inside the window function will not work, because revenue is just a label I've given to the column, and this part of the query is not aware of the label; I don't actually have a revenue column at this point. Instead of saying revenue, I can copy the aggregation logic itself into the window function, and that works, because the window function is computed after the aggregation, so it is aware of it. Now, for every row, I see the global average over all the daily revenues.

For the next step, I would like to order by date ascending, so the rows are in order, and my next question for you is: how can we make this a cumulative average? Say our rows are already ordered by date; how can I get the average to grow by date, so that on the first day the average equals the revenue (we only have one value), on the second day it is the average of the first two values, on the third day the average of the first three, and so on? The way to do that is to go to the window definition and add an ordering by date. Of course, the column date does not exist there either, because that's a label assigned after all of this is done; the window function is not aware of labels, but it works great with logic, so I take the date expression and put it inside the window. Now I get exactly what I wanted: on the first row the average equals the revenue, and as we move down, the average only looks at the current revenue and all the previous revenues, not at all of the revenues. On the second row we have the average of the first two revenues, on the third row the average of the first three, and so on.

You will realize that we are almost done with the problem. The only missing piece is that right now, if I pick a random day in the data set, the average is computed over all the revenues from the previous days: every day in my data that leads up to this one gets averaged. To finish the problem, instead of looking at all the days, I only want to look 15 days back, so I need to reduce how far back in time this window can extend, from unlimited to 15 days.

Here is where it gets interesting. We need to fine-tune the window definition so that it only looks 15 days back, and window functions give us that option. It turns out there is another element of the window definition which is usually implicit; it is not written explicitly, but it is there in the background: the ROWS part. I will now write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The ROWS part defines how far back and how far forward the window can look, and this particular clause is the standard behavior, the thing that happens by default, which is why we usually don't need to write it: look as far back in the past as the ordering allows, up to and including the current row. This is what we've been seeing until now, and indeed, if I run the query again after adding this part, the values don't change at all. Now, instead of UNBOUNDED PRECEDING, I want to look 14 rows back, which together with the current row makes 15. If I run this, my averages change, because I am now averaging over the current row and the 14 previous rows: the last 15 values.
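The frame clause just described can be sketched in isolation. This is a minimal demo using SQLite's window functions via Python's sqlite3 module, with fabricated daily totals (the revenue on day i is simply i, so the rolling averages are easy to check by hand); the real exercise runs against the pgexercises data in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
# 20 made-up days: day i has revenue i.
for i in range(1, 21):
    conn.execute("INSERT INTO daily_revenue VALUES (?, ?)",
                 (f"2012-07-{i:02d}", float(i)))

# ROWS BETWEEN 14 PRECEDING AND CURRENT ROW averages over at most
# 15 rows: the current one and the 14 before it in date order.
rows = conn.execute("""
SELECT day,
       AVG(revenue) OVER (
         ORDER BY day
         ROWS BETWEEN 14 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM daily_revenue
ORDER BY day
""").fetchall()

for day, avg in rows:
    print(day, avg)
```

On the first row the average equals the revenue (there is nothing preceding), on the fifteenth row it is the average of days 1 through 15, and from then on the frame slides: the last row averages days 6 through 20.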
Now, what's left to do to match the expected result: I remove the raw revenue column and call the rolling-average column revenue. Finally, we're only interested in values for the month of August 2012, so we need to add a filter. But we cannot add the filter in this query directly; if we added a WHERE clause here, isolating the period of August 2012, can you see what the problem would be? If the query could only see revenue starting from the 1st of August, it wouldn't be able to compute the rolling average for that day, because to get it you need to look two weeks back, into July. You need all the data to compute the rolling average, and you must filter after getting the result. So we wrap all of this into a common table expression (we won't need the ORDER BY inside the CTE anymore), and then, selecting from it, we filter so that the date falls in the required period: we truncate the date at the month level and check that the truncated value equals the month of August. We have seen how date_trunc works in the previous exercises. Then we select all of our columns and order by date. I believe we have a small extra error here, because I kept a partial WHERE statement, and once that's fixed, I finally get the result that I wanted.

This query was a bit more complex; it was the final boss of our exercises, so let's summarize it. We get the data we need by joining bookings and facilities, and we compute the revenue for each booking: slots multiplied by either the guest cost or the member cost, depending on whether the member is a guest or not. Then we group by the date, which you see over here, and sum all of those revenues, so that we get the total revenue within each day across all facilities. Then the total revenue for each day
goes into a window function, which computes an aggregation at the level of each row. The window function computes the average of these total revenues within a specific window, and the window defines an ordering based on time: the ordering by date. The default behavior would be to average the current day and all the days that precede it, back to the earliest date, and what we do is fine-tune that behavior by saying: don't look all the way back in the past; only look at the 14 preceding rows plus the current row. Given the time ordering, that means we compute the average over the last 15 values of total revenue. Finally, we wrap this in a common table expression, filter so that we only see the rolling average for the month of August, and order by date.

And those were all the exercises that I wanted to do with you. I hope you enjoyed it and learned something new. As you know, there are more sections on the site that go deeper into date functions, string functions, and modifying data; I really think you can tackle those on your own. These were the essential ones I wanted to address. Once again, thank you to the author of this website, Alisdair Owens, who created it and made it available for free; I did not create it, and you can go there and do these exercises without signing up or paying anything.

My final advice: don't be afraid of repetition. We live in the age of endless content, so there's always something new to do, but there is a lot of value in repeating the same exercises over and over again. When I was preparing for interviews, when I began as a data engineer, I did these exercises maybe three or four times altogether, and I found it really helpful, because often I did not remember the solution and had to think through it all over again, which strengthened those learning patterns for me. So now that you've gone through all the exercises and seen my solutions, let it rest for a bit, then come back and try to do them again; I think it will be really beneficial.

In my course, I start from the very basics and show you in depth how each of the SQL components works. I explore the logical order of SQL operations, and I spend a lot of time in Google Sheets simulating SQL operations in a spreadsheet, coloring cells, moving them around, and making drawings in Excalidraw, so that I can help you understand in depth what's happening and build mental models for how SQL operations work. This course was actually intended as a complement to that, so be sure to check it out.
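To close, the shape of that final rolling-average query can be sketched end to end. This is a minimal, self-contained demo using SQLite via Python with made-up numbers (July days earn 100, August days earn 200); the WHERE clause sits outside the CTE, so August 1st's rolling average still sees the July rows. SQLite has no date_trunc, so a LIKE pattern on the month prefix stands in for the Postgres month filter used in the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
# Fabricated daily totals: 100 per day in July, 200 per day in August.
for i in range(1, 32):
    conn.execute("INSERT INTO daily_revenue VALUES (?, 100.0)",
                 (f"2012-07-{i:02d}",))
for i in range(1, 32):
    conn.execute("INSERT INTO daily_revenue VALUES (?, 200.0)",
                 (f"2012-08-{i:02d}",))

# The rolling average is computed over ALL days inside the CTE;
# only afterwards do we restrict the output to August, because
# filtering earlier would hide the July rows the window needs.
rows = conn.execute("""
WITH rolling AS (
  SELECT day,
         AVG(revenue) OVER (
           ORDER BY day
           ROWS BETWEEN 14 PRECEDING AND CURRENT ROW
         ) AS revenue
  FROM daily_revenue
)
SELECT day, revenue
FROM rolling
WHERE day LIKE '2012-08-%'
ORDER BY day
""").fetchall()

for day, avg in rows:
    print(day, avg)
```

August 1st averages 14 July days at 100 plus one August day at 200, i.e. 1600/15; by mid-August the window contains only August rows and the average settles at 200.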

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Key Achievements by 40 That Signal Success Beyond Conventional Metrics

    Key Achievements by 40 That Signal Success Beyond Conventional Metrics

    Reaching 40 with a sense of accomplishment often transcends traditional markers like job titles or material wealth. True success lies in cultivating intangible qualities and experiences that foster personal growth, resilience, and meaningful connections. Below are fourteen milestones that reflect a life well-lived, each explored in two detailed paragraphs.

    1. Mastery of a Non-Professional Skill
    Developing expertise in a skill unrelated to one’s career—such as gardening, playing a musical instrument, or mastering ceramics—signifies a commitment to lifelong learning and self-expression. These pursuits offer a respite from daily routines, allowing individuals to channel creativity and find joy outside professional obligations. For instance, someone who learns furniture restoration not only gains a hands-on craft but also discovers patience and precision, traits that enhance problem-solving in other areas of life.

    Beyond personal fulfillment, such skills often ripple into community impact. A home chef might host cooking classes for neighbors, fostering camaraderie, while a fluent speaker of a second language could bridge cultural gaps in their community. These endeavors underscore the value of investing in oneself for both individual enrichment and collective benefit, proving that growth extends far beyond the workplace.

    2. Prioritizing Knowledge Sharing Over Material Accumulation
    Those who focus on imparting wisdom—through mentoring, creating educational content, or leading workshops—build legacies that outlast physical possessions. A software engineer who tutors underprivileged students in coding, for example, empowers future innovators while refining their own communication skills. This exchange of knowledge strengthens communities and creates networks of mutual support.

    The act of sharing expertise also cultivates humility and purpose. By teaching others, individuals confront gaps in their own understanding, sparking curiosity and continuous learning. A retired teacher writing a memoir about classroom experiences, for instance, preserves decades of insight for future generations. Such contributions highlight that true wealth lies not in what one owns, but in the minds one inspires.

    3. Embracing a Culturally Expansive Worldview
    Engaging deeply with diverse cultures—whether through travel, language study, or friendships with people from different backgrounds—nurtures empathy and adaptability. Someone who volunteers abroad or participates in cultural exchanges gains firsthand insight into global challenges, from economic disparities to environmental issues. These experiences dismantle stereotypes and encourage collaborative problem-solving.

    A global perspective also enriches personal and professional relationships. Understanding cultural nuances can improve teamwork in multinational workplaces or foster inclusivity in local communities. For example, a business leader who studies international markets may develop products that resonate across borders. This openness to diversity becomes a compass for navigating an interconnected world with grace and respect.

    4. Living by a Personal Philosophy
    Crafting a unique set of guiding principles by 40 reflects introspection and maturity. Such a philosophy might emerge from overcoming adversity, such as navigating a health crisis, which teaches the value of resilience. Others might draw inspiration from literature, spirituality, or ethical frameworks, shaping decisions aligned with integrity rather than societal expectations.

    This self-defined ethos becomes a foundation for authenticity. A person who prioritizes environmental sustainability, for instance, might adopt a minimalist lifestyle or advocate for policy changes. Living by one’s values fosters inner peace and earns the trust of others, as actions consistently mirror beliefs. This clarity of purpose transforms challenges into opportunities for alignment and growth.

    5. Redefining Failure as a Catalyst for Growth
    Viewing setbacks as stepping stones rather than endpoints is a hallmark of emotional resilience. An entrepreneur whose first venture fails, for example, gains insights into market gaps and personal leadership gaps, paving the way for future success. This mindset shift reduces fear of risk-taking, enabling bold choices in careers or relationships.

    Embracing failure also fosters humility and adaptability. A writer receiving repeated rejections might refine their voice or explore new genres, ultimately achieving breakthroughs. By normalizing imperfection, individuals inspire others to pursue goals without paralyzing self-doubt, creating cultures of innovation and perseverance.

    6. Cultivating a Geographically Diverse Network
    Building relationships across continents—through expatriate experiences, virtual collaborations, or cultural clubs—creates a safety net of varied perspectives. A professional with friends in multiple countries gains access to unique opportunities, from job referrals to cross-cultural insights, while offering reciprocal support.

    Such networks also combat insular thinking. A designer collaborating with artisans in another country, for instance, blends traditional techniques with modern aesthetics, creating innovative products. These connections remind individuals of shared humanity, fostering global citizenship and reducing prejudice.

    7. Attaining Financial Autonomy
    Financial stability by 40 involves strategic planning, such as investing in retirement accounts or diversifying income streams. This security allows choices like pursuing passion projects or taking sabbaticals, as seen in individuals who transition from corporate roles to social entrepreneurship without monetary stress.

    Beyond personal freedom, financial literacy inspires others. A couple who mentors young adults in budgeting empowers the next generation to avoid debt and build wealth. This autonomy transforms money from a source of anxiety into a tool for creating opportunities and generational impact.

    8. Committing to Holistic Self-Care
    A consistent self-care routine—integrating physical activity, mental health practices, and nutritional balance—demonstrates self-respect. A parent who prioritizes morning yoga amidst a hectic schedule models the importance of health, improving their energy and patience for family demands.

    Such habits also normalize vulnerability. Openly discussing therapy or meditation reduces stigma, encouraging others to seek help. By treating self-care as non-negotiable, individuals sustain their capacity to contribute meaningfully to work and relationships.

    9. Thriving Through Life’s Transitions
    Navigating major changes—divorce, career pivots, or relocation—with grace reveals emotional agility. A professional moving from finance to nonprofit work, for instance, leverages transferable skills while embracing new challenges, demonstrating adaptability.

    These experiences build confidence. Surviving a layoff or health scare teaches problem-solving and gratitude, equipping individuals to face future uncertainties with calmness. Each transition becomes a testament to resilience, inspiring others to embrace change as a path to reinvention.

    10. Finding Humor in Adversity
    Laughing during tough times, like diffusing family tension with a lighthearted joke, fosters connection and perspective. This skill, rooted in self-acceptance, helps individuals avoid bitterness and maintain optimism during crises.

    Humor also strengthens leadership. A manager who acknowledges their own mistakes with wit creates a culture where employees feel safe to innovate. This approach transforms potential conflicts into moments of unity and learning.

    11. Transforming Passions into Tangible Projects
    Turning hobbies into impactful ventures—launching a community garden or publishing a poetry collection—merges joy with purpose. A nurse writing a blog about patient stories, for instance, raises awareness about healthcare challenges while processing their own experiences.

    These projects often spark movements. A local art initiative might evolve into a regional festival, boosting tourism and fostering creativity. By dedicating time to passions, individuals prove that fulfillment arises from aligning actions with values.

    12. Elevating Emotional Intelligence
    High emotional intelligence—empathizing during conflicts or regulating stress—strengthens relationships. A leader who acknowledges team frustrations during a merger, for example, builds trust and loyalty through transparency and active listening.

    This skill also aids personal well-being. Recognizing burnout signs and seeking rest prevents crises, modeling healthy boundaries. Emotionally intelligent individuals create environments where others feel seen and valued.

    13. Solidifying an Authentic Identity
    Resisting societal pressures to conform—like pursuing unconventional careers or lifestyles—affirms self-worth. An artist rejecting commercial trends to stay true to their vision inspires others to embrace uniqueness.

    This authenticity attracts like-minded communities. A professional openly discussing their neurodiversity, for instance, fosters workplace inclusivity. Living authentically encourages others to shed pretenses and celebrate individuality.

    14. Embracing Lifelong Learning
    A growth mindset fuels curiosity, whether through enrolling in courses or exploring new technologies. A mid-career professional learning AI tools stays relevant, proving adaptability in a changing job market.

    This attitude also combats stagnation. A retiree taking up painting discovers hidden talents, illustrating that growth has no age limit. By valuing progress over perfection, individuals remain vibrant and engaged throughout life.

    In conclusion, these milestones reflect a holistic view of success—one that prioritizes resilience, empathy, and self-awareness. By 40, those who embody these principles not only thrive personally but also uplift others, leaving legacies that transcend conventional achievements.

  • China’s AI Surprise: Deepseek and the Open-Source Revolution

    China’s AI Surprise: Deepseek and the Open-Source Revolution

    DeepSeek, a Chinese AI research lab, has created a surprisingly low-cost, high-performing open-source AI model that rivals leading American models from companies like OpenAI and Google. This breakthrough challenges the previously held belief in American AI supremacy and highlights the potential of open-source models. The development raises concerns about the implications for American leadership in AI, the cost-effectiveness of large language model development, and the potential for Chinese government control over AI narratives. Experts debate whether this signifies China catching up with or surpassing the US in the AI race and discuss the impact on the future of AI development and investment. The competitive landscape is rapidly evolving, with a focus shifting toward more efficient and cost-effective models, particularly in reasoning capabilities.

    China’s AI Leap: A Study Guide

    Short Answer Quiz

    1. What is Deepseek and why is it significant in the AI landscape?
    2. How did Deepseek manage to achieve impressive results with relatively low funding?
    3. What are some of the technical innovations that Deepseek employed in developing their AI models?
    4. How does Deepseek’s model compare to models from OpenAI, Meta, and Anthropic?
    5. What is the significance of Deepseek’s model being open-source?
    6. How has China’s AI progress impacted the view of some experts who once believed China was far behind the U.S.?
    7. What is the concept of model distillation, and how did Deepseek use it?
    8. How are U.S. government restrictions on semiconductor exports impacting China’s AI development?
    9. What are the concerns regarding Chinese AI models adhering to “core socialist values”?
    10. What does the term “commoditization of large language models” mean in the context of the source material?

    Short Answer Quiz – Answer Key

    1. Deepseek is a Chinese research lab that has developed a high-performing, open-source AI model. Its significance lies in its ability to achieve top-tier results with far less funding than leading U.S. companies, demonstrating a leap in Chinese AI capabilities.
    2. Deepseek achieved impressive results by using less powerful but more readily available chips, optimizing their models’ efficiency, employing techniques like model distillation, and focusing on innovative solutions in training. This resourceful approach helped them bypass U.S. chip restrictions.
    3. Deepseek’s technical innovations include using mixture of experts models, achieving numerical stability in training, and figuring out floating point-8 bit training. These solutions allowed them to train their models more efficiently with less computing power.
    4. Deepseek’s model has been shown to outperform some models from OpenAI, Meta, and Anthropic in certain benchmarks, often at a fraction of the cost. It has also demonstrated strong capabilities in math, coding, and reasoning.
    5. The open-source nature of Deepseek’s model is significant because it allows developers to build upon it and customize it for their needs without incurring high development costs. This accessibility could lead to broader adoption, challenging the dominance of proprietary models.
    6. Experts like former Google CEO Eric Schmidt, who previously thought the U.S. was ahead of China in AI by 2-3 years, now acknowledge that China has caught up significantly in a short period, highlighting the rapid advancements made in the Chinese AI sector.
    7. Model distillation involves using a large, complex model to train a smaller, more efficient model. Deepseek used this process to transfer the knowledge and capabilities of large models to their smaller ones, resulting in cost and efficiency improvements.
    8. U.S. restrictions on semiconductor exports, specifically high-end GPUs, have limited the amount of computing power available to Chinese AI developers. However, China has innovated ways to work with lower end GPUs and still achieve significant breakthroughs in the AI field.
    9. There are concerns about Chinese AI models being required to adhere to “core socialist values” as this can lead to censorship, denial of human rights abuses, and political bias. This raises issues of trust and the potential for autocratic control of AI.
    10. The “commoditization of large language models” refers to the increasing availability and decreasing cost of high-quality AI models, including open-source options. This trend is making the technology more accessible to a broader range of developers, disrupting the dominance of expensive, closed-source models.

    Essay Questions

    1. Analyze the impact of Deepseek’s breakthrough on the competitive landscape of the AI industry, particularly for leading American firms like OpenAI.
    2. Discuss the strategic implications of China’s open-source AI model for the future of global technology infrastructure and international relations.
    3. Evaluate the claim that U.S. government restrictions on semiconductor exports have inadvertently spurred innovation in China’s AI sector.
    4. Compare and contrast the open-source and closed-source approaches to AI development, using examples from the text and considering their respective advantages and disadvantages.
    5. Explore the ethical and societal implications of widely available, potentially biased, AI models, focusing on the contrasting values of democratic and autocratic AI systems.

    Glossary of Key Terms

    Artificial General Intelligence (AGI): A hypothetical type of AI that is capable of understanding, learning, and applying knowledge across a wide range of tasks at the level of a human being.

    Closed-source model: AI models where the underlying code and training data are proprietary and not accessible to the public. Examples include OpenAI’s GPT models.

    Commoditization: The process by which a product or service becomes widely available, less differentiated, and cheaper. In the context of AI, it refers to the increasing availability of high-quality language models.

    Distillation (model): A training technique where a large, complex model (the “teacher”) is used to train a smaller, more efficient model (the “student”).

    Floating Point-8 (FP8) Training: A low-precision numerical format used in machine learning that reduces memory usage and accelerates training without significant accuracy loss, provided numerical stability is maintained.

    GPU (Graphics Processing Unit): A specialized electronic circuit designed to accelerate the creation of images and perform general-purpose computations required for AI model training.

    Large Language Model (LLM): A type of AI model trained on a vast amount of text data, capable of understanding and generating human-like text.

    Mixture of Experts (MoE): A type of neural network architecture that combines multiple specialized sub-networks (experts) to tackle complex tasks more effectively.
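
A toy top-k routing step shows the core trick: a router scores every expert, but only the best-scoring few actually run, so compute scales with k rather than with the total number of experts. Everything here (shapes, the router, the experts) is illustrative:

```python
import numpy as np

def moe_forward(x, experts, router, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    scores = router @ x                        # one score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the winning experts
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates /= gates.sum()                       # softmax over the chosen experts only
    # Only the selected experts execute; the rest are skipped entirely.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))

rng = np.random.default_rng(0)
num_experts, d = 8, 4
experts = rng.normal(size=(num_experts, d, d))  # each expert is a small linear map
router = rng.normal(size=(num_experts, d))
y = moe_forward(rng.normal(size=d), experts, router, top_k=2)
```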

    Open-source model: AI models where the underlying code, training data, and model parameters are accessible to the public, allowing for free use, modification, and distribution.

    Reasoning Model: An AI model that can perform logical analysis and problem-solving beyond pattern recognition, thinking and deducing information rather than just generating responses based on inputs.

    Reinforcement Learning: A type of machine learning where an agent learns to make decisions by trial and error, guided by rewards or penalties.
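
The trial-and-error loop can be illustrated with a multi-armed bandit, one of the simplest reinforcement-learning settings (all numbers here are illustrative):

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: estimate each arm's value purely by trial and error."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)   # running value estimates, one per arm
    n = [0] * len(true_means)     # how often each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))                  # explore at random
        else:
            arm = max(range(len(true_means)), key=q.__getitem__)  # exploit best guess
        reward = true_means[arm] + rng.gauss(0, 1)                # noisy reward signal
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]                      # incremental mean update
    return q

estimates = run_bandit([0.2, 0.5, 1.0])
# Given enough trials, the estimates rank the arms by their true payoff.
```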

    Semiconductor Restrictions: Government policies that restrict or control the export of semiconductor technology, often motivated by national security or economic reasons.

    Token: In the context of language models, a token is a unit of text that is processed by the model (words, parts of words, punctuation marks, etc.).
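
For a rough feel for tokenization, a naive word/punctuation split can stand in (real LLM tokenizers use learned subword vocabularies such as BPE, so actual token counts differ):

```python
import re

def rough_tokens(text):
    """Naive tokenizer: runs of word characters and individual punctuation marks.
    Real models split text into learned subword units instead."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokens("Deepseek's model costs $0.10 per million tokens.")
# Note how "Deepseek's" and "$0.10" each break into several tokens.
```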

    Transformer: A neural network architecture that has revolutionized natural language processing. It uses self-attention mechanisms to weigh the importance of different parts of an input.
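
The self-attention mechanism at the heart of the transformer can be sketched in a few lines of NumPy (dimensions and weight matrices here are random placeholders):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every other."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # pairwise relevance scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a distribution
    return weights @ V                               # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))              # one row per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # same shape as X
```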

    China’s AI Rise: Deepseek’s Impact on the Global Landscape

    Briefing Document: China’s AI Breakthrough and Implications

    Date: October 26, 2024

    Subject: Analysis of China’s AI advancements, particularly Deepseek’s breakthroughs, and their impact on the global AI landscape, including the US AI industry.

    Sources: Excerpts from “Pasted Text”

    Executive Summary:

    This briefing analyzes recent developments in Chinese AI, particularly the emergence of Deepseek, an AI lab that has created an open-source model that rivals and in some cases surpasses leading American models, such as those from OpenAI and Anthropic, at a significantly lower cost. The implications are far-reaching, challenging the assumption of US AI dominance, and raising concerns about the potential for a shift in global AI leadership. The briefing examines the nature of Deepseek’s achievement, the strategic context of the US-China AI race, and the potential impact on companies like OpenAI.

    Key Themes and Ideas:

    1. Deepseek’s Unexpected Breakthrough:
    • Cost Efficiency: Deepseek developed a highly competitive AI model (Deepseek v3) for a reported $5.6 million, compared to billions spent by US counterparts like OpenAI and Google. This is a major shock to the Silicon Valley AI industry.
    • Quote: “The AI lab reportedly spent just $5.6 million dollars to build Deepseek version 3. Compare that to OpenAI, which is spending $5 billion a year, and Google, which expects capital expenditures in 2024 to soar to over $50 billion.”
    • Performance: Deepseek’s open-source model outperforms Meta’s Llama, OpenAI’s GPT-4o, and Anthropic’s Claude Sonnet 3.5 on accuracy across wide-ranging tests, including math problems, coding competitions, and bug fixing. Its reasoning model (R1) also rivals OpenAI’s o1 on certain tests.
    • Quote: “It beat Meta’s Llama, OpenAI’s GPT 4-O and Anthropic’s Claude Sonnet 3.5 on accuracy on wide-ranging tests.”
    • Efficiency Focus: The company effectively utilized less powerful Nvidia H-800 GPUs instead of the highly sought-after H-100s, demonstrating that export controls weren’t the chokehold the U.S. intended. They achieved this through innovations in how they trained their model, which suggests the efficiency of their model may be more important than the raw compute they had available.
    • Open Source: Deepseek’s model is open-source, allowing developers to freely use and customize the technology.
    • Implications: They’ve made a dent in the assumption that developing cutting-edge AI requires billions of dollars of investment, opening the door for smaller firms to compete and to build further innovations on Deepseek’s open-source model.
    2. Shifting Perceptions of China’s AI Capabilities:
    • Rapid Catch-Up: Contrary to previous predictions that China was years behind, it has made rapid advancements. Former Google CEO Eric Schmidt acknowledges that China has caught up remarkably in the last six months.
    • Quote: “I used to think we were a couple of years ahead of China, but China has caught up in the last six months in a way that is remarkable.”
    • Innovation: Deepseek’s technical solutions, such as its Mixture of Experts architecture and floating point-8 training, demonstrate genuinely innovative capabilities, not just imitation.
    • Quote: “the reality is, some of the details in Deepseek v3 are so good that I wouldn’t be surprised if Meta took a look at it and incorporated some of that – tried to copy them.”
    • Challenging U.S. Superiority: China’s AI advancements undermine the perception of an unassailable US lead and raise the question of how wide the leading AI firms’ moat really is.
    3. The Strategic Context of the US-China AI Race:
    • U.S. Restrictions Backfire: US export restrictions, designed to slow down China’s AI development, ironically spurred innovation by forcing Chinese labs to develop more efficient approaches with limited resources.
    • Quote: “Necessity is the mother of invention. Because they had to go figure out workarounds, they actually ended up building something a lot more efficient.”
    • Geopolitical Stakes: The AI race has significant geopolitical implications, as dominance in AI could translate to economic and global leadership.
    • Concerns About Autocratic AI: There’s concern that AI models from China, which have to adhere to “core socialist values,” could promote censorship, deny human rights abuses, and filter criticism of political leaders. This raises questions about whether the AI of the future will be informed by democratic values, or whether it will be driven by autocratic agendas.
    4. Implications for the AI Industry and OpenAI:
    • Open-source Threat: The emergence of powerful, open-source models challenges the dominance of closed-source leaders like OpenAI.
    • Cost Pressure: Deepseek and similar efforts place pressure on closed-source models to justify their cost as nimbler competitors emerge.
    • Model Commoditization: LLMs are being commoditized, shifting the locus of innovation to other areas, such as reasoning capabilities.
    • OpenAI’s Strategy: OpenAI might need to pivot away from pre-training and large language models and toward different areas of innovation such as reasoning capabilities.
    • Quote: “I think they’ve already moved to a new paradigm called the o1 family of models.”
    • Brain Drain: OpenAI is experiencing a brain drain, which will make its race for AI dominance harder.
    • Money Trap: There’s the potential that AI model building is a money trap and that continued investment might not yield expected returns.
    5. The Importance of Open Source and Potential Risks:
    • Developer Migration: Developers tend to migrate to open-source models that are better and cheaper.
    • Mindshare and Ecosystem: If developers standardize on an open-sourced Chinese model, China could capture mindshare and control the surrounding ecosystem.
    • Quote: “It’s more dangerous because then they get to own the mindshare, the ecosystem.”
    • Licensing Risks: While licenses for open-source models are favorable today, they could be changed, potentially closing off access.

    The Role of Perplexity

    • Model-Agnostic Approach: Perplexity co-founder and CEO Aravind Srinivas highlights that Perplexity is model-agnostic: the company focuses on building a user experience rather than on building models itself.
    • Adoption of Deepseek: Perplexity has begun using Deepseek’s model, both through its API and by hosting it themselves, which further indicates Deepseek’s importance.
    • Monetization Strategy: Perplexity is experimenting with a novel ad model that seeks to present ads in a truthful way rather than forcing users to click on links they don’t want to.
    • Killer Application Focus: Perplexity focuses on developing applications of generative AI, rather than on the very costly challenge of model development.
    • Reasoning and Future Trends: Perplexity is focusing on the development of sophisticated reasoning agents, indicating that reasoning is the next frontier in AI, and that the age of pre-training is coming to a close.

    Conclusion:

    Deepseek’s AI breakthrough represents a significant challenge to US AI leadership and has fundamentally shifted the global AI race. The combination of its performance, efficiency, low cost, and open-source nature is forcing a reevaluation of investment strategies and technological advantages in the field. This could open a new era in which smaller organizations can compete and open-source models gain wider acceptance, even at the price of the U.S. losing its edge at the bleeding edge of AI. That shift carries risks, particularly that a Chinese entity could come to control mindshare and the ecosystem, and that the model’s license could later be restricted. It is also likely that the cost of innovation in the AI space will fall, thanks to the efficiency breakthroughs being developed in China.

    Recommendations:

    • Monitor Deepseek’s and similar Chinese AI labs’ progress closely.
    • Support American companies focused on building and innovating in the open source model space.
    • Explore new strategies that are not purely focused on model training, but rather new capabilities and applications of AI.
    • Invest in talent, research, and development to ensure competitiveness.
    • Prioritize the development of democratic AI informed by democratic values.

    This briefing provides a comprehensive overview of the key issues surrounding the rise of Deepseek and its impact on the global AI landscape. Continued monitoring of this fast-moving field is crucial.

    Deepseek’s AI Breakthrough: Impact and Implications

    FAQ: The Impact of Deepseek’s AI Breakthrough

    1. What is Deepseek and why is it significant in the AI landscape? Deepseek is a Chinese AI research lab that has developed a powerful, open-source AI model. Its significance lies in its ability to achieve performance comparable to leading American models like OpenAI’s GPT-4 and Anthropic’s Claude Sonnet, but at a fraction of the cost and time. Deepseek reportedly spent just $5.6 million and two months developing its version 3, compared to billions of dollars and years of effort by leading US AI companies. This has led many to re-evaluate the feasibility of efficiently developing cutting edge AI models and has shaken the status quo of large, costly model development.
    2. How did Deepseek manage to develop such a high-performing model with limited resources, especially given U.S. semiconductor restrictions? Deepseek’s success is largely attributed to innovative, efficient techniques and a scrappy approach driven by necessity. Because U.S. restrictions block the export of high-end GPUs like Nvidia’s H100 to China, Deepseek trained on less powerful H800 GPUs and employed techniques such as model distillation (using a large model to train a smaller one), 8-bit floating-point training, and a mixture-of-experts architecture. It also reportedly leveraged existing open-source models, data, and architectures. These methods let the lab maximize the utility of its limited resources, demonstrating that advanced AI development is not solely reliant on expensive, state-of-the-art hardware.
    3. What is meant by the term “open-source” in the context of Deepseek’s model, and why is this important? An open-source AI model, like Deepseek’s, means its code, architecture, and training weights are publicly accessible. This enables developers to freely use, customize, and build upon the model. The open-source nature of Deepseek’s model is significant because it lowers the barrier to entry for AI development, enabling smaller teams and organizations with limited capital to participate in cutting-edge AI innovation. It also means that innovation could be decentralized and accelerated through collaboration, rather than being solely in the hands of closed-source tech giants. Open-source is also very attractive to developers as it is typically less expensive and provides more flexibility.
    4. How does Deepseek’s performance compare to other leading AI models? Deepseek’s model has demonstrated impressive results in various benchmark tests, including math problems, AI coding evaluations, and bug identification. It has reportedly outperformed models such as Meta’s Llama, OpenAI’s GPT-4o, and Anthropic’s Claude Sonnet 3.5 in certain tests. Furthermore, its R1 reasoning model has shown performance comparable to OpenAI’s o1 model. This parity, especially given the significantly lower development costs, has shocked many in the AI field.
    5. How has Deepseek’s breakthrough impacted the perceived “moat” of leading AI companies like OpenAI? Deepseek’s rise has significantly challenged the notion of a technological “moat” around closed-source AI models. Before this, the assumption was that immense capital expenditure and specialized hardware were necessary to develop advanced models. The lower cost of development by Deepseek has highlighted that innovation can be achieved through efficiency and creative approaches to model training, therefore undercutting the perceived advantage of massive investment in hardware by the leading players like OpenAI. It suggests that any company claiming to be at the AI frontier today could quickly be overtaken by nimbler, more efficient competitors.
    6. What are some of the potential risks and concerns associated with the widespread adoption of Chinese open-source models like Deepseek? While the open-source nature of Deepseek has advantages, its adoption carries potential risks. Primarily, since the model was developed in China, it is subject to Chinese laws and regulations that require models to adhere to “core socialist values.” This raises concerns about potential censorship, bias, or manipulation of information within AI-generated responses. In addition, there’s a risk that the license for an open-source model could change over time, potentially limiting its use or creating proprietary lock-in for early adopters. If American developers increasingly rely on Chinese open-source models, it could undermine US leadership in AI and give China greater control of the global tech infrastructure.
    7. What does Deepseek’s emergence indicate about the future of AI development and the ongoing race between China and the U.S.? Deepseek’s emergence indicates a shift towards more efficient and cost-effective AI development practices. The necessity of overcoming hardware restrictions actually pushed China to find workarounds and creative solutions. The episode has shifted perceptions of a Chinese AI disadvantage and demonstrated that the country is capable of innovation as well as imitation. It suggests the AI race is not solely about financial investment and access to high-end hardware, but also about ingenuity and the efficient use of resources. Open source is likely to keep driving innovation, and the field will likely grow more diverse as enormous amounts of compute power become less of a prerequisite.
    8. What is Perplexity’s perspective on the implications of Deepseek’s model, and how is the company responding? Perplexity, an AI search company, acknowledges the disruptive potential of Deepseek’s open-source model. It has begun incorporating Deepseek into its services as a way to lower costs. The company sees the commoditization of large language models as a benefit and is shifting focus to applications. Perplexity’s leadership believes that the focus will shift to reasoning abilities as pre-training gets commoditized, and that these models will also improve, become cheaper, and be adopted by other companies. This means that Perplexity is looking at a future where it focuses on complex applications of AI, while utilizing the cheaper and more readily available large language models that are coming to market.

    China’s Rise in AI: Open Source, Cost-Effective, and Competitive

    China has made significant advances in the field of artificial intelligence (AI), challenging the perceived dominance of the United States [1, 2]. Here are some key points about China’s AI progress:

    • Technological breakthroughs: Chinese AI labs, such as Deepseek, have developed open-source AI models that rival or surpass the performance of leading American models like OpenAI’s GPT-4o, Meta’s Llama, and Anthropic’s Claude Sonnet 3.5 [1]. Deepseek’s models have demonstrated superior accuracy in math problems, coding competitions, and bug detection [1]. Deepseek also developed a reasoning model called R1 that outperformed OpenAI’s cutting-edge model in third-party tests [1].
    • Cost-effectiveness: Deepseek was able to build its impressive model for a fraction of the cost of American AI companies, reportedly spending just $5.6 million compared to the billions spent by companies like OpenAI, Google, and Microsoft [1]. Other Chinese companies, like Zero One Dot AI and Alibaba, have also shown the ability to produce effective models at lower costs [2]. This cost efficiency is achieved through innovative techniques such as distillation (using a large model to help a smaller model get smarter), and efficient hardware usage [3, 4].
    • Overcoming restrictions: Despite U.S. government restrictions on exporting high-powered chips to China, Deepseek has found ways to achieve breakthroughs by using less powerful chips (Nvidia’s H-800s) more efficiently, challenging the idea that the chip export controls were an effective chokehold [4]. They also achieved numerical stability in training, allowing them to rerun training on more or better data [5].
    • Open-source approach: China is leaning towards open-source AI models which are cheaper and more attractive for developers [6]. Deepseek’s model is open-source, allowing developers to customize and fine-tune it [7]. The wide adoption of these models could shift the dynamics of the AI landscape, potentially undermining U.S. leadership in AI [6].
    • Innovation, not just imitation: While it was once thought that China was merely copying existing AI technologies, Deepseek has shown real innovation in its models. For example, Deepseek has developed clever solutions to balance mixture of experts models without adding additional hacks, and it also worked out floating point-8 (FP8) training [5].
    • Implications: China’s advances in AI have several implications:
    • Increased Competition: The rapid progress of Chinese AI models increases competition for American AI companies, which have until now been seen as leaders in the field [2].
    • Potential Shift in Global AI: The adoption of Chinese open-source models could undermine U.S. leadership while embedding China more deeply into the fabric of global tech infrastructure [6].
    • Concerns about control and values: AI models built in China are required to adhere to rules set by the Chinese Communist Party and embody “core socialist values,” leading to concerns about censorship and the promotion of an autocratic AI [6].
    • Investment landscape: The success of Deepseek has raised questions about the sustainability of large spending on individual large language models and has shifted focus towards reasoning and other aspects of AI [7, 8].
    • Reasoning as the next frontier: There is a shift in focus to models that can reason and solve complex problems [7]. Although OpenAI’s o1 model has cutting-edge reasoning capabilities, researchers are finding ways to build reasoning models for much less [7]. It is expected that China will turn its attention to reasoning models [9].
    • Commoditization of models: With the open-source availability of models like Deepseek, large language models are becoming commoditized, which means that innovation will need to happen in other areas of AI [10].

    In conclusion, China’s AI advancements, particularly the emergence of cost-effective and high-performing open-source models, have significantly altered the AI landscape. This has sparked a debate about the future of AI development, competition, and the potential for a shift in global leadership in the field.

    Open-Source AI: A New Era

    Open-source AI models have become a significant factor in the current AI landscape, with the emergence of models like Deepseek’s offering a new approach to AI development [1, 2]. Here’s a breakdown of key aspects:

    • Accessibility and Cost-Effectiveness: Open-source models are generally free and accessible to the public, allowing developers to use, customize, and fine-tune them [1, 3]. This is in contrast to closed-source models, which often require significant investment to access and utilize [4]. Deepseek’s model is an example of a high-performing open-source model that is also very cost-effective [1, 5]. This means developers can build applications and conduct research without incurring the high costs associated with proprietary models [2]. The inference cost of Deepseek’s model is 10 cents per million tokens, which is 1/30th of the cost of a typical comparable model [2].
    • Rapid Development and Innovation: Open-source models enable developers to build on existing technology rather than starting from scratch [4]. This accelerates the pace of innovation, allowing for more rapid advancements in the field [1, 6]. By building on the existing frontier of AI, Deepseek was able to close the gap with leading American AI models [4]. This approach makes it significantly easier to reach the forefront of AI development with smaller budgets and teams [6].
    • Community-Driven Improvement: Open-source models benefit from a community of developers who contribute to their improvement. This collaborative approach can lead to more robust and versatile models. However, some open-source models, like Deepseek’s, are not fully transparent [7].
    • Potential Shift in AI Dynamics: The widespread adoption of powerful open-source models is changing the dynamics of AI development [6]. It could lead to a more decentralized and collaborative approach to AI, shifting power away from companies that rely on closed-source models [2]. This also puts pressure on closed-source leaders to justify their costlier models [4]. The prevailing model in global AI may shift to open-source as organizations and nations realize that collaboration and decentralization can drive innovation faster and more efficiently [2].
    • Competition and Copying: The open nature of these models can foster competition and accelerate the rate at which new models and capabilities appear [3, 4]. It has become common for companies to emulate and incorporate the innovations of others into their models [4]. It is not clear whether Deepseek trained on outputs from ChatGPT or innovated independently, since the internet is now full of AI-generated content [8, 9].
    • Concerns about Control: There are concerns about the potential for open-source models to be used for malicious purposes [2, 10]. Additionally, open-source licenses can be changed over time, meaning that a currently free and open model could become restricted in the future [2, 7].
    • Trust and Transparency: There are questions about whether to trust open-source models coming from other countries, for example, whether to trust a model from China [7, 11]. However, the ability to run an open-source model on one’s own computer gives the user control over how the model is used [7].
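
The quoted inference price gap is easiest to appreciate with a quick back-of-envelope calculation (the 10-cents and 1/30th figures come from the source; the daily token volume is an illustrative workload):

```python
def monthly_inference_cost(tokens_per_day, usd_per_million_tokens, days=30):
    """Rough monthly bill for serving a given daily token volume."""
    return tokens_per_day * days / 1e6 * usd_per_million_tokens

open_rate = 0.10              # $0.10 per million tokens (figure quoted above)
closed_rate = open_rate * 30  # "1/30th of the cost" implies roughly $3.00 per million

open_bill = monthly_inference_cost(50e6, open_rate)      # 50M tokens/day -> $150/month
closed_bill = monthly_inference_cost(50e6, closed_rate)  # same traffic -> $4,500/month
```

At scale, that 30x multiplier is the difference between a rounding error and a line item, which is why developers gravitate to the cheaper model.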

    In conclusion, open-source AI models represent a significant shift in the AI landscape, offering a more accessible, collaborative, and cost-effective approach to development. The emergence of powerful open-source models, such as those from Deepseek, is challenging the dominance of closed-source models and is sparking debates about the future of AI development, competition, and global leadership in this field [1, 2, 6].

    Cost-Effective AI: A New Paradigm

    Cost-effective AI is a significant development in the field, challenging the notion that AI development requires massive financial investment. Several sources highlight how certain organizations are achieving impressive results with significantly lower spending [1-3]. Here’s a breakdown of the key aspects of cost-effective AI:

    • Lower Development Costs: Some AI labs, particularly in China, have demonstrated the ability to develop powerful AI models at a fraction of the cost compared to their American counterparts [1, 3]. For example, Deepseek reportedly spent only $5.6 million to build their version 3 model, whereas companies like OpenAI and Google are spending billions annually [1]. Other Chinese AI companies like Zero One Dot AI have trained models with just $3 million [3]. This cost-effectiveness is a significant departure from the massive spending typically associated with AI development [1].
    • Efficient Use of Resources: Cost-effective AI development often involves finding ways to use resources more efficiently, including using less powerful hardware and optimizing training methods [2, 4]. Deepseek, for instance, built its latest model on Nvidia’s H-800 chips, which are less performant than the H-100s, while using that hardware more efficiently [2]. The lab also developed clever solutions to balance its mixture of experts model without additional hacks [5], and used floating point-8 (FP8) training, a technique that is not yet well understood, to reduce memory usage while maintaining numerical stability [6].
    • Innovative Techniques: Cost-effective AI leverages innovative techniques like distillation, where a large model is used to help a smaller model get smarter [7]. This allows for the creation of capable models without the need for massive computing resources and training costs [7]. By iterating on existing technologies, they can avoid reinventing the wheel [7].
    • Open-Source Advantage: Open-source models contribute to cost-effectiveness by making technology more accessible and shareable [8, 9]. Developers can build on existing open-source models, reducing the time and expense of developing new ones from scratch [3, 7]. This accelerates the pace of innovation and allows smaller teams with lower budgets to jump to the forefront of the AI race [3]. Deepseek’s open-source model, which is available for free, also has an inference cost of 10 cents per million tokens, which is 1/30th of what typical models charge [9].
    • Impact on the Market: The rise of cost-effective AI models is disrupting the AI market [3, 7]. Companies like OpenAI, which have invested heavily in closed-source models, face increased competition from more nimble and efficient competitors [7]. The success of cost-effective AI has raised questions about the wisdom of massive spending on individual large language models [8]; according to one source, it is turning AI model building into a “money trap” [8].
    • Shifting Investment Landscape: The emergence of cost-effective AI is causing a shift in the investment landscape. There’s now more focus on reasoning capabilities and other areas of AI, instead of just building bigger and more expensive models [8]. This change signals a shift in the AI field where creativity is as important as capital [8].
    • Necessity as a Driver: Restrictions on access to high-end chips pushed Chinese companies to innovate with limited resources, ultimately leading to more efficient solutions [4, 8]. As one source puts it, “necessity is the mother of invention” [4, 8]. By having to work with less, they were forced to find creative ways to achieve the same results [4, 8].

    In conclusion, cost-effective AI represents a significant shift in the AI landscape. It demonstrates that cutting-edge AI models can be developed with less capital through innovative techniques, efficient resource utilization, and open-source collaboration. This trend is reshaping the competitive dynamics of the AI industry and challenging the traditional model of massive investments in large language models.

    US-China AI Competition: A Shifting Landscape

    The sources highlight a dynamic and rapidly evolving landscape of AI competition, particularly between the United States and China, with other players also emerging. Here’s a breakdown of key aspects of this competition:

    • Shifting Global Leadership: The AI race is no longer solely dominated by the U.S. [1, 2]. China’s rapid advancements in AI, particularly through the development of highly efficient and cost-effective models, have positioned it as a major competitor in the field [1, 3, 4]. This challenges the previous perception that China was lagging behind by 2-3 years [1].
    • Cost-Effectiveness as a Competitive Edge: Chinese AI labs like Deepseek and Zero One Dot AI have demonstrated the ability to produce competitive models with significantly lower budgets compared to their U.S. counterparts [1, 3, 5]. This cost-effectiveness is achieved through efficient resource use, innovative techniques, and a focus on iterating on existing technology [4-7]. This challenges the notion that massive investment is necessary to achieve top-tier AI results [6, 8, 9]. The emergence of cost-effective models is also putting pressure on closed-source companies like OpenAI to justify their more expensive models [6].
    • Open-Source vs. Closed-Source Models: The rise of open-source AI models, particularly from China, is a major factor in the competition [1, 3, 10]. These models are more accessible, customizable, and cost-effective for developers [10, 11]. This challenges the dominance of closed-source models and could lead to a shift in the AI landscape where open-source becomes the prevailing model [10]. However, the open-source license could be changed by the source, and there are concerns about whether to trust open-source models from certain countries [10, 12].
    • Technological Innovation: The competition is driving rapid innovation in AI [1, 3]. Chinese companies have demonstrated innovative solutions, such as floating point-8 (FP8) training and clever balancing of mixture of experts models [5, 7]. They are also applying innovative tweaks to the available data sets [6]. American companies may start copying some of these innovations [7].
    • Reasoning as a New Frontier: The focus of AI development is shifting towards reasoning capabilities, and the competition will likely extend to this new area [8, 13]. While OpenAI’s o1 model currently leads in this area, other players are expected to catch up [13]. There are now low cost options for developing reasoning models [8].
    • Impact of U.S. Restrictions: The U.S. government’s restrictions on exporting high-end chips to China were intended to slow down their progress [2, 8]. However, these restrictions may have backfired by forcing Chinese companies to find creative solutions that have resulted in more efficient models [2, 4, 8].
    • Talent and Ecosystem: There are questions about whether the best AI talent will continue to be drawn to the pioneering companies, or whether the most efficient models and ecosystems will attract it instead [14]. Open-sourcing may give Chinese models an edge if American developers end up building on them [11].
    • Concerns about Values and Control: The competition also raises concerns about control over AI and the values that AI models promote. Chinese AI models are required to adhere to “core socialist values,” leading to concerns about censorship and the potential for autocratic AI [10].
    • Commoditization of Models: As AI models become more readily available and open-source, they are also becoming commoditized [9]. This shift means that innovation and competition will need to focus on other areas, such as real-world applications, reasoning capabilities, and multi-step analysis [14, 15].
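    The “8-bit floating-point (FP8) training” referenced above stores weights and activations with far fewer mantissa bits than standard 16- or 32-bit floats, cutting memory and bandwidth roughly in half relative to FP16. As a rough illustration only (not DeepSeek’s actual pipeline, which the sources do not detail), the following sketch emulates the precision loss of an FP8 E4M3 value by rounding the mantissa to 3 explicit bits; exponent-range clipping is omitted for brevity:

```python
import math

def quantize_fp8_e4m3(x, mantissa_bits=3):
    """Round x to a float that keeps only `mantissa_bits` of mantissa,
    emulating the precision loss of an FP8 E4M3 value.
    (Exponent-range clipping is omitted for simplicity.)"""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (mantissa_bits + 1)   # quantization grid for the mantissa
    return math.ldexp(round(m * scale) / scale, e)

weights = [0.1234567, -0.9876543, 3.1415926, -0.0078125]
quantized = [quantize_fp8_e4m3(w) for w in weights]
rel_errors = [abs(w - q) / abs(w) for w, q in zip(weights, quantized)]
# With 3 mantissa bits the worst-case relative error is about 6%,
# while storage per value drops from 32 (or 16) bits to 8.
```

    Real FP8 training also keeps master weights and certain accumulations in higher precision; this snippet only shows why 8-bit storage is a precision-for-memory trade-off.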

    In conclusion, the AI competition is intense, with a shift in the balance of power towards China, driven by its ability to produce cost-effective and high-performing models. The rise of open-source models and the focus on reasoning are reshaping the landscape, creating both opportunities and challenges for companies and nations involved in the AI race.

    The US-China AI Race

    The AI race between the US and China is a central theme in the sources, characterized by intense competition, rapid innovation, and shifting global leadership [1-3]. Here’s a breakdown of the key aspects of this competition:

    • Shifting Global Leadership: The AI race is no longer dominated solely by the US [2, 4]. China has made remarkable advancements, quickly catching up and, in some areas, surpassing the US [4, 5]. This has challenged the previous assumption that China was significantly behind the US in AI development [4].
    • Cost-Effectiveness as a Competitive Strategy: Chinese AI labs have demonstrated the ability to develop powerful AI models with significantly less capital than their American counterparts [4, 5]. For example, DeepSeek spent only $5.6 million to build its version 3 model, while US companies spend billions [5]. This cost-effectiveness is achieved through efficient resource use, innovative techniques like distillation, and by iterating on existing technology rather than reinventing the wheel [5, 6].
    • Open-Source Models: The rise of open-source AI models, particularly those from China, is a critical factor in the competition [2, 5, 7, 8]. These models are more accessible, customizable, and cost-effective for developers [5, 7]. The widespread adoption of these models could lead to a shift in the AI landscape, where open-source becomes the prevailing model [7, 8]. However, it is important to note that open-source licenses can be changed and there are questions about whether to trust open-source models from certain countries [7, 9]. DeepSeek’s model is a leading example of an open-source model that outperforms some closed-source models from the US [5].
    • Technological Innovation: The competition is driving rapid innovation in AI on both sides. Chinese companies have showcased ingenuity in areas such as 8-bit floating-point (FP8) training and clever balancing of their mixture-of-experts models, demonstrating their ability to overcome resource limitations [10, 11]. DeepSeek used Nvidia’s less performant H800 chips to build its model, showing that export controls on advanced chips did not create the chokehold that was intended [1].
    • Reasoning as the New Frontier: The focus in AI development is shifting towards reasoning capabilities, marking a new competitive area [12, 13]. While OpenAI’s o1 model leads in reasoning, other players, including China, are expected to catch up [13, 14]. Researchers at Berkeley showed that they could build a reasoning model for only $450 [12].
    • Impact of U.S. Restrictions: The U.S. government’s restrictions on exporting high-end chips to China, aimed at slowing down their progress, may have inadvertently backfired [1, 12]. These restrictions forced Chinese companies to innovate with limited resources, ultimately leading to more efficient models [2, 12].
    • Concerns about Values and Control: There are concerns about the values that AI models promote. Chinese AI models must adhere to “core socialist values,” raising concerns about censorship and the potential for autocratic AI [7]. This is a point of concern for democratic countries that seek to ensure that AI is informed by democratic values [7].
    • Competition and Copying: The sources indicate that in AI development, everyone is copying each other. For example, Google developed the transformer technology first, but OpenAI productized it [6, 15]. It is not clear whether DeepSeek copied outputs from ChatGPT, or whether it is genuinely innovative, given that the internet is full of AI-generated content [6, 11].
    • Talent and Ecosystem: It is not yet clear whether the best talent will continue to gravitate to the companies that were the pioneers, or if the most efficient models and ecosystems will attract the most talent [15]. If American developers are using Chinese open-source models, this may give China an edge [8].
    • Commoditization of Models: As AI models become more readily available and open-source, they are also becoming commoditized [14, 16]. This shift means that innovation and competition will need to focus on other areas, such as real-world applications, reasoning capabilities, and multi-step analysis [15, 16].
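    The “mixture of experts” architecture mentioned above routes each token through only a few specialist sub-networks (“experts”), so a model can hold many parameters while spending compute on only a fraction of them per token. A minimal sketch of the top-k gating step (illustrative only; the names and values are invented, and DeepSeek’s actual balancing scheme is more involved):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_routing(gate_logits, k=2):
    """Select the k experts with the highest gate scores and
    renormalize their weights. Only the selected experts run,
    so per-token compute scales with k, not with the expert count."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# One token's gate scores over 8 experts; route to the top 2.
routing = top_k_routing([0.1, 2.0, -1.0, 0.5, 1.8, -0.3, 0.0, 0.9], k=2)
```

    The “clever balancing” the sources mention refers to keeping tokens spread evenly across experts so none sits idle, typically by adding an auxiliary load-balancing term to the training loss.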

    In conclusion, the US-China AI race is a complex and multifaceted competition characterized by rapid innovation, cost-effectiveness, and the emergence of open-source models. China has closed the gap and is now a major competitor in the AI space, challenging the previous dominance of the US. The race is driving both progress and concerns about the future of AI development, including issues of control, values, and global leadership [2, 8].

    How China’s New AI Model DeepSeek Is Threatening U.S. Dominance

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • DeepSeek AI: A Wake-Up Call for the US Tech Industry

    DeepSeek AI: A Wake-Up Call for the US Tech Industry

    The emergence of DeepSeek, a low-cost, high-performing AI chatbot from a Chinese startup, has sent shockwaves through the American tech industry. DeepSeek’s surprisingly low development cost ($6 million) compared to its American competitors’ billions, coupled with its competitive performance, challenges established assumptions about AI development. This event has prompted concerns about US competitiveness and a reassessment of investment strategies, while also sparking debate over the implications of open-source AI models versus closed-source approaches. The situation highlights the intensifying global AI race and raises questions regarding data handling, bias, and the potential for protectionist reactions.

    AI Race: Deep Seek & Global Implications

    Quiz

    Instructions: Answer each question in 2-3 sentences.

    1. What is Deep Seek and why has it caused concern in the US tech industry?
    2. How did Deep Seek manage to develop its AI model at a fraction of the cost compared to US companies?
    3. What does it mean that Deep Seek’s model is “open source,” and what are the implications for data and censorship?
    4. How has the emergence of Deep Seek impacted Nvidia, a major chip manufacturer in the US?
    5. What is AGI, and why is Deep Seek’s model being seen as a potential step towards it?
    6. What is the “Stargate” project proposed by Donald Trump, and what is its goal?
    7. According to the text, how does the Chinese government’s approach to AI regulation compare to that of the US?
    8. How does Deep Seek’s approach to AI model development challenge the traditional approaches used by US companies?
    9. Besides AI, in what other technological fields is China showing significant advancement?
    10. How are the US sanctions on China potentially impacting China’s technological development in the long run?

    Quiz Answer Key

    1. Deep Seek is a Chinese AI startup that has developed a highly capable AI chatbot at a significantly lower cost than US competitors. This has caused concern because it suggests that the US dominance in AI could be challenged, and that high costs associated with AI development may not be necessary.
    2. Deep Seek was able to develop its model at a fraction of the cost by utilizing less powerful, older chips (due to US export controls) and leveraging open-source technology, which allowed for more efficient development and a different approach. This innovative process challenged the existing US industry assumptions.
    3. Being “open source” means that the code for Deep Seek’s model is publicly available, allowing others to modify and build on it, and creating more opportunities for innovation. However, the user-facing app is censored to align with Chinese regulations, which filters politically sensitive information.
    4. The emergence of Deep Seek has had a negative impact on Nvidia, as it has caused investors to reconsider the cost of the chips needed for AI, which had been the primary driver of Nvidia’s success. This led to a substantial decrease in the company’s market value, showing that expensive chips may not be necessary for cutting-edge AI.
    5. AGI, or Artificial General Intelligence, refers to an AI that can think and reason like a human being. Deep Seek’s model is seen as a step toward AGI because its ability to learn from other AIs suggests the potential for AI to improve itself, leading to a “liftoff” point where AI capabilities increase exponentially.
    6. The “Stargate” project is a $500 billion initiative proposed by Donald Trump to build AI infrastructure in the US. It aims to strengthen US competitiveness in AI, and it is a direct response to China’s advancements in the field.
    7. The Chinese government has strict regulations and laws regarding how AI models should be developed and deployed, specifically concerning how AI answers politically sensitive questions. These regulations are described as more restrictive than those in the US and in line with national security interests.
    8. Deep Seek’s approach challenges the US approach by utilizing open source technology and more efficient methods for model development. This is in contrast to most US companies which have relied on expensive and proprietary technology and the notion that AI development required large investments.
    9. Besides AI, China is also showing significant advancement in fields such as 5G technology (with companies like Huawei), social media apps (like TikTok and RedNote), electric vehicles (with brands like BYD and Nio), and nuclear fusion technology. These fields highlight China’s growing tech self-sufficiency and strategic tech goals.
    10. The US sanctions on China, intended to slow down technological advancements, may have ironically backfired. By cutting off the supply of the latest chips, the restrictions have actually forced Chinese companies to innovate and find more efficient ways to develop AI, thus accelerating their technological progress and reducing reliance on US tech.

    Essay Questions

    Instructions: Write an essay addressing one of the following prompts.

    1. Analyze the political and economic implications of Deep Seek’s emergence, considering its impact on US tech dominance and the global AI race.
    2. Explore the technological innovations and development strategies behind Deep Seek’s low-cost AI model and how it challenges established norms in the AI industry.
    3. Discuss the ethical concerns surrounding AI development and deployment, focusing on issues such as censorship, data handling, and bias in the context of Deep Seek’s model.
    4. Evaluate the potential long-term effects of US sanctions on China’s technology sector, considering their impact on global AI competition and the pursuit of self-sufficiency.
    5. Assess the role of open-source technology in the AI race and how the open sourcing of AI models such as Deep Seek can affect AI development.

    Glossary of Key Terms

    Artificial Intelligence (AI): The capability of a machine to imitate intelligent human behavior, often through learning and problem-solving.

    Artificial General Intelligence (AGI): A hypothetical type of AI that possesses human-level intelligence, capable of performing any intellectual task that a human being can.

    Open Source Technology: Software or code that is available to the public, allowing for modification, distribution, and development by anyone.

    Censorship: The suppression of words, images, or ideas that are considered objectionable, offensive, or harmful, particularly in a political or social context.

    Export Controls: Government regulations that restrict or prohibit the export of certain goods or technologies to specific countries or entities.

    Nvidia: A major US technology company that designs and manufactures graphics processing units (GPUs), which are essential for AI development.

    Deep Seek: A Chinese AI startup that developed a powerful AI chatbot at a much lower cost than its competitors.

    Stargate Project: A proposed $500 billion US initiative to build AI infrastructure, announced by US President Donald Trump.

    Liftoff: A term used in the AI context to describe a point where AI learning and development becomes exponential due to AI learning from other AI models.

    Data Bias: Systematic errors in data that can result in AI models making unfair or discriminatory decisions.

    DeepSeek: A Wake-Up Call for the AI Industry


    Briefing Document: DeepSeek AI Chatbot – A Wake-Up Call

    Executive Summary:

    The emergence of DeepSeek, a Chinese AI chatbot, has sent shockwaves through the global tech industry, particularly in the US. Developed at a fraction of the cost of its Western counterparts, DeepSeek rivals leading models like ChatGPT in performance, while using less computational power and older chip technology. This breakthrough challenges long-held assumptions about AI development and has sparked debate about competition, open-source technology, and the future of AI dominance. The situation is further complicated by the fact that the model is open-source while the user app is heavily censored in its responses.

    Key Themes and Ideas:

    1. Disruption of the AI Landscape:
    • DeepSeek’s emergence has disrupted the established AI landscape, where US tech giants have historically dominated.
    • The cost-effectiveness of DeepSeek’s development challenges the belief that expensive, cutting-edge hardware and massive investment are necessary to create top-tier AI models. As Daniel Winter states, “it proves that you can train a cutting-edge AI for a fraction of a cost of what the latest American models have been doing.”
    • Stephanie Harry adds, “Until really about a week ago most people would have said that AI was a field that was dominated by the United States as a country and by very big American technology companies as a sector we can now safely say that both of those assumptions are being challenged.”
    2. Cost-Efficiency and Innovation:
    • DeepSeek was developed for a reported $6 million, a fraction of the hundreds of millions spent by US companies like OpenAI and Google. Lisa Soda remarks that this low cost “made investors sit up and panic.”
    • DeepSeek achieved this using older chips: unable to obtain the latest hardware because of US export controls, the team had to innovate and optimize for efficiency. As Harry stated: “That design constraint meant that they had to innovate and find a way to make their models work more efficiently…necessity is the mother of invention.”
    • This cost-effectiveness challenges US AI companies’ assumptions that more resources and the latest hardware always translate to better AI. According to Harry: “for them they didn’t have to focus on being efficient in their models because they were just doing constantly to be bigger.”
    3. Open Source vs. Closed Source:
    • DeepSeek’s model is open source, which means its code can be accessed, used, and built upon by others, while most US companies, with the exception of Meta, use closed-source technology. This model promotes collaboration and potentially faster innovation globally. According to Harry: “they have opened up their code, developers can take a look, experiment with it and build on top of it and that is really what you want in the long-term race for AI, you want your tools and your standards to become the global standards.”
    • This contrasts with the closed-source model favored by many US companies, where the internal workings of their technology are kept private. The US approach has created a perception of the country trying to build “walls around itself” while China seems to be “tearing them down”, as M. Jang observes.
    4. The “Lift Off” Moment:
    • The ability of DeepSeek’s model to learn from other AI models, combined with open-source access, raises the possibility of “liftoff” in the AI industry, where models improve rapidly. As Winter said: “once you get AIs learning from AIs they can improve on themselves and each other and basically you’ve got what they call liftoff in the AI industry.”
    • This could lead to dramatic advancements at an accelerated rate.
    5. US Tech Industry Reaction:
    • The emergence of DeepSeek has caused major market disruptions, most notably the nearly $600 billion loss in market value for chip giant Nvidia.
    • Donald Trump has called the release of DeepSeek a “wake-up call” for US tech companies, underscoring the need for America to be “laser focused” on competing to win.
    • Experts suggest that the US tech industry may have become complacent and that this new competition will drive innovation and healthy competition.
    6. Data Censorship and Political Implications:
    • While the DeepSeek model itself is open-source and uncensored once downloaded directly, the DeepSeek app and website are subject to Chinese government censorship. Users of the app will receive filtered information and cannot inquire about politically sensitive topics like the Tiananmen Square Massacre. This demonstrates that the application of AI is still subject to political influence.
    • China’s AI laws and regulations are far stricter than Western ones, especially concerning output. As Lisa Soda mentions, the models can’t really answer “questions that might pose a threat to national security or the social order in China”.
    7. Geopolitical Implications:
    • The development of DeepSeek is viewed as a significant step in China’s strategy of technological self-sufficiency.
    • This strategy has deep roots, as Professor Jang states, noting “China has long believed in technological self-sufficiency”. China is working not to be dependent on Western technology in many key areas.
    • The success of DeepSeek may have inadvertently resulted from US export controls, forcing Chinese companies to innovate. M. Jang notes “US sanctions may have backfired”.

    Quotes of Significance:

    • Daniel Winter: “They’re rewriting the history books now as we speak because this model has changed everything.”
    • Stephanie Harry: “That design constraint meant that they had to innovate and find a way to make their models work more efficiently.”
    • Lisa Soda: “it is estimated that the training was around $6 million US dollars, which, compared to the hundreds of millions of dollars that companies are currently putting into these models, is really just a tiny fraction”.
    • M. Jang: “The US is building up its walls around itself China seems to be tearing them down”
    • Donald Trump: “The release of deep seek AI from a Chinese company should be a wakeup call for our industries.”

    Conclusion:

    DeepSeek’s emergence is not just another tech story; it’s a potential paradigm shift in the AI industry. Its success in developing a competitive model at a fraction of the cost of its Western counterparts, combined with its open-source nature, challenges established norms. While questions remain about censorship and political influence, the impact of DeepSeek is clear. It is a “wake up call” for the US tech industry, showing that innovation and access are not solely reliant on vast resources and cutting-edge hardware. It underscores that the AI race is truly global, and the future of AI is far from settled.

    DeepSeek AI: A New Era in Artificial Intelligence

    FAQ: DeepSeek AI and the Shifting Landscape of Artificial Intelligence

    1. What is DeepSeek AI and why is it causing so much buzz in the tech industry? DeepSeek is a Chinese AI startup that has developed a new AI chatbot that rivals leading platforms like OpenAI’s ChatGPT at a significantly lower cost, reportedly around $6 million. This has shocked the industry, especially US tech giants that have invested billions in AI, as it demonstrates that cutting-edge AI can be trained for a fraction of the previous cost. It has also disrupted the AI landscape by using older chips and open-source technology, challenging the dominance of expensive, closed-source models. The app became the most downloaded free app in the U.S., shaking the markets and prompting a significant drop in the value of Nvidia.
    2. How did DeepSeek manage to create such a powerful AI model for so little money? Several factors contributed to DeepSeek’s cost-effectiveness. First, they were forced to innovate due to US export controls restricting access to the newest chips. They managed to use less powerful but still capable older chips to achieve their breakthrough. Second, they built their model using open-source technology and distilled their model for greater efficiency, which contrasts with the closed-source approach of many US companies. This allowed them to reduce costs while maintaining high performance, proving that expensive hardware and proprietary code are not always necessary for advanced AI. This “necessity is the mother of invention” approach highlights that design constraints can force innovation.
    3. What does the emergence of DeepSeek mean for the AI competition between the US and China? DeepSeek’s emergence has significantly challenged the US’s assumed dominance in AI. It shows that China is not only capable of creating powerful AI models, but also doing so with greater efficiency. This has led to a reevaluation of the investments being made by American tech companies and the overall strategy for AI development. The US is now faced with the reality of a strong competitor, potentially needing to shift from a focus on bigger and more expensive models towards more efficient methods. Also the open source nature of DeepSeek challenges the US tendency to build closed systems.
    4. How does DeepSeek’s model compare to other AI chatbots like ChatGPT in terms of performance and capabilities? DeepSeek is comparable in performance to models like ChatGPT, with the capability to reason through problems step-by-step like humans. According to experts, DeepSeek is on par with the best Western models, and in some cases, may even perform slightly better. This demonstrates a significant advancement in Chinese AI technology. While it may have some bugs, this is common in all new AI models, including those from the US. The significant difference lies in the development costs and efficiency of DeepSeek.
    5. What are the data privacy and censorship concerns associated with DeepSeek? There are significant data privacy and censorship concerns related to DeepSeek, especially its app. If users download the DeepSeek app they will receive censored information regarding events like the Tiananmen Square massacre and any other topics considered sensitive by the Chinese government. However, the actual AI model itself is open-source and can be downloaded and used without such censorship. This means that individuals and businesses can develop their own applications using the model, but users may receive a very filtered and biased version of information if using the app directly.
    6. How does DeepSeek’s open-source approach differ from most US tech companies’ AI strategies? DeepSeek’s open-source approach is a significant departure from the more proprietary, closed-source strategies used by most US tech companies (except for Meta). By making their code available, DeepSeek is allowing for greater collaboration, experimentation, and innovation within the global tech community. This is a key aspect of China’s AI strategy, aiming for their tools and standards to become global standards and for innovation to proceed at a much faster rate by fostering this collaborative nature. This contrasts sharply with the US focus on protecting intellectual property and maintaining a more closed and controlled approach.
    7. What impact could DeepSeek have on the future direction of AI development and investment? DeepSeek’s success has profound implications for the future of AI development. It demonstrates that AI advancements do not necessarily require massive investments or reliance on the most cutting-edge hardware. This may lead to a more diverse and competitive landscape, with smaller players entering the market, as it lowers the barrier to entry. It could also push companies to focus on developing more efficient and cost-effective AI models, shifting the emphasis from big and expensive models to more practical and sustainable approaches. This has already caused a re-evaluation of companies like Nvidia and a shock to the market.
    8. What are the potential long-term implications of China’s advancements in AI, as exemplified by DeepSeek? China’s advancements in AI, particularly the open-source and low-cost nature of models like DeepSeek, reinforce its commitment to technological self-reliance. In the long term, this could establish a new paradigm in technology development, moving away from reliance on Western tech, as well as showing the power of open source in driving innovation. This could result in a shift in the global balance of power, not only in technology but also in geopolitics. The open-source model is an attempt to establish Chinese standards as global standards. This may also force the US to reconsider its protectionist approach, as it may be hurting itself in the long run.
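    The “distillation” mentioned in answer 2 above trains a small, cheap student model to match a large teacher model’s full output distribution, softened by a temperature, rather than only hard labels. A toy sketch of the soft-target loss (a generic illustration of the technique; the sources do not describe DeepSeek’s exact recipe, and all names here are invented):

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=3.0):
    """Cross-entropy between the teacher's temperature-softened
    distribution and the student's: the student learns the teacher's
    relative preferences across all classes, not just its top answer."""
    p = softmax(teacher_logits, temperature)  # soft targets
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
aligned = distillation_loss(teacher, [3.9, 1.1, 0.1])     # student agrees
mismatched = distillation_loss(teacher, [0.0, 4.0, 0.0])  # student disagrees
```

    A student that matches the teacher gets a lower loss, and minimizing that loss is what transfers the large model’s behavior into the small one at a fraction of the training cost.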

    Deep Seek: China Challenges US AI Dominance

    The sources discuss the competition in the AI industry, particularly between the United States and China, and how a new Chinese AI model called Deep Seek is challenging the existing landscape. Here’s a breakdown:

    • Deep Seek’s Impact: Deep Seek, a Chinese AI startup, has developed an AI chatbot that rivals those of major US companies, but at a fraction of the cost [1-4]. This has shocked the tech industry and investors [1-3, 5].
    • Cost Efficiency: Deep Seek’s model was developed for approximately $6 million, compared to the hundreds of millions spent by US companies [1, 4, 5]. They achieved this by using less powerful, older chips (due to US export bans), and by utilizing open-source technology [2, 3, 5]. This challenges the assumption that cutting-edge AI requires the most expensive and advanced hardware [2, 5].
    • Open Source vs. Closed Source: Deep Seek has made its AI model open source, allowing developers to experiment and build upon it [3, 6]. This contrasts with most US companies, with the exception of Meta, which use closed source technology [3]. The open-source approach has the potential to accelerate the development of AI globally [3, 6].
    • Challenging US Dominance: The emergence of Deep Seek is challenging the US’s perceived dominance in the AI field [3]. It’s forcing American tech companies and investors to re-evaluate their strategies and investments [3]. The US might have been complacent with the “Magnificent Seven” companies that had unconstrained access to resources [4].
    • AGI and Liftoff: There’s a suggestion that AI is approaching AGI (Artificial General Intelligence), where AI can learn from other AI and improve upon itself [2]. This is referred to as “liftoff” in the AI industry [2].
    • US Reactions: The release of Deep Seek has been seen as a “wake up call” for the US [1, 7]. President Trump has called for the US to be “laser-focused on competing to win” in AI [1]. Some analysts suggest that US sanctions might have backfired, accelerating Chinese innovation [8, 9].
    • Chinese Tech Strategy: The development of Deep Seek aligns with China’s strategy of technological self-sufficiency [8]. China has been working towards this for decades, including in other tech areas such as 5G, social media, and nuclear fusion [8]. The fact that Deep Seek is open source is a significant departure from the US model [8].
    • Data and Bias: While the Deep Seek app censors information, the model itself is uncensored and can be used freely [6]. This opens up the possibility for companies worldwide to use and build on the model [6].
    • Global Competition: Competition in the AI sector is a global phenomenon, and breakthroughs can come from unexpected places [9]. The focus shouldn’t be on a US versus them mentality, but rather on learning from others [9].
    • Impact on the AI Industry: The emergence of Deep Seek is lowering the barrier to entry in the AI market, allowing more players to enter [5]. It remains unclear how the AI industry will be impacted, given that the industry is changing rapidly [5].

    In summary, the sources paint a picture of an increasingly competitive AI landscape where the US is facing a strong challenge from China. Deep Seek’s model, developed with fewer resources and using open-source technology, is forcing a re-evaluation of existing assumptions about AI development and the role of different countries and technologies in the AI race.

    Deep Seek: A Chinese AI Chatbot Disrupts the Global AI Landscape

    The sources provide considerable information about the Deep Seek chatbot, its impact, and the implications for the AI industry [1-9]. Here’s a comprehensive overview:

    • Development and Cost: Deep Seek is a Chinese AI chatbot developed by a startup of the same name [1]. What’s remarkable is that it was developed for around $6 million, a tiny fraction of the hundreds of millions of dollars that US companies typically invest in similar models [1, 6]. This cost-effectiveness has shaken the tech industry [1, 6].
    • Technological Approach:
    • Chip Usage: Deep Seek managed to create its model using less powerful, older chips, due to US export bans that restricted their access to the most advanced chips [2, 4]. This constraint forced them to innovate and develop more efficient models [4].
    • Open Source: The company built its technology using open-source technology, allowing developers to examine, experiment, and build upon their code [4]. This is in contrast to most US companies that use closed-source technology, with the exception of Meta [4]. The open-source nature of the model allows for global collaboration and development [3, 4, 8].
    • Performance and Capabilities:
    • Sophisticated Reasoning: Deep Seek’s model demonstrates sophisticated reasoning chains, which means it thinks through a problem step by step, similar to a human [5, 7].
    • Comparable to US Models: The chatbot is considered to be on par with some of the best models coming out of Western countries, including those from major US companies, like OpenAI’s ChatGPT [4, 5, 7].
    • Efficiency: Deep Seek’s models are also more efficient, requiring less computing power than many of its counterparts [7].
    • Impact on the AI Industry:
    • Challenging US Dominance: Deep Seek’s emergence is challenging the perceived dominance of the US in the AI sector [4]. It has caused US tech companies and investors to re-evaluate their strategies and investments [4, 5]. It has been described as a “wake-up call” for the US [1, 8].
    • Lowering Barriers to Entry: The fact that a high-performing AI model was developed at a fraction of the cost has lowered the barrier to entry in the AI market, potentially allowing more players to participate [6].
    • Re-evaluation of Existing Assumptions: Deep Seek has challenged the assumption that cutting-edge AI development requires the most advanced and expensive technology and that it must be built using closed-source software [2, 4, 6].
    • Competition and Innovation: The competition that Deep Seek is bringing to the AI sector is considered healthy [5]. The company’s success is seen as a sign that breakthroughs can come from unexpected places [9]. It has been noted that the US might have been too complacent with the “Magnificent Seven” companies that have been leading the AI sector and not focused on efficient models [5].
    • Censorship and Data Handling:
    • App vs. Model: It’s important to distinguish between the Deep Seek app and the underlying AI model. The app censors information on politically sensitive topics, particularly those related to China, like Tiananmen Square or any negative aspects of Chinese leadership [3, 6].
    • Uncensored Model: However, the model itself is uncensored and can be downloaded and used freely [3]. This means that companies worldwide can potentially use and build upon this model [3].
    • Political and Geopolitical Implications:Technological Self-Sufficiency: Deep Seek’s development aligns with China’s strategy of technological self-sufficiency, which has been a long-term goal for the country [8].
    • US Reaction: The US has seen Deep Seek as a competitive threat, and there have been calls for a “laser focus” on competing in the AI sector [1, 8]. Some analysts suggest that US sanctions have backfired, accelerating China’s innovation [8, 9].
    • Global Competition: The sources emphasize that the AI competition is a global phenomenon and that breakthroughs can come from unexpected places [9]. Instead of a US vs. them mentality, there is much to be gained by learning from others [9].
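The “sophisticated reasoning” point above describes models that emit an explicit chain of intermediate steps before committing to an answer. A rough, hypothetical illustration of consuming such output (the “Step n:” / “Final answer:” markers are invented for this sketch, not DeepSeek’s actual format):

```python
import re

def extract_final_answer(transcript: str) -> str:
    """Return the text after the last 'Final answer:' marker in a
    reasoning-style transcript (toy format, for illustration only)."""
    matches = re.findall(r"Final answer:\s*(.+)", transcript)
    if not matches:
        raise ValueError("no final answer found")
    return matches[-1].strip()

# A made-up reasoning chain of the kind described above.
chain = (
    "Step 1: Start with 17 apples; give away 5, leaving 12.\n"
    "Step 2: Buy 8 more: 12 + 8 = 20.\n"
    "Final answer: 20"
)
print(extract_final_answer(chain))  # prints: 20
```

Real reasoning models expose their chains in model-specific ways; the point of the sketch is only that the visible step-by-step trace and the final answer are separable pieces of the output.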

    In conclusion, DeepSeek’s chatbot is a significant development in the AI landscape. Not only is it a high-performing model, but its cost-effectiveness and open-source nature are also causing a re-evaluation of existing assumptions about AI development and the competitive landscape.

    Low-Cost AI: DeepSeek and the Future of AI Development

    The sources highlight the emergence of low-cost AI as a significant development, primarily through the example of the Chinese AI startup DeepSeek and its chatbot [1]. Here’s a breakdown of the key aspects:

    • DeepSeek’s Breakthrough: DeepSeek developed a sophisticated AI chatbot that rivals those of major US companies at a fraction of the cost [1, 2]. This achievement challenges the assumption that cutting-edge AI development requires massive financial investment [3].
    • Cost Efficiency:
    • Development Cost: The DeepSeek AI model was developed for approximately $6 million, compared to the hundreds of millions of dollars that US companies typically spend [1, 3]. This difference is a major factor behind the shock in the tech industry [1].
    • Efficient Resource Use: DeepSeek achieved this cost efficiency by using less powerful, older chips and by taking an open-source approach [2, 4].
    • Distillation of Models: DeepSeek has used distillation techniques to create more efficient approaches in both the training and inference stages [3].
    • Challenging Assumptions: The low cost of DeepSeek’s model has challenged prevailing assumptions about AI development in several ways:
    • Hardware Requirements: It demonstrates that high-performing AI doesn’t necessarily require the most expensive and advanced hardware [4]. The fact that DeepSeek could build its model using less powerful chips is a major revelation [2, 4].
    • Open vs. Closed Source: DeepSeek’s use of open-source technology, rather than closed source, has also challenged the idea that AI development must be proprietary [2].
    • Barriers to Entry: The fact that DeepSeek built a sophisticated AI model for so little money has lowered the barrier to entry in the AI market [3]. It suggests that more players can now participate in AI development, potentially democratizing access to the technology [3].
    • Impact on the AI Industry:
    • Re-evaluation: The success of DeepSeek has forced the US and other players to re-evaluate their strategies and investments in AI [2, 5].
    • Competition: The emergence of low-cost AI models is intensifying competition in the AI sector [1, 6]. This has been noted as a positive development because it can force companies to focus on efficiency rather than relying on large amounts of funding [5].
    • Open Source Acceleration: DeepSeek’s open-source model has the potential to accelerate AI development globally, as it enables collaboration and innovation [2, 4].
    • Global Implications:
    • Technological Self-Sufficiency: China’s development of low-cost AI is seen as part of its broader strategy of technological self-sufficiency and of reducing its reliance on Western technology [6].
    • Potential for Other Countries: The possibility that models can be built at lower cost opens opportunities for other countries, including those in Europe, to develop their own AI models [4, 7].
    • Global Benefit: Rather than an “us versus them” scenario, the sources suggest that the world has much to gain from global AI competition, with breakthroughs coming from unexpected places [6, 8].
    • Censorship and Data Handling: While the DeepSeek app censors information, the underlying model is uncensored [7]. This means that even though the average user receives filtered information, the model itself can be used by companies and developers globally.

    In summary, the sources present low-cost AI as a disruptive force in the industry, challenging established norms and assumptions and significantly changing the competitive landscape. DeepSeek’s model demonstrates that cutting-edge AI can be developed at a fraction of the cost previously assumed, using more efficient methods and open-source technology. This development has significant implications for the future of AI and the way it is developed and deployed globally.

    DeepSeek: A Wake-Up Call for US AI

    The sources describe the reaction of the US tech industry to the emergence of DeepSeek’s AI chatbot as one of shock, concern, and a need for re-evaluation [1-5]. Here’s a breakdown of the key aspects of that reaction:

    • Wake-up call: The release of DeepSeek has been widely characterized as a “wake-up call” for the US tech industry [1, 5]. It has forced American companies and investors to recognize that their dominance in AI is being challenged by a Chinese competitor that has developed a comparable model at a fraction of the cost [1, 3, 5].
    • Re-evaluation of strategies and investments: DeepSeek’s low-cost AI model has led to a re-evaluation of strategies and investments in the US tech sector. The sources suggest that the US may have been too intent on pouring massive amounts of money into AI development without prioritizing efficient models, and may have become complacent with the “Magnificent Seven” companies that were leading the AI sector [3, 4].
    • Market impact: The news of DeepSeek’s AI capabilities has significantly impacted the stock market, with Nvidia, a major maker of AI chips, experiencing a massive loss in market value [1, 2]. This is because DeepSeek has demonstrated that cutting-edge AI can be built using less powerful, cheaper hardware [2, 3]. This suggests that the projections and valuations of companies involved in AI may have to be revised to account for the possibility of low-cost AI alternatives [2].
    • Challenging assumptions: The US tech industry is having to confront the fact that its previous assumptions about AI development are being challenged. The beliefs that high-performing AI requires the most expensive and advanced hardware, and that it must be developed using closed-source software, are being questioned [2, 3, 6]. The fact that a Chinese company developed a very sophisticated AI model for around $6 million has been a major shock to US companies that have invested hundreds of millions of dollars in AI development [1, 6].
    • Competition and innovation: The emergence of DeepSeek is seen as a catalyst for healthy competition in the AI sector [3, 4]. The US now faces a strong competitor and has to “be laser-focused on competing to win” [1]. This competition could lead to further innovation and different approaches to AI development that might benefit the world [7].
    • Open Source vs. Closed Source: The fact that DeepSeek is open source, in contrast to the proprietary approach of most US companies, is a significant point of discussion [3]. There is a suggestion that US companies may have to consider making their own models open source to accelerate scientific exchange in the US [2].
    • US Government response: The sources mention that President Trump has called the emergence of DeepSeek a “wake-up call” [1]. Trump has also announced a $500 billion project to build AI infrastructure, which could be a reaction to this development [1, 3].
    • Possible protectionist reactions: There is some speculation about the possibility of protectionist reactions from the US, but one source argues that “a zero-sum, I-win-you-lose Cold War mentality is really unproductive” [8].

    In summary, the US tech industry’s reaction to DeepSeek’s AI chatbot is one of concern and a realization that it needs to adapt to a new, more competitive AI landscape. The low-cost AI model has challenged existing assumptions about technology development and is forcing US companies to rethink their strategies, investments, and approaches to AI innovation.

    DeepSeek: Redefining AI Development

    The sources offer a detailed perspective on AI development, particularly in light of the emergence of DeepSeek and its low-cost AI model. Here’s a comprehensive discussion:

    • Cost of Development: The most significant aspect of recent AI development, highlighted by DeepSeek, is the dramatic reduction in cost. DeepSeek developed a sophisticated chatbot for approximately $6 million, a fraction of the hundreds of millions typically spent by US companies [1, 2]. This development has challenged the assumption that cutting-edge AI requires massive financial investment [2].
    • Efficient Resource Use: DeepSeek’s cost-effectiveness stems from a few key factors:
    • Older Chips: They utilized less powerful, older chips, in part due to US export restrictions, demonstrating that advanced hardware is not necessarily essential for cutting-edge AI [3, 4].
    • Open Source: DeepSeek’s open-source approach to development contrasts with the closed-source approach used by most US companies [4]. The open-source strategy allows for community contribution and can potentially accelerate innovation.
    • Model Distillation: They employed techniques to distill the model, making it more efficient during both the training and inference stages [2].
    • Challenging Conventional Wisdom: DeepSeek’s success has challenged several conventional assumptions in AI development [2]:
    • Hardware Dependence: The notion that high-performing AI requires the most advanced and expensive hardware is being questioned [3, 4].
    • Proprietary Models: The idea that AI development must be proprietary is being challenged by DeepSeek’s open-source model [4].
    • High Barriers to Entry: The development of a sophisticated AI model for just $6 million has lowered the barrier to entry in the AI market, suggesting that more players can now participate in AI development [2].
    • Impact on the AI Industry:
    • Re-evaluation: DeepSeek’s emergence has prompted a re-evaluation of strategies and investments in the US and elsewhere [4, 5].
    • Competition: The increased competition is seen as a positive force that will drive innovation and efficiency in the industry [5].
    • Global Development: DeepSeek’s open-source model may facilitate faster AI development globally by enabling collaboration and building on existing work [4].
    • Technological Self-Sufficiency: China’s development of DeepSeek is part of its strategy for technological self-sufficiency. China has long strived for technological independence [6]. The sources note that China is quickly catching up, and even pulling ahead, in several advanced technology areas [6].
    • Open Source vs. Closed Source:
    • DeepSeek’s Approach: DeepSeek’s open-source model allows developers to examine it, experiment with it, and build upon it [4].
    • US Approach: Most US companies use closed-source technology, with the exception of Meta [4]. It has been suggested that the US might need to adopt open-source strategies to accelerate development [3].
    • US Reaction:
    • Wake-up Call: DeepSeek is viewed as a “wake-up call” for the US tech industry [1, 4].
    • Investment Reassessment: There is a need for US companies to be “laser-focused on competing to win” [1] and to re-evaluate their investments and strategies [4].
    • Competition: It’s seen as a healthy challenge that could lead to more innovation and different approaches to AI development [5].
    • Global Competition: The sources make clear that AI development is now a global competition, with the potential for breakthroughs to occur in unexpected places [7]. Rather than an “us versus them” mentality, the world has much to gain from global collaboration and competition [7].
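The “Model Distillation” point above refers to a standard technique in which a smaller student model is trained to match the temperature-softened output distribution of a larger teacher. The sources do not detail DeepSeek’s exact recipe, so the sketch below shows only the generic core of the idea, with invented toy logits and no real model:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 yields softer targets that
    expose the teacher's relative preferences between classes."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): the quantity a student minimizes against teacher targets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [4.0, 1.0, 0.5]   # hypothetical teacher outputs for one input
student_logits = [3.0, 1.5, 0.2]   # hypothetical student outputs for the same input

p_teacher = softmax(teacher_logits, temperature=2.0)
p_student = softmax(student_logits, temperature=2.0)

# The distillation loss for this example; 0 would mean the student
# reproduces the teacher's distribution exactly.
loss = kl_divergence(p_teacher, p_student)
print(round(loss, 4))
```

In a real pipeline this loss is averaged over many inputs and minimized by gradient descent on the student’s weights; the efficiency gain comes from the student being much smaller than the teacher.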

    In conclusion, the sources show that the landscape of AI development is changing rapidly. The emergence of low-cost models like DeepSeek is forcing a re-evaluation of established norms. The focus is shifting toward more efficient development, open-source models, and a global approach to innovation. The future of AI increasingly looks like a global competition with lower barriers to entry and the possibility of new and unexpected players leading the way [2].

    Chinese AI app DeepSeek shakes tech industry, wiping half a trillion dollars off Nvidia | DW News

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Global Markets and Economic Outlook

    Global Markets and Economic Outlook

    This Bloomberg television segment features discussions on several key economic and financial topics. Market analysts weigh in on the impact of the Federal Reserve’s decisions, the implications of a potential probe into a Chinese AI startup’s data practices, and the outlook for the tech sector. Investment strategists at BlackRock offer their perspective on global market trends, emphasizing the importance of selectivity and diversification within portfolios. Further segments examine the growing private markets sector, particularly the opportunities for wealth management, as well as the potential effects of President Trump’s policies on various sectors, including energy and commodities. Finally, the impact of LVMH’s performance on the luxury goods market is analyzed.

    Financial News Analysis Study Guide

    Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What is the central accusation against the Chinese AI startup DeepSeek, and what technology does the allegation involve?
    2. How did the market initially view DeepSeek’s AI model development, and what potential evidence could challenge that view?
    3. Why was ASML’s earnings beat significant for the tech sector, and what product of theirs is driving this demand?
    4. According to Ursula from BlackRock, what three factors support U.S. economic exceptionalism, and which one is facing the most current scrutiny?
    5. What is BlackRock’s view of European markets and where are they seeing investment opportunities?
    6. How are wealthy individuals in Europe increasingly viewing private markets, and what is driving that perspective?
    7. What is the regulatory perspective in France regarding investor access to private market opportunities?
    8. How does Jeff Currie characterize the current state of oil production in the U.S., and what is the relationship between oil, gas and liquids?
    9. According to Jeff Currie, what are the three main market drivers to watch, and how is the current supply chain fragility impacting the energy market?
    10. Why are investors currently favoring real assets and what happened in late 2022 to change investment strategies?

    Answer Key

    1. The accusation against DeepSeek is that they may have used “distillation,” accessing the OpenAI API to scrape data beyond what is allowed, essentially building their model on OpenAI’s. This involves accessing and utilizing OpenAI’s data without proper authorization.
    2. The market initially viewed DeepSeek as an impressive startup that built a model comparable to OpenAI’s on a very limited budget and without the latest GPUs, but evidence that it got a head start by scraping data from OpenAI’s API would undermine that view.
    3. ASML’s earnings beat reassured the tech sector and signaled a rebound, with strong demand for its $300 million chipmaking machines, which are essential for producing advanced chips, particularly for AI.
    4. The three arrows that support U.S. exceptionalism are strong economic growth, sticky inflation, and tech leadership. The technology sector’s power is currently facing the most scrutiny.
    5. BlackRock is taking a contrarian view of European markets and has seen some clients warming up. They prefer quality spreads with European rates and Euro high yields.
    6. Wealthy European individuals are looking to diversify into private markets to access new opportunities and to move away from traditional liquid assets, with allocations potentially rising to 50%.
    7. French regulators have recognized the benefits of giving investors access to private-market opportunities and the need for investors to move beyond just public fixed income and equities into longer-term investments.
    8. Jeff Currie says U.S. oil production growth is slow and is not keeping pace with demand. The US is producing more gas and liquids than oil, which limits oil-output growth.
    9. The three drivers are supply-chain fragility, low inventories, and the dollar. Supply chains are fragile, with evidence of supply issues, particularly in energy and renewables.
    10. Investors now prefer real assets because the market has changed, particularly after the cost of capital went up. Zero interest rates had allowed them to leverage both bonds and equities, but investors are now making choices shaped by the pressures of underinvestment.

    Essay Questions

    Instructions: Develop a well-structured essay response to each of the following questions.

    1. Analyze the interplay between technological innovation (specifically in AI and chip manufacturing), market dynamics, and geopolitical tensions as reflected in the news excerpts. How do these factors interact to shape investment strategies and industry outlooks?
    2. Discuss the shift in investor focus from traditional public markets to private markets and real assets, including the drivers behind this change and the challenges and opportunities it presents for wealth management.
    3. Explore the Trump administration’s policies and their potential effects on both domestic and international markets, including tariffs, spending freezes, and energy sector initiatives. How do these actions align with or diverge from established economic practices?
    4. Evaluate the energy market conditions, including oil production, global demand, and the potential impact of AI and data center energy needs. How do these factors create vulnerabilities and influence investment decisions in the energy sector?
    5. Analyze how the concept of energy transition is being impacted by new geopolitical considerations, regulatory shifts, and market factors. How do those considerations influence the pace and priorities of energy transition efforts in the US and Europe?

    Glossary of Key Terms

    • AI Chip Making: The design and manufacturing of specialized integrated circuits (chips) optimized for artificial intelligence applications.
    • API (Application Programming Interface): A set of rules and specifications that software programs can follow to communicate with each other.
    • Distillation (in AI Context): Training one model on the outputs of another; in this context, the alleged practice of accessing a large language model’s API to extract large amounts of output data, often beyond permitted use.
    • U.S. Exceptionalism: The belief that the United States is unique or different from other countries, particularly regarding economic strength.
    • S&P Equal Weight: A stock market index where each company’s stock is given the same weight, rather than weighted by market cap.
    • MAG Seven (“Magnificent Seven”): Refers to seven high-performing tech stocks – Microsoft, Apple, Alphabet (Google), Amazon, Nvidia, Tesla, and Meta.
    • ECB (European Central Bank): The central bank of the Eurozone countries, responsible for monetary policy.
    • Quality with Carry: An investment strategy that seeks high-quality fixed income investments that also offer a positive carry (income).
    • Alpha: A measure of risk-adjusted performance; how well an investment performs above or below a specific benchmark.
    • Granularity: The level of detail or specificity, particularly in investment strategies or market analysis.
    • High Net Worth Individuals: Individuals with a large amount of assets or money.
    • 60/40 Portfolio: A traditional investment allocation in which 60% of the portfolio is invested in stocks and 40% is invested in bonds.
    • Private Markets: Markets where investments, such as private equity or real estate, are not publicly traded on exchanges.
    • Alternative Investments: Assets that are not traditional stocks, bonds or cash, such as private equity, real estate and commodities.
    • Real Assets: Tangible or physical assets such as real estate, infrastructure and commodities.
    • OPEC (Organization of the Petroleum Exporting Countries): A group of countries that coordinate oil production and pricing policies.
    • Time Spreads: The price difference between contracts for different delivery dates, often in commodities markets.
    • Grid (Power Grid): The interconnected network for delivering electricity from suppliers to consumers.
    • Supply Chain Fragility: The susceptibility of supply chains to disruptions, including geopolitical tensions, weather events or unforeseen supply/demand issues.
    • Leverage: The use of borrowed capital to increase the potential return on investment.
    • P/E Ratio (Price to Earnings Ratio): A valuation ratio that compares a company’s stock price to its earnings per share.
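Two of the quantitative terms above lend themselves to a quick worked example. The numbers below are invented purely for illustration:

```python
# P/E ratio: share price divided by earnings per share (toy numbers).
price_per_share = 150.0
earnings_per_share = 6.0
pe_ratio = price_per_share / earnings_per_share
print(pe_ratio)  # prints: 25.0

# 60/40 portfolio: the blended return is the weighted average of the sleeves.
stock_return = 0.10  # assumed 10% equity return
bond_return = 0.04   # assumed 4% bond return
blended_return = 0.60 * stock_return + 0.40 * bond_return
print(round(blended_return, 3))  # prints: 0.076, i.e. 7.6%
```

A high P/E means investors pay more per dollar of earnings; the 60/40 blend illustrates why that allocation historically smoothed returns at the cost of some upside.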

    Global Market and Economic Trends Briefing

    Okay, here’s a detailed briefing document summarizing the key themes and ideas from the provided Bloomberg transcript:

    Briefing Document: Global Market & Economic Trends

    Date: October 26, 2024

    Sources: Bloomberg Television Transcript Excerpts

    Executive Summary:

    This briefing document summarizes key market trends and economic developments discussed in recent Bloomberg broadcasts. The main topics covered include: an investigation into potential data theft by a Chinese AI startup (DeepSeek), the robust performance of ASML amidst the AI chip boom, U.S. economic exceptionalism and the state of global markets, Trump administration policies and potential impacts on the economy, the growing importance of private markets, and energy market dynamics in a changing global landscape. There is also a mention of the luxury goods market.

    Key Themes & Ideas:

    1. AI & Technology:
    • DeepSeek Investigation: Microsoft and OpenAI are investigating DeepSeek, a Chinese AI startup, for allegedly “scraping” data from OpenAI’s API to build its model. This process is referred to as “distillation.” This raises questions about the legitimacy of DeepSeek’s rapid progress and challenges the narrative that it achieved performance similar to OpenAI’s on a limited budget.
    • Quote: “It is a rumbling which would be if Microsoft and OpenAI said they found evidence that DeepSeek — the term is distillation, like going and accessing the OpenAI API and basically scraping a lot more data than OpenAI allows. Effectively building the model off the backs of OpenAI’s model.”
    • ASML’s Strong Performance: ASML, a key supplier of chip-making equipment, beat earnings expectations, fueled by demand from the AI sector. This provides reassurance to the tech sector and shows that orders are remaining strong despite market anxieties.
    • Quote: “ASML sells a $300 million device, critical to making these chips. It is not like orders will stop on a dime for the company.”
    • AI Energy Demands: The impending demand for energy created by AI is significant, with data centers requiring vast amounts of power. This growth is projected to be far larger than crypto.
    2. Market & Economic Outlook:
    • U.S. Exceptionalism: BlackRock believes the U.S. market continues to be exceptional, driven by strong economic growth, sticky inflation, strong earnings, and technology leadership. This remains the core investment thesis.
    • Quote: “The thesis about U.S. exceptionalism is founded on three arrows. Strong economic growth, sticky inflation. Strong earnings, very high bar, but thus far has been met and we will see how the week develops. Then, technology and leadership there.”
    • Selective Investing: While the U.S. remains a focal point, a selective approach is crucial across markets, including Europe. Granularity in portfolios is recommended.
    • Fed and ECB Policies: The Federal Reserve is expected to hold steady on interest rates, while the ECB may cut rates twice by mid-year, but the path afterwards is still uncertain.
    • Volatility: The market is currently volatile and investors should consider being nimble in their instruments.
    • Tariffs: The potential for increased trade frictions due to tariffs is a concern.
    • Europe Contrarian View: There is potential upside for Europe, especially as political situations in countries stabilize, despite weaker earnings. Investors are beginning to show interest in the region after a period of low confidence.
    • China Uncertainty: There is little investor interest in China as the future of the market is uncertain due to policies and lack of clarity on trade tensions.
    • Quality and Carry: A quality and carry investment strategy is favored due to the US outlook. In Europe, high yields are favored.
    3. Trump Administration Policies:
    • Spending Freeze: The Trump administration has implemented a temporary freeze on federal grants and loans, causing confusion and panic. There was rapid clarification that this does not affect essential funding like Medicaid or Social Security.
    • Executive Power: There is debate regarding the executive power of the president and the ability to implement significant changes without congressional approval.
    • Return to Office: The administration is pushing for government employees to return to the office, offering buyouts to those who don’t want to work in person.
    • Tariffs: President Trump has threatened widespread tariffs on steel, copper and aluminum.
    • OPEC: Trump is calling on OPEC to lower oil prices, while OPEC is planning on increasing output in April.
    4. Private Markets & Wealth Management:
    • Democratization of Alternatives: There is growing demand for private-market investments from high-net-worth individuals, driven by a desire for diversification and access to opportunities not available in public markets. This push toward democratization reflects increasing awareness of how private markets have grown relative to public markets.
    • Quote: “Really, we are trying to help private investors get access to opportunities they have not been able to get access to before and that is trying to give them better diversification in the portfolios and moving away from those traditional days where alternatives used to be a very small pocket of your portfolio to something where we see potentially wealth investors having allocations up to 50% in private markets.”
    • Regulatory Support: Regulators are becoming more supportive of investors accessing private market opportunities for long-term investments.
    • Diversification: Investors are turning to private markets to escape the volatility of public markets.
    • Liquidity: There is a large opportunity to tap into the wealth of Europe’s high-net-worth individuals by offering better liquidity.
    5. Energy & Commodities:
    • Oil Supply Tightness: Sanctions on Russia are impacting oil supplies, and the market is expected to get tighter in the near term.
    • OPEC Impact: The increase in OPEC production planned for April has potential to impact oil prices, but the market could experience deficits before then.
    • U.S. Production: U.S. oil production growth is slowing due to geological factors. Growing oil output is difficult, and production is at levels similar to pre-COVID 2019, despite increases over recent years.
    • Range-Bound Oil: Oil prices have been relatively range-bound for the last 30 months.
    • Financial Investor Absence: Financial investors have largely lost interest in oil, and it would require significant market movement to encourage them to invest.
    • Supply Chain Fragility: Supply chain fragility is a major issue in the energy sector, particularly with renewables.
    • Energy Transition Motivation: The motivation for energy transition is shifting from fear of running out of oil to energy security and national concerns, which could lead to faster progress.
    • Real Asset Opportunities: Real assets, such as infrastructure, real estate, and managed futures, are becoming more attractive to investors.
    6. Luxury Goods Market:
    • LVMH Disappointment: LVMH has not performed as well as expected, casting doubts on the prospects of a quick recovery for the sector.
    • US Sales: The US has been the most active luxury market, with China and other markets slower to recover.
    • Potential Break-Up: A potential break-up of the LVMH group has been mentioned, as the valuation is being impacted by sectors such as wine and spirits. A pure luxury business might be more profitable.
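The “time spreads” mentioned in the energy discussion (and defined in the glossary above) can be made concrete with a toy calculation. The prices below are invented for illustration:

```python
# Hypothetical futures prices in USD per barrel.
near_month = 78.50       # contract for prompt delivery
six_months_out = 75.20   # contract for delivery six months later

time_spread = near_month - six_months_out
print(round(time_spread, 2))  # prints: 3.3

# A positive spread (backwardation) is typically read as tight prompt supply
# and low inventories; a negative spread (contango) signals ample supply and
# demand for storage.
state = "backwardation" if time_spread > 0 else "contango"
print(state)  # prints: backwardation
```

This is why analysts watching supply tightness and low inventories look at the shape of the futures curve rather than just the spot price.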

    Conclusion:

    The global economic landscape is complex and dynamic. The technology sector continues to drive significant change, but faces questions around data ownership and energy demands. Geopolitical factors, particularly policies from the Trump administration and international conflicts, are impacting trade and energy markets. Investors need to be selective and adaptable, considering both public and private markets and alternative assets.

    This briefing document is intended to provide a snapshot of current themes and should be used in conjunction with further research and analysis.

    Global Tech, Finance, and Energy Trends

    FAQ

    • What is the investigation into DeepSeek about, and why is it significant? DeepSeek, a Chinese AI startup, is under investigation by Microsoft and OpenAI for potentially acquiring unauthorized data from OpenAI’s technology. The concern is that DeepSeek may have used a technique called “distillation” to scrape large amounts of data through the OpenAI API, exceeding the allowed limits. This could have enabled them to build their model on the foundation of OpenAI’s data without permission, undermining the perception that they developed their technology from scratch on a shoestring budget. If true, this could be a serious breach of terms of service and potentially intellectual property theft.
    • How are the recent earnings of ASML affecting the tech sector, especially in chip manufacturing? ASML, a company critical to making advanced chips with its $300 million devices, recently reported earnings that beat expectations, which has provided a boost to the tech sector. The positive news has reassured investors, erasing some losses that the sector has been facing. This performance suggests that despite potential shifts in the market, the demand for essential chipmaking technology is still strong, signaling that orders might remain steady for ASML and its suppliers, particularly amid the ongoing growth in AI.
    • What are the main factors weighing on investors’ minds according to BlackRock? Several factors are weighing heavily on investors, including the Federal Reserve’s decisions on interest rates, potential trade frictions from Trump’s tariffs, the ongoing developments in AI, and corporate earnings. BlackRock highlights a focus on U.S. exceptionalism, driven by strong economic growth, high earnings, and technological leadership. However, the recent volatility in chipmakers and power sectors has created some uncertainty. Investors are also closely monitoring global factors like sentiment and liquidity conditions in the U.K., potential back-to-back rate cuts in Europe, and the impact of a strengthening U.S. dollar.
    • How are investors approaching the current market volatility, particularly with the competing forces affecting the U.S. dollar? Investors are advised to “stay the course” and adhere to a long-term strategy. While maintaining their core strategy, they are also seeking opportunities by using granular instruments to capture alpha (excess returns). There’s recognition that, in this volatile environment, they must use nimble tools to implement their strategy. The market has been showing some divergences. For instance, the top S&P stocks are decoupling from the rest of the S&P. This calls for greater granularity in portfolio construction.
    • What are the main themes in private wealth management and why is there interest in private markets? There is growing recognition that a significant number of large companies are now in the private market, so wealthy investors are seeking access to private markets to diversify their portfolios and benefit from potential opportunities in this space. Private equity firms are now focusing on helping wealthy investors move beyond traditional portfolios (fixed income and public equities) that are limited in terms of liquidity and long-term returns. Wealth managers and other platforms are seeking to provide their clients with access to these alternative investments.
    • How is the Trump administration impacting government spending and what are the consequences of its policies? The Trump administration has implemented several broad directives, including a temporary freeze on government spending, which has led to unintended consequences like difficulties in accessing federal payment portals for some states. There’s also concern that these policies will hurt research, including crucial tech and AI projects. Additionally, the administration is attempting to reshape the government by offering buyouts to federal workers who do not want to return to the office, while also removing government oversight from some agencies.
    • What is the current outlook for oil markets, and what are the factors influencing oil prices? The oil market is facing several complexities, including sanctions on Russia, potential tariffs on Canadian oil, and the upcoming increase in OPEC output. The market seems tight because of low global inventories and some evidence that sanctions on Russia may be affecting supply. While the current price of oil has been range-bound for some time, potential supply constraints and the seasonality of demand could lead to price volatility. Additionally, financial investors are absent from the market, adding to the complexity.
    • How is the increasing demand for AI impacting the energy sector, particularly regarding data centers? The surge in AI development is expected to substantially increase the demand for energy, particularly with the energy needs of AI-driven data centers, on top of existing demand from crypto and cloud computing. Currently, the power sector makes up 20% of global energy use. A relatively small growth in AI-related power demand (2-3%) could strain existing energy infrastructure, especially since there has been underinvestment in the power grid. This is combined with supply chain vulnerabilities and the intermittency of renewable energy sources. The energy transition will likely continue, although motivations for it may shift towards energy security concerns.
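    The strain argument in the last answer is simple arithmetic: a 2-3% rise in power-sector demand is small in percentage terms but large in absolute energy. The sketch below illustrates the calculation; the global energy total is an assumed round number for illustration, not a figure from the source.

```python
# Back-of-the-envelope for the data-center point above: a few percent of
# extra power-sector load is large in absolute terms. The global energy
# total below is an illustrative assumption, not a number from the source.

GLOBAL_ENERGY_TWH = 180_000   # assumed global annual energy use, in TWh
POWER_SECTOR_SHARE = 0.20     # "the power sector makes up 20%"

def extra_demand_twh(ai_growth_share: float) -> float:
    """AI-related growth expressed as a share of power-sector demand."""
    return GLOBAL_ENERGY_TWH * POWER_SECTOR_SHARE * ai_growth_share

if __name__ == "__main__":
    for g in (0.02, 0.03):    # the 2-3% growth range mentioned above
        print(f"{g:.0%} growth -> {extra_demand_twh(g):,.0f} TWh of new demand")
```

    Even at the low end, the increment is hundreds of terawatt-hours per year, which is the scale at which underinvestment in the grid starts to matter.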

    AI Chipmaking: Market Trends and Risks

    The sources discuss AI chipmaking in the context of several different angles, including company performance, market trends, and potential risks. Here’s a breakdown:

    • ASML’s performance: ASML, a company that sells devices critical to making chips, beat earnings estimates, which is seen as a positive sign for the tech sector and the AI chip-making industry [1, 2]. The company sells a $300 million device that is critical for making chips, and while there was some market concern that orders for these devices would stop, that has not happened [3].
    • Demand for AI chips: The demand for ASML’s chipmaking machines is being driven by the AI boom [2].
    • DeepSeek investigation: There is an investigation into a Chinese AI startup called DeepSeek, which is suspected of obtaining unauthorized data output from OpenAI technology [1].
    • DeepSeek is being investigated for potentially “scraping” data from the OpenAI API, which would mean building their model off of OpenAI’s model [1]. This is referred to as “distillation” [1].
    • There is speculation that DeepSeek may have had a “head start” by using data from OpenAI [1]. This could undermine the thesis that they were able to build something on par with OpenAI on a shoestring budget without using the latest GPUs [1].
    • Potential impact of DeepSeek on the market: If DeepSeek did use OpenAI data, it could undercut the idea that the company built its model on a small budget without the latest GPUs [1]. There is no suggestion that Alibaba has done the same thing, but it is the kind of well-capitalized company from which one might expect such a model to come [1].
    • Broader market trends:
    • The technology sector is considered a key component of U.S. market exceptionalism [3].
    • There is a focus on chipmakers and the power they hold within the market [3].
    • The market is interested in the potential benefits of blending top 20 and S&P equal weight stocks [3].
    • Energy consumption: The energy demand for AI is yet to be seen, and current data center demand is mostly driven by crypto mining [4, 5]. The potential growth in AI could have a big impact on the demand for energy [5].
    • Impact of government policies: Government actions, such as potential tariffs, could affect the supply chains for the tech industry [2, 6]. Additionally, a temporary freeze on federal grants and loans has sparked panic in the tech sector because it could affect research and AI projects [2, 7].
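    The “distillation” technique described above amounts to harvesting a stronger model’s outputs through its API and using them as supervised training data for a smaller model. Below is a minimal sketch of the data-collection half, with a stub standing in for the teacher API; this illustrates the general technique, not DeepSeek’s actual pipeline.

```python
import json

# Sketch of API-based distillation: query a "teacher" model, then save
# its answers as (prompt, completion) pairs to fine-tune a student on.
# teacher_answer is a hypothetical stub, not a real API client.

def teacher_answer(prompt: str) -> str:
    """Stand-in for a chat-completion API call to a large teacher model."""
    canned = {
        "What is 2+2?": "2 + 2 = 4.",
        "Capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts):
    """Collect teacher responses as supervised training pairs."""
    return [{"prompt": p, "completion": teacher_answer(p)} for p in prompts]

if __name__ == "__main__":
    dataset = build_distillation_set(["What is 2+2?", "Capital of France?"])
    # A student model would now be fine-tuned on these pairs.
    print(json.dumps(dataset, indent=2))
```

    Done at scale against a commercial API, this kind of bulk harvesting is exactly what terms of service typically prohibit, which is the crux of the investigation.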

    In summary, the AI chip-making industry is experiencing high demand, as shown by ASML’s earnings. However, there are also potential challenges like the DeepSeek investigation and uncertainties around energy demand and government policies.

    Fed Holds Steady Amidst Market Uncertainty

    The sources discuss the Federal Reserve’s (Fed) decision in the context of its potential impact on markets and the economy [1, 2]. Here’s what the sources say about the Fed’s decision:

    • Expected Action: The Fed is expected to keep interest rates steady [2-4]. The sources suggest that the Fed will likely provide limited forward-looking guidance [3].
    • Market Impact: The Fed’s decision is a key factor influencing investor sentiment and market volatility [2, 3]. The market experienced volatility three weeks ago due to the Fed’s actions, which also affected U.K. gilts [3].
    • Broader Economic Context:
    • The Fed’s decision is taking place during earnings season [2].
    • The U.S. economy is experiencing strong growth and sticky inflation [2].
    • The sources highlight the theme of U.S. exceptionalism, with technology leadership being a key component [2].
    • Comparison to ECB: The European Central Bank (ECB) is expected to make back-to-back rate cuts in the middle of the year [3]. However, the future direction of the ECB is more uncertain than that of the Fed [3].
    • Uncertainty and Competing Forces: There are many competing forces that create uncertainty for the market, including potential tariffs, the possibility of Donald Trump wanting a lower dollar, and regulatory uncertainty [2, 3, 5].
    • Investment Strategy: Despite the uncertainty, financial advisors recommend investors stay the course [3]. They also suggest that there are opportunities to capture alpha through the use of granular investment instruments [3].
    • Impact of Trump Administration: The Trump administration’s actions, such as a temporary freeze on federal grants and loans, could impact research and AI projects, potentially adding another layer of uncertainty to the markets [4]. There are also concerns about the level of executive power, especially in relation to fiscal matters that typically fall under the purview of Congress [6].

    In summary, the Fed is expected to maintain steady interest rates, but its decision is taking place amid market volatility and uncertainty due to other factors. The Fed’s decision is an important consideration for investors as they navigate these market conditions.

    The Tech Sector: Growth, Challenges, and Uncertainty

    The sources provide several insights into the tech sector, covering company performance, market trends, and potential challenges. Here’s a breakdown of the key themes:

    • ASML’s strong performance: ASML, a company that produces chip-making devices, has seen a surge in orders, particularly due to the demand created by the AI boom [1, 2]. This indicates that the chip manufacturing part of the tech sector is currently experiencing growth and increased demand [1, 2]. The company sells a $300 million device critical for making chips [3]. The market was concerned that orders for these devices would stop, but that did not happen [3].
    • AI and Chipmaking:
    • The demand for ASML’s chipmaking machines is being driven by the AI boom, indicating a strong link between the AI sector and chip manufacturing [1, 2].
    • The sources also note that there is an investigation into DeepSeek, a Chinese AI startup, for potentially using unauthorized data from OpenAI [1, 2]. This could undermine the idea that the company was able to build an AI model on a small budget without the latest GPUs, as it may have had a “head start” using data from OpenAI [1].
    • The tech sector is a key component of what is referred to as “U.S. exceptionalism”, which is based on three pillars: strong economic growth, sticky inflation, and technology leadership [3].
    • Market Trends: There is a focus on chipmakers and the power they hold within the market [3].
    • The market is interested in the potential benefits of blending top 20 and S&P equal weight stocks, which reflects a nuanced approach to investing within the tech sector [3].
    • The tech sector is experiencing some volatility in the market [3].
    • The sources suggest that the technology sector is a key driver of US market performance [3].
    • Energy Consumption: The energy demand for AI is yet to be fully realized [4]. Currently, data center energy demand is mainly driven by crypto mining [4]. The potential growth of AI could significantly increase the demand for energy, which is a challenge to meet given underinvestment in power grids [4, 5].
    • Government policies: Government actions, such as potential tariffs, could impact the supply chains for the tech industry [3, 6].
    • The Trump administration’s temporary freeze on federal grants and loans has caused concern in the tech sector because it could affect research and AI projects [2, 7].
    • There are also concerns about the level of executive power, particularly regarding fiscal matters that usually fall under the control of Congress [7].
    • Potential for Disruption: There is a possibility that the trend of public companies being bought out by private investors could lead to less accessibility to these companies [8].
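    The “blending top 20 and S&P equal weight” idea above reduces to a weighted mix of two return streams: a concentrated mega-cap sleeve and an equal-weight sleeve. A minimal sketch with hypothetical returns:

```python
# Two-sleeve blend: mix a concentrated top-20 sleeve with an equal-weight
# sleeve at a chosen weight. The return figures below are hypothetical.

def blend_return(top20_ret: float, equal_weight_ret: float, w_top20: float) -> float:
    """Weighted-average return of the two-sleeve portfolio."""
    if not 0.0 <= w_top20 <= 1.0:
        raise ValueError("w_top20 must be between 0 and 1")
    return w_top20 * top20_ret + (1.0 - w_top20) * equal_weight_ret

if __name__ == "__main__":
    # Hypothetical year: top-20 sleeve +30%, equal-weight sleeve +8%.
    for w in (0.0, 0.5, 1.0):
        print(f"{w:.0%} in top-20 -> {blend_return(0.30, 0.08, w):.2%}")
```

    Shifting the weight trades the concentration risk of the mega-caps against the broader exposure of the equal-weight sleeve, which is the “granularity” in portfolio construction the strategists describe.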

    In summary, the tech sector is experiencing a surge in demand related to AI and chipmaking [1, 2]. This growth is coupled with new challenges that include: investigations into unauthorized AI data usage, the rising demand for energy, and the potential impacts of governmental policies [1, 2, 4]. These factors contribute to volatility and uncertainty in the tech sector, which is nevertheless considered a key driver of U.S. market performance [3].

    Private Markets in Wealth Management

    The sources discuss private markets in the context of wealth management, investment strategies, and the broader financial landscape. Here’s a breakdown of the key themes:

    • Increased Access for Wealth Investors: There’s a notable trend of wealth investors seeking access to private market opportunities [1]. This is driven by a recognition that many large companies are now private, and investors want exposure to those kinds of investments [1]. It marks a shift from traditional portfolios, in which alternatives were a small sleeve, toward potential allocations of up to 50% in private markets for wealth investors [1].
    • Democratization of Alternatives: The move towards private markets is seen as a “democratization of alternatives,” where private banks, wealth managers, and platforms are seeking better access to these opportunities [1, 2]. This is because there used to be 8000 public companies in the US, but now there are only 4000 [1]. The majority of large companies are now in the private world [1].
    • Regulatory Support: Regulators are recognizing the benefits for investors to access private markets, especially for long-term investments [2]. They acknowledge that investors shouldn’t necessarily be limited to daily liquid mutual funds or 100% liquidity [2]. This indicates a shift in the regulatory environment that supports the growth of private markets.
    • Diversification: Private markets are being looked at as a way to diversify portfolios, particularly as investors seek to step away from volatile public markets [2, 3]. The high correlation experienced across public markets in 2022 has driven the need to diversify into less liquid markets [3]. Investors are moving away from fully liquid markets to reduce volatility; the promise of private markets is to offer excess returns per unit of risk and uncertainty [3].
    • Replacing Public Market Allocations: Private markets are being considered as replacements for traditional public market allocations, particularly in fixed income and equity [3]. One example is a diversified public market strategy being offered as an equity replacement [3].
    • Impact of Public Market Volatility: The volatility in public markets is driving interest in private markets [3]. For example, the drop in Nvidia’s stock is noted as one of the examples of the need to diversify via markets that are not as liquid [3].
    • Types of Private Assets: The sources note that investors want to be in “real assets,” which include liquid private markets, infrastructure and real estate. They are also interested in liquid alternatives such as managed futures [4].
    • European Interest: There’s growing interest in Europe for private markets, with nearly $3 billion in inflows year-to-date [5]. This suggests that investors are warming up to the idea of private markets in Europe despite some concerns about political and economic stability in that region [5, 6].
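    The diversification point above follows from the standard two-asset volatility formula: portfolio volatility falls as the correlation between the sleeves drops, which is why 2022’s high public-market correlations pushed investors toward less correlated private assets. A sketch with illustrative numbers, not source data:

```python
import math

# Two-asset portfolio volatility:
#   sqrt(w1^2 s1^2 + w2^2 s2^2 + 2 w1 w2 s1 s2 rho)
# All volatilities and correlations below are illustrative assumptions.

def portfolio_vol(w1: float, vol1: float, vol2: float, corr: float) -> float:
    """Volatility of a two-asset mix; w1 in asset 1, the rest in asset 2."""
    w2 = 1.0 - w1
    var = (w1 * vol1) ** 2 + (w2 * vol2) ** 2 + 2 * w1 * w2 * vol1 * vol2 * corr
    return math.sqrt(var)

if __name__ == "__main__":
    # 60/40 mix of a 20%-vol public sleeve and a 15%-vol private sleeve.
    print(f"corr 0.9: {portfolio_vol(0.6, 0.20, 0.15, 0.9):.3f}")
    print(f"corr 0.3: {portfolio_vol(0.6, 0.20, 0.15, 0.3):.3f}")
```

    The lower-correlation mix has visibly lower volatility for the same weights, which is the diversification benefit the sources attribute to private allocations (setting aside the separate question of how private asset volatility is measured).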

    In summary, private markets are gaining traction as investors seek diversification, higher returns, and access to a broader range of investment opportunities. The trend is supported by regulatory changes and a recognition of the importance of private companies in the current financial landscape. The volatility in public markets is also driving interest in private markets.

    Luxury Stock Market Analysis

    The sources discuss luxury stocks in the context of market performance, consumer behavior, and global economic factors. Here’s a breakdown of the key points:

    • LVMH’s Performance: LVMH, a major luxury goods company, experienced a share slump after its fashion and leather goods sales fell in the fourth quarter [1]. This was considered a disappointment, as the company did not outperform expectations to the degree that other companies in the sector did [1]. Although LVMH performed slightly better than analysts’ expectations, the market had raised its expectations and this was not considered good enough [1].
    • Consumer Base: The U.S. market is currently the primary driver of luxury sales [1]. There appears to be a correlation between the U.S. market and Bitcoin, and sales have seen some recovery since the election [1].
    • The Chinese market is not as strong as it once was, which is affecting the luxury sector [1, 2].
    • Millennials are slowly returning to the market, but their impact is not yet significant [1].
    • Factors Influencing Luxury Stocks:
    • Chinese Economy: The performance of luxury stocks, particularly in Europe, is tied to the Chinese economy. It is difficult to determine if fluctuations in the market are due to the Chinese economy or not [2].
    • Tariffs: There is concern about the possibility of tariffs and their potential impact on luxury goods companies [2, 3].
    • Manufacturing: There are suggestions that luxury companies could increase manufacturing in the U.S. given the current economic policies being implemented there [3]. Some luxury companies already manufacture leather goods in Texas and California [3].
    • Geopolitical Tensions: The stabilization of politics in individual countries, coupled with the potential resolution of geopolitical tensions, may positively influence luxury stock performance [2].
    • Earnings: While earnings in Europe are not as strong as in the U.S., the bar for earnings is lower in Europe, which could lead to potential upside surprises [2].
    • Valuation: LVMH’s valuation may be penalized by the company including wines, spirits, and duty-free businesses in its portfolio. There is a possibility that concentrating on “pure luxury” could unlock value, as LVMH’s price-to-earnings (P/E) ratio is around 55 [3].
    • Market Sentiment:
    • Luxury stocks experienced a downturn, with luxury being “shot down” in the market [2].
    • There is a general sense of uncertainty and volatility in the luxury sector, with competing forces making it difficult to predict future performance [2, 4].
    • Selectivity is the best approach when investing in the European market, where luxury stocks are particularly difficult to read [2].
    • Comparison to Other Sectors: The sources contrast the performance of luxury stocks with the tech sector and the energy sector.
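    The “pure luxury could unlock value” argument above is a sum-of-the-parts case: if each segment were valued at its own sector multiple, the total might exceed what the group’s blended multiple implies. A toy sketch; every figure below is made up for illustration and is not a valuation of LVMH:

```python
# Sum-of-the-parts vs. a blended multiple. Earnings and multiples are
# hypothetical illustration values, not actual LVMH figures.

def price_to_earnings(price: float, eps: float) -> float:
    """Plain P/E ratio: share price divided by earnings per share."""
    return price / eps

def sum_of_the_parts(segments) -> float:
    """Value each (earnings, multiple) segment at its own multiple."""
    return sum(earnings * multiple for earnings, multiple in segments)

if __name__ == "__main__":
    blended = sum_of_the_parts([(12, 22)])         # whole group at one multiple
    split = sum_of_the_parts([(10, 30), (2, 15)])  # luxury and spirits re-rated
    print(blended, split)  # the split valuation comes out higher
```

    When the market awards the higher-multiple business its own rating instead of averaging it with lower-multiple segments, the implied value rises, which is the logic behind break-up speculation.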

    In summary, luxury stocks are currently facing challenges due to a mix of factors, including weaker Chinese demand, uncertainty in the European market, and potential shifts in manufacturing and trade policies. The U.S. market is a key driver of sales in the sector. The performance of LVMH, a bellwether in the luxury industry, suggests that the sector is facing difficulties, and selectivity is necessary when considering investments.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • DeepSeek’s AI Disruption: Open Source, Efficiency, and Global Impact

    DeepSeek’s AI Disruption: Open Source, Efficiency, and Global Impact

    DeepSeek, a relatively unknown Chinese AI company, has disrupted the AI industry by releasing Janus Pro, a powerful open-source multimodal AI model that rivals leading models like OpenAI’s DALL-E 3, at a fraction of the cost. This achievement, coupled with DeepSeek’s R1 language model, which matches the performance of OpenAI’s o1, has sent shockwaves through the tech industry, impacting stock prices and prompting debate about AI development strategies and US export controls. DeepSeek’s success, however, is not without controversy, raising concerns about its ties to the Chinese government and the potential security risks associated with its open-source approach. A cyberattack also caused a temporary disruption of the company’s services.

    DeepSeek AI: A Study Guide

    Short Answer Quiz

    1. What is Janus Pro, and what are its key capabilities?
    2. What is significant about DeepSeek’s R1 language model, in terms of cost and performance?
    3. Describe the cybersecurity incident DeepSeek experienced and its impact.
    4. How does DeepSeek’s approach to releasing models contrast with that of companies like OpenAI?
    5. According to tests, where does Janus Pro excel and where does it fall short in image analysis?
    6. What was the market reaction to DeepSeek’s success, and how did Nvidia’s stock perform?
    7. How did OpenAI’s CEO Sam Altman respond to the emergence of DeepSeek’s AI?
    8. How did President Trump’s administration react to DeepSeek’s success in AI development?
    9. What are some concerns surrounding DeepSeek’s possible ties to the Chinese government?
    10. What strategies does DeepSeek employ to achieve cost-effective AI development?

    Short Answer Quiz – Answer Key

    1. Janus Pro is a multimodal AI model family developed by DeepSeek that can handle tasks such as image generation (up to 768×768 resolution), image analysis, and text-based conversation. It aims to be an “all-in-one” AI solution.
    2. DeepSeek’s R1 language model is significant because it reportedly matched the performance of OpenAI’s o1 but was developed for only around $5–6 million, a dramatically lower cost than the billions spent by large AI labs.
    3. DeepSeek experienced a cyberattack right after their AI assistant app reached the top of the Apple App Store, which resulted in website crashes and temporary registration limits. This incident happened as the app went viral.
    4. Unlike companies like OpenAI that keep their models proprietary, DeepSeek has made the code and weights for its Janus Pro models open source, available for anyone to download and use on platforms like Hugging Face.
    5. Janus Pro excels at straightforward image analysis, like describing the position and appearance of objects. However, it struggles with deeper reasoning tasks, such as interpreting metaphors or implied meanings in images.
    6. The market reacted to DeepSeek’s success by causing a sharp downturn in tech stocks, with Nvidia’s stock plummeting by hundreds of billions of dollars due to the suggestion that expensive chips might not be necessary for top-tier AI development.
    7. Sam Altman acknowledged being impressed by DeepSeek’s achievements but stated that OpenAI plans to respond by developing even better models while continuing to invest heavily in computing resources, not backing down from large spending.
    8. President Trump characterized DeepSeek’s AI release as a wake-up call for US industries, advocating for a focus on competing to win in AI and unleashing American tech companies by removing some of the export restrictions.
    9. Concerns about DeepSeek’s possible ties to the Chinese government include the potential for compromised user data or censorship, as some have noted the AI assistant avoids answering questions about the Chinese government or President Xi Jinping.
    10. DeepSeek achieves cost-effective AI development using techniques such as focusing on the most relevant data, utilizing open-source projects from Alibaba and Meta, and fine-tuning them. These strategies allow the company to save on computing resources.

    Essay Questions

    1. Analyze the potential implications of DeepSeek’s success for the current landscape of AI development and competition. Consider factors like the cost of development, accessibility of models, and the competitive strategies of major tech companies.
    2. Discuss the significance of open-sourcing AI models like Janus Pro. What are the potential benefits and drawbacks of this approach, particularly when compared to the proprietary models of companies like OpenAI?
    3. Explore the interplay of economic, political, and technological factors at play in the DeepSeek story. How do issues like trade restrictions, global competition, and geopolitical dynamics influence the trajectory of AI development?
    4. Assess the performance of DeepSeek’s Janus Pro model by referencing specific details from the source. What are its strengths and limitations, and how does it compare to models from larger labs?
    5. What conclusions can be drawn from DeepSeek’s rise regarding the need for massive budgets and resources in AI development? Should the traditional model of heavily funded, resource-intensive projects be re-evaluated, and what kind of changes might be beneficial for innovation and growth?

    Glossary of Key Terms

    • Multimodal AI: Artificial intelligence systems that can process and understand multiple types of data, such as text, images, and audio, in a unified manner.
    • Benchmarks (in AI): Standardized tests or datasets used to measure the performance of AI models on specific tasks, like image generation or natural language processing.
    • Parameter (in an AI Model): A variable that the model learns during training to adjust its performance. Larger parameter counts generally mean more complex models.
    • Transformer Architecture: A neural network architecture that excels at sequence-to-sequence tasks, such as language translation, and that parallelizes well on GPUs. It forms the basis of many large models today.
    • Open Source: Software or data whose source code is freely available and modifiable, as opposed to proprietary software.
    • Hugging Face: A collaborative platform for AI and machine learning, including model repositories and datasets, that enables the open-source movement.
    • Generative Models: AI models that create new data instances, like images, text, or audio, that are similar to the data they were trained on.
    • Fine-tuning: A process in which a pre-trained model is further trained on a more specific dataset to enhance its capabilities for a target task.
    • API (Application Programming Interface): A set of rules and protocols that allows different software applications to communicate with each other.
    • Artificial General Intelligence (AGI): A hypothetical type of AI with human-level general intelligence that can perform any intellectual task a human being can.

    DeepSeek: Disrupting the AI Landscape


    Briefing Document: DeepSeek’s Rise and Impact on the AI Landscape

    Executive Summary:

    This document analyzes the recent emergence of DeepSeek, a Chinese AI company that has disrupted the industry with its highly performant yet cost-effective AI models. DeepSeek’s R1 language model and Janus Pro multimodal model, trained on less expensive hardware, have challenged the established dominance of Western tech giants, raising questions about current AI development strategies and the effectiveness of US export controls. The company’s open-source approach, combined with its rapid rise in popularity, has triggered stock market volatility, political discussions, and a scramble among competitors to re-evaluate their approaches.

    Key Themes and Ideas:

    1. Disruptive Performance and Efficiency:
    • Janus Pro Model: DeepSeek’s multimodal AI model family, particularly the 7B version, has shown impressive performance on benchmarks like GenEval and DPG-Bench, allegedly surpassing established models like OpenAI’s DALL-E 3, PixArt-α, and Emu3-Gen.
    • Quote: “…this model supposedly beats OpenAI’s DALL-E 3 and some other big names like PixArt-α and Emu3-Gen on benchmarks like GenEval and DPG-Bench.”
    • R1 Language Model: DeepSeek’s R1 language model reportedly matched the performance of OpenAI’s o1, but was developed at a drastically lower cost (around $5–6 million, compared to the billions spent by Silicon Valley labs).
    • Quote: “…it apparently matched o1’s performance, but, get this, while costing only around $5 or $6 million to develop. Compare that to the billions that big AI labs in Silicon Valley are spending.”
    • Cost-Effectiveness: DeepSeek’s achievements challenge the assumption that vast resources are required for leading-edge AI, suggesting that innovative training techniques can yield similar results at a fraction of the cost.
    • Quote: “…if a Chinese startup can replicate results at a tenth of the usual cost…”
    2. Open-Source Approach vs. Proprietary Models:
    • Open Source: Unlike companies like OpenAI, DeepSeek has open-sourced both the code and weights of its Janus Pro models on Hugging Face, allowing the community to freely access, use, and modify them.
    • Quote: “DeepSeek put the model’s code and weights up on Hugging Face for anyone to download right away. That’s in stark contrast to companies like OpenAI that keep everything behind closed doors and proprietary APIs.”
    • Community Driven Development: This approach allows for rapid iteration and improvements by the broader AI community, potentially enhancing the models further.
    • Quote: “…people out there can tinker, apply specialized datasets, improve the code, and basically push the model to new heights…”
    • Potential for Fine-Tuning: The open-source nature enables users to fine-tune the models for specific tasks or domains.
    3. Multimodal Capabilities and Performance Analysis:
    • Versatile Functionality: Janus Pro is presented as a unified Transformer architecture capable of image generation, image analysis, and text-based tasks.
    • Image Analysis: While it excels at describing basic elements in images, it falls short of understanding complex, implied meanings.
    • Quote: “…it did well at describing straightforward things like the position of objects or their appearance, but it kind of fell short when deeper reasoning was required.”
    • Image Generation: Janus Pro can produce decent images, but might lack sharpness or artistic flair compared to specialized models like Stable Diffusion.
    • Quote: “…Janus Pro can produce decent images but might struggle in certain areas like overall sharpness or artistic flair compared to specialized state-of-the-art image models…”
    • Strengths: Versatility and fidelity to text prompts appear to be areas of strength.
    4. Market and Financial Impact:
    • Stock Market Volatility: DeepSeek’s emergence led to a significant drop in Nvidia’s stock price, suggesting a perceived shift in the demand for high-end AI chips.
    • Quote: “…Nvidia’s shares reportedly plummeted, causing a huge dip in market value, like $600 billion, in a single day…”
    • Reevaluation of AI Investment: Investors and tech companies are re-evaluating the necessity of large-scale investments in computing infrastructure for AI development.
    • Quote: “…people started questioning whether the AI investment arms race is misguided if a Chinese startup can replicate results at a tenth of the usual cost…”
    • Challenge to Big Tech: The rapid rise of DeepSeek has unsettled large AI companies like OpenAI, prompting a re-evaluation of their strategies.
    • Quote: “…the assumption that you need billions of dollars and thousands of the absolute best Nvidia chips to train competitive AI might be wrong; at least that’s what DeepSeek is suggesting…”
    5. Geopolitical and Strategic Implications:
    • US Export Controls: DeepSeek’s success raises questions about the effectiveness of US export controls on advanced chips aimed at slowing down China’s AI advancements.
    • Quote: “There’s talk about how US export controls on advanced chips, particularly from Nvidia, are meant to slow down Chinese AI progress, yet DeepSeek claims they used Nvidia’s H800 chips for training…”
    • Political Reaction: President Trump’s comments reflect the political concern over losing technological leadership and the need for the US to regain its competitive edge.
    • Quote: “President Trump … commented that the release of DeepSeek AI from a Chinese company should be a wake-up call for our industries…”
    • National Security Concerns: There are concerns about DeepSeek’s potential ties to the Chinese government and the implications for data security and censorship.
    • Quote: “…some critics worry about possible security risks. The question arises: could DeepSeek be closely tied to the Chinese government in ways that compromise user data or lead to censorship?”
    1. DeepSeek’s Rapid Rise and Challenges:
    • Viral Popularity: DeepSeek’s AI assistant app quickly rose to the top of Apple’s App Store in the US, surpassing even ChatGPT in popularity.
    • Server Overload: The surge in users resulted in server outages and temporary restrictions on registrations.
    • Cyberattack: DeepSeek experienced a cyberattack coinciding with their app’s popularity surge, further disrupting their services.
    1. DeepSeek’s Methodology and Data:
    • Training Techniques: DeepSeek claims to have used new training techniques that focus on the most relevant data, leading to significant computational resource savings.
    • Open-Source Reliance: They also leveraged existing open-source projects from Alibaba and Meta, fine-tuning them for their specific models.
    • Quote: “they also say they used opsource projects from Alibaba and meta as a springboard fine-tuning them to create their final product…”
    • Cost Discrepancy: Questions remain about the accuracy of DeepSeek’s reported $5.6 million training cost; many believe the true figure is higher, though still far below what Western tech giants spend.
    • Quote: “…the company said they only spent about $5.6 million on training their V3 model but that’s just the final training Pass that might not reflect all the prior experiments and data curation that went into it…”
    1. The Future of AI Development:
    • Open vs. Closed: The emergence of DeepSeek has intensified the debate on whether the future of AI development will be dominated by open or closed ecosystems.
    • Agile vs. Monolithic: DeepSeek’s success challenges the idea that only large, heavily funded companies can achieve significant breakthroughs in AI, indicating that smaller, more agile teams can also be competitive through innovative methods.
    • Existential Risks: The rapid advancements are raising concerns about the existential risks associated with pushing towards super-intelligent AI systems.

    Conclusion:

    DeepSeek’s sudden rise represents a paradigm shift in the AI landscape, challenging the current industry model dominated by large Western tech corporations. The company’s cost-effective methods, combined with its open-source strategy, have ignited widespread debate, triggering market and political ramifications. Whether DeepSeek’s approach is sustainable remains to be seen, but its impact on the AI ecosystem is undeniable. The next phase will likely see established giants scrambling to adapt, open-source community efforts intensifying, and ongoing discussions about the ethical and strategic implications of AI advancements.

    This briefing document provides a comprehensive overview of the key points from the provided text.

    DeepSeek AI: A Disruptive Force in AI Development

    Frequently Asked Questions about DeepSeek AI

    1. What is DeepSeek AI, and what are their notable recent achievements? DeepSeek AI is a relatively new AI company based in Hangzhou, China that has rapidly gained attention for developing highly competitive AI models at a fraction of the cost typically associated with such advancements. They’ve released a multimodal AI model family called Janus Pro, with the 7B version reportedly outperforming models like OpenAI’s DALL-E 3 on certain benchmarks. Additionally, their R1 language model has demonstrated performance comparable to GPT-4 while costing significantly less to develop. These achievements have led to questions about the cost-effectiveness of current AI development strategies.
    2. How does DeepSeek’s Janus Pro model compare to other AI models, specifically regarding image generation and analysis? Janus Pro is designed as a versatile, unified model capable of image generation, analysis, and text-based tasks. While it can generate decent quality images up to 768×768 resolution, it may not achieve the same level of sharpness or artistic flair as specialized models like Stable Diffusion. In image analysis, Janus Pro excels at straightforward object descriptions but struggles with tasks requiring deeper reasoning, like interpreting metaphors. Its strength lies more in versatility than in being the absolute best in any specific area.
    3. What is the significance of DeepSeek open-sourcing their models, such as Janus Pro? DeepSeek’s decision to make the code and weights for their models available on platforms like Hugging Face is a significant departure from the approach of companies like OpenAI that keep their models proprietary. This open-source approach allows the broader community to download, use, and potentially improve the models. It fosters collaborative development and rapid evolution through community fine-tuning and adaptation using specialized datasets.
    4. How did DeepSeek achieve GPT-4 level performance with their R1 model at such a low cost compared to major players? DeepSeek claims to have achieved comparable performance to GPT-4 while spending only around $5-6 million to develop the R1 model, in contrast to the billions spent by larger AI labs. They attribute this cost advantage to employing more efficient training techniques such as focusing on the most relevant data, utilizing open-source projects from Alibaba and Meta as a base, and avoiding the use of the most cutting-edge chips. This challenges the assumption that massive capital expenditure is required for cutting-edge AI advancement.
    5. How has DeepSeek’s emergence impacted the tech industry, particularly in the stock market and among leading AI companies? DeepSeek’s success has shaken the tech industry, leading to a dramatic drop in Nvidia’s stock value as investors question the necessity for top-end chips in AI development. It has also spurred a conversation about whether major tech companies are overspending on AI research and development. Major players such as OpenAI are responding by reasserting the need for significant computing resources, but also recognizing the impressive results of DeepSeek.
    6. What political and economic angles have arisen due to DeepSeek’s emergence as a Chinese AI player? DeepSeek’s rise has intensified debates about the effectiveness of US export controls on advanced chips aimed at slowing down Chinese AI progress. The company’s use of less powerful H800 chips to achieve high performance is calling into question the necessity of top-end chips. It is also fueling political discussions about global competition in the AI space. There are concerns about whether the Chinese government may have influence over or access to DeepSeek AI.
    7. What are the potential security and censorship concerns associated with DeepSeek’s AI models? Due to DeepSeek’s location in China, there are concerns about possible ties to the Chinese government and how that may impact user privacy or lead to censorship. Some have reported that the company’s AI assistant will not answer questions pertaining to the Chinese government or President Xi Jinping, raising concerns about potential limitations and biases within the AI models.
    8. What does DeepSeek’s success suggest about the future of AI development and the balance of power in the industry? DeepSeek’s success story suggests that smaller, more agile teams can compete effectively with large, established players by employing innovative training techniques and making use of open-source resources. It raises the possibility of more cost-effective and diverse approaches to AI development. It is a call to established leaders to innovate beyond simply spending huge sums on computing power, potentially leading to a more balanced AI landscape that is not solely dominated by a few mega corporations.

    DeepSeek’s AI Models: Cost, Performance, and Impact

    DeepSeek has released several AI models that have garnered significant attention, particularly for their performance and cost-effectiveness [1, 2]. Here’s a breakdown of their key models:

    • Janus Pro: This is a multimodal AI model family capable of image generation (up to 768 x 768 resolution), image analysis, and text-based tasks [1, 2]. It utilizes a unified Transformer architecture [2].
    • It comes in different sizes, with the largest being the 7B version, which is considered their flagship model [2].
    • Janus Pro 7B is reported to outperform models like OpenAI’s DALL-E 3, PixArt-α, and Emu3-Gen on benchmarks like GenEval and DPG-Bench, according to DeepSeek’s internal tests [1].
    • While it can accurately describe objects and their positions, it struggles with deeper reasoning, such as interpreting metaphors in images, unlike GPT-4 Vision [2].
    • In image generation, it produces decent images but may lack the sharpness or artistic flair of specialized models [2]. However, it can be more faithful to the prompt [2].
    • The entire model is open source, with code and weights available on Hugging Face for download [2].
    • DeepSeek’s official space on Hugging Face isn’t active yet, so some users have created their own spaces to test Janus Pro 7B [3].
    • R1 Language Model: This language model is notable for apparently matching GPT-4’s performance, but at a fraction of the cost (around $5-6 million to develop) [1]. This is in contrast to the billions spent by big AI labs [1].
    • The R1 model’s performance has led to questions about whether the AI industry is overspending on development [1].
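The model sizes above translate directly into memory requirements for local use. As a rough, illustrative sketch (weights only, ignoring activations and KV cache), storage can be estimated from parameter count and numeric precision:

```python
def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Estimate raw weight storage in gigabytes (1 GB = 2**30 bytes)."""
    return params * bits_per_param / 8 / 2**30

# A 7B-parameter model like Janus Pro 7B at common precisions
# (figures are back-of-the-envelope, not measured values):
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

This is why quantized variants matter for consumer hardware: halving the bits per parameter halves the weight footprint.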

    Key Takeaways about DeepSeek’s Models:

    • Cost-Effectiveness: DeepSeek’s models are developed at a significantly lower cost than those of major AI companies, raising questions about the necessity of massive spending in AI development [1, 3, 4].
    • Open Source Approach: DeepSeek releases its models with open-source code and weights, contrasting with the proprietary approach of companies like OpenAI [2]. This allows for community fine-tuning and improvement [2, 3].
    • Multimodal Capabilities: Janus Pro’s ability to handle both image and text tasks is a key advantage [2].
    • Performance: While DeepSeek claims their models outperform others in certain benchmarks, user testing has revealed areas where they fall short, such as deeper image understanding and image quality [1, 2].
    • Impact: DeepSeek’s advancements have impacted the stock market, with a significant dip in Nvidia’s shares, and have also led to discussions about export controls and AI dominance [3, 4].

    DeepSeek’s emergence as a significant player in the AI field is forcing major tech companies to reconsider their strategies and investments in AI research [5, 6].

    DeepSeek’s Cost-Effective AI Revolution

    DeepSeek’s AI models have brought the concept of cost-effective AI to the forefront, challenging the prevailing notion that massive spending is necessary for achieving top-tier results [1-3]. Here’s a breakdown of how DeepSeek is impacting the discussion around cost-effective AI:

    • Lower Development Costs: DeepSeek’s R1 language model reportedly matched GPT-4’s performance at a development cost of only $5-6 million, compared to the billions spent by major AI labs [1]. This significant difference raises questions about whether the AI industry is overspending on development [1, 2]. DeepSeek claims they spent only about $5.6 million on the final training of their V3 model [3]. Even if the total cost was a few times higher than that, it is still much lower than what is spent by American tech giants [3].
    • Efficient Training Methods: DeepSeek attributes its lower costs to new training techniques, including methods that allow the model to focus on the most relevant sections of data, saving computing resources [3]. They also utilized open-source projects from Alibaba and Meta as a starting point, fine-tuning them to create their models [3]. This approach has sparked debate, with some criticizing DeepSeek for leveraging Western open-source frameworks [3].
    • Impact on the Industry:
    • The success of DeepSeek has caused a stir in the stock market, with Nvidia’s shares plummeting due to the possibility that top-tier AI models can be trained without the most advanced chips [2]. This questions the previously assumed link between high-end hardware and AI performance [2].
    • Major tech companies like Microsoft, Meta, Alphabet, Amazon, and Oracle, which have been allocating massive budgets for AI research and development (R&D) and infrastructure, are now facing questions about their spending strategies [4]. For example, OpenAI has plans to spend up to $500 billion to build a global network of data centers [4].
    • DeepSeek’s success has led to discussions on whether smaller, agile teams can compete with the big players by employing cost-effective methods [5].
    • Open Source Contributions: DeepSeek’s open-source approach further emphasizes cost-effectiveness by enabling community fine-tuning and improvement of the models [6]. By making the code and weights available on Hugging Face, DeepSeek allows others to contribute to the development and potentially enhance the models further [6].
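The cost comparison above is simple arithmetic. Taking the reported $5.6 million figure against a hypothetical $100 million-class Western training run (the comparison figure is an assumption for illustration, not a sourced number), the claimed savings follow directly:

```python
deepseek_train_cost = 5.6e6   # reported final V3 training pass
western_run_cost = 100e6      # hypothetical ballpark for a frontier run

ratio = deepseek_train_cost / western_run_cost
savings_pct = (1 - ratio) * 100
print(f"DeepSeek's reported cost is {ratio:.1%} of the comparison run "
      f"({savings_pct:.0f}% lower)")

# Even if the true cost were several times the reported figure,
# the gap against a nine-figure run would persist:
assert 3 * deepseek_train_cost < western_run_cost
```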

    In summary, DeepSeek has emerged as a significant player challenging the status quo of AI development by demonstrating that high performance doesn’t necessarily require massive spending [1, 5]. Their cost-effective methods and open-source approach have sparked debate and are forcing major tech companies to reevaluate their strategies [2, 5].

    DeepSeek’s Open-Source AI Revolution

    Open-source AI is a key aspect of DeepSeek’s approach and has significant implications for the broader AI landscape. Here’s a breakdown of how DeepSeek is contributing to the open-source AI movement:

    • Accessibility and Transparency: DeepSeek has made the code and weights of its Janus Pro models available on Hugging Face for anyone to download [1]. This open-source approach contrasts with the proprietary methods of companies like OpenAI, which keep their models behind closed doors [1]. By making their models open-source, DeepSeek allows for greater accessibility and transparency in AI development.
    • Community-Driven Improvement: DeepSeek’s open-source strategy enables community involvement in the improvement of its models [1]. The community can fine-tune the models with specialized data sets, enhance the code, and push the models to new heights [1]. This collaborative approach can lead to faster advancements and innovation. The official DeepSeek space on Hugging Face is not yet active, so community members have created their own spaces to test the Janus Pro 7B model [1].
    • Challenging the Status Quo: DeepSeek’s open-source approach challenges the notion that cutting-edge AI development must be dominated by well-funded labs [2]. By making their models accessible, DeepSeek empowers smaller teams and individual researchers to participate in AI innovation [3, 4].
    • Cost-Effectiveness: By utilizing open-source projects from Alibaba and Meta as a starting point, DeepSeek has demonstrated that it is possible to develop high-performing models at a significantly lower cost [3]. This approach allows DeepSeek to leverage existing resources and technologies, reducing the need for massive investments in R&D [3].
    • Broader Impact: The open-source nature of DeepSeek’s models has sparked debate about the competitive landscape in AI and has led to discussions about the sustainability of large-scale investments by major tech companies [2, 5, 6]. It raises questions about whether smaller, more agile teams using open-source tools and methodologies can outperform well-resourced companies [3, 4]. The success of DeepSeek, which used open source projects, has caused some frustration at Meta because they have the resources but were outperformed [3].
    • Potential Security Risks: While DeepSeek’s open-source approach promotes collaboration and accessibility, it also raises concerns about potential security risks. Some critics worry that DeepSeek could be closely tied to the Chinese government and that user data could be compromised or subject to censorship [6]. There have been reports that DeepSeek’s AI assistant will not answer questions about the Chinese government or President Xi Jinping [6].

    In summary, DeepSeek’s commitment to open-source AI is a major factor in its impact on the AI industry. By providing open access to its models and source code, DeepSeek is driving innovation and collaboration, challenging the dominance of well-funded AI labs, and prompting discussions about the future of AI development and accessibility [1, 3, 4].

    DeepSeek and Geopolitical Implications of AI

    DeepSeek’s emergence as a significant player in the AI field has sparked several geopolitical implications, particularly concerning technology competition, export controls, and national security [1-3].

    • Technology Competition: DeepSeek, a Chinese company, has developed AI models that rival those of leading US tech companies, such as OpenAI, but at a fraction of the cost [1, 4]. This has led to concerns that the US may be falling behind in the AI race [2]. The fact that a Chinese company was able to produce a model comparable to GPT-4 using fewer resources raises questions about the effectiveness of current strategies and investments by American labs [1, 2]. The success of DeepSeek is seen as a potential “wakeup call” for US industries, prompting discussions about the need to focus on competing and winning in the tech sector [2].
    • Export Controls: The US has imposed export controls on advanced chips, particularly from Nvidia, to slow down China’s AI progress [1]. However, DeepSeek claims to have used Nvidia’s H800 chips, which are less powerful than the restricted high-end chips, to achieve results comparable to GPT-4 [1]. This development has fueled the debate about the effectiveness of export controls [1, 2]. If Chinese companies can achieve significant AI advancements using available resources, it calls into question the efficacy of the current restrictions [1].
    • National Security: DeepSeek’s rapid rise and success have raised national security concerns [3]. Some critics worry that DeepSeek could be closely tied to the Chinese government, potentially leading to compromised user data or censorship [3]. There have been reports that DeepSeek’s AI assistant does not answer questions about the Chinese government or President Xi Jinping, leading to speculation about its level of independence [3]. The concern is that if AI technology is controlled or influenced by foreign governments, it could pose risks to national security and privacy [3].
    • Global Impact: DeepSeek’s success has also had a global impact, affecting stock prices and investment trends [2, 5]. The dip in Nvidia’s stock prices after DeepSeek’s achievements indicates that the market is reassessing the value of high-end chips for AI training [2]. This shift has significant implications for investment strategies in the tech industry, as it suggests that high-performance AI may be achieved without massive capital expenditure [2, 3].
    • Open Source vs Proprietary: The open-source nature of DeepSeek’s models is also significant [4, 6]. By making their models available to the public, DeepSeek promotes innovation, but it also creates an environment where their technology could be adapted or used by entities that may not align with the interests of the US or its allies [4, 6]. This raises further questions about the implications of open-source AI in a competitive global environment [4, 6].

    In conclusion, DeepSeek’s rapid rise in the AI landscape has brought about several geopolitical implications, forcing countries to reevaluate their tech strategies, export control policies, and national security protocols. The company’s ability to produce high-performing AI models at a lower cost has disrupted the existing power dynamics and highlighted the importance of efficient and cost-effective AI development methods [1, 2, 4, 5].

    DeepSeek’s Disruption of the AI Industry

    DeepSeek’s emergence as a significant player in the AI field has caused considerable disruption in the AI industry, challenging established norms and prompting major shifts in various aspects of AI development, investment, and global competition [1-3]. Here’s a breakdown of the key areas where DeepSeek is driving disruption:

    • Challenging the Need for Massive Spending: DeepSeek’s ability to develop high-performing AI models like the R1 language model and the Janus Pro family at a fraction of the cost compared to major AI labs has called into question the necessity of massive spending in AI development [1, 2, 4]. The R1 model reportedly matched GPT-4’s performance with only around $5-6 million in development costs, while the final training pass of the V3 model cost about $5.6 million [1, 5]. This is in stark contrast to the billions of dollars spent by companies like OpenAI and others [1, 3]. DeepSeek’s efficient training methods, such as focusing on the most relevant data and utilizing open-source projects [5], have demonstrated that high-performance AI can be achieved without exorbitant budgets. This has led to a reevaluation of investment strategies and a questioning of whether the AI industry has been overspending [1, 2].
    • Open-Source vs. Proprietary Approaches: DeepSeek’s commitment to open-source AI by making the code and weights of its Janus Pro models available on Hugging Face [4] has disrupted the traditional proprietary approach of companies like OpenAI [4, 5]. By open-sourcing its models, DeepSeek is promoting transparency, accessibility, and community-driven innovation [4]. This shift challenges the dominance of closed-off models and enables smaller teams and individual researchers to participate in AI development [4, 5]. It also enables community fine-tuning and improvement, potentially leading to faster advancements [4].
    • Stock Market Repercussions: The success of DeepSeek has had a significant impact on the stock market, particularly for companies that manufacture advanced chips like Nvidia. The fact that DeepSeek was able to achieve results comparable to GPT-4 using less powerful chips caused Nvidia’s shares to plummet, resulting in a huge loss in market value [2]. This is because the market is now questioning the link between high-end hardware and AI performance and the assumption that top-tier AI models require the most cutting-edge and expensive chips to train [2, 3].
    • Re-evaluation of Investment Strategies: The demonstration that it is possible to develop top-tier AI at lower costs is forcing major tech companies to reevaluate their massive investments in AI R&D and infrastructure [3]. Companies like Microsoft, Meta, Alphabet, Amazon, and Oracle, which are spending billions on AI research and data centers [3], are facing scrutiny due to DeepSeek’s example of a cost-effective approach [2, 3]. OpenAI’s plans to spend up to $500 billion on a global network of data centers are also now being questioned in light of DeepSeek’s success [3].
    • Geopolitical Implications: DeepSeek’s emergence as a Chinese AI company that can compete with US tech giants [1, 2] has significant geopolitical implications, raising questions about technology competition and export controls [1-3]. The ability of DeepSeek to achieve comparable results with less powerful chips challenges the effectiveness of export controls [1]. There are also national security concerns about DeepSeek’s potential ties to the Chinese government and whether that could compromise user data or lead to censorship [3].
    • Shifting Power Dynamics: DeepSeek’s rise suggests that smaller, agile teams can compete with well-resourced companies by employing cost-effective and open-source methods [1, 5]. This has sparked debate about whether the AI industry will see more innovation coming from smaller teams that are clever with their methods [1, 6].

    In conclusion, DeepSeek is disrupting the AI industry by demonstrating that high-performance AI can be achieved with less spending, challenging the dominance of proprietary AI models, impacting the stock market, forcing a reevaluation of investment strategies, raising geopolitical concerns, and shifting the balance of power within the AI landscape [1-5]. The company’s success is forcing a reconsideration of the long-held assumptions about the costs and strategies associated with AI development and is driving a move towards more efficient, open, and accessible AI [1, 6].

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • DeepSeek: A Crash Course in Local LLMs

    DeepSeek: A Crash Course in Local LLMs

    This video tutorial explores DeepSeek, a Chinese company producing open-source large language models (LLMs). The instructor demonstrates using DeepSeek’s AI-powered assistant online and then focuses on downloading and running various sized DeepSeek R1 models locally using different tools like Ollama and LM Studio. He tests the models on two different machines: an Intel Lunar Lake AI PC dev kit and a workstation with an RTX 4080 graphics card, highlighting hardware limitations and optimization techniques. The tutorial also covers using the Hugging Face Transformers library for programmatic access to DeepSeek models, encountering and troubleshooting various challenges along the way, including memory constraints and model optimization issues. Finally, the instructor shares insights on the challenges and potential of running these models locally versus using cloud-based solutions.
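Tools like Ollama expose a local REST API (by default on port 11434) once a model has been pulled. A minimal sketch of that workflow; the model tag `deepseek-r1:7b` and a running local server are assumptions here, so the actual request is left commented out:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    # Payload shape for Ollama's /api/generate endpoint;
    # stream=False requests one complete JSON response.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    # Assumes an Ollama server is running locally with the model
    # already pulled, e.g. `ollama pull deepseek-r1:7b` beforehand.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_generate_request("deepseek-r1:7b", "Why is the sky blue?")
# print(generate(payload))  # uncomment with a local Ollama server running
```

LM Studio offers an OpenAI-compatible local server that can be called in a similar way.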

    DeepSeek AI Model Study Guide

    Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What is DeepSeek and what is unique about their approach to LLMs?
    2. Briefly describe the key differences between the DeepSeek R1, R1-Zero, and V3 models.
    3. Why is the speculated cost reduction of DeepSeek models a significant factor?
    4. What hardware was used to test DeepSeek models and why were these choices made?
    5. What is an iGPU, and how is it utilized by the AI models?
    6. What were the results of using the deepseek.com AI assistant?
    7. What is Ollama, and how does it assist with local model deployment?
    8. Explain the concept of “distilled” models in the context of DeepSeek.
    9. What is LM Studio and how does it differ from Ollama in its deployment of LLMs?
    10. What were some of the challenges encountered when attempting to run DeepSeek models locally?

    Quiz Answer Key

    1. DeepSeek is a Chinese company that develops open-weight large language models (LLMs). They are unique in their focus on cost reduction, aiming to match the performance of models like OpenAI’s at a fraction of the cost, largely through training optimizations.
    2. R1-Zero is a model trained with reinforcement learning that exhibited reasoning capabilities but had readability issues. R1 was further trained to mitigate these issues. V3 is a more advanced model with additional capabilities, including vision processing and a mixture-of-experts architecture.
    3. The speculated 95-97% cost reduction is significant because training and running large language models typically cost millions of dollars. This drastic reduction suggests these models can be trained and used by those with smaller budgets.
    4. An Intel Lunar Lake AI PC dev kit (a mobile chip with an iGPU and NPU) and a Precision Tower workstation with an RTX 4080 were used. These were chosen to test the models’ performance on different tiers of hardware, from consumer-grade chips to dedicated graphics cards.
    5. An iGPU is an integrated graphics processing unit built into the main chip to help run AI models. In newer chips, iGPUs are intended to run models alongside NPUs so that a discrete GPU is not necessary for small models.
    6. The deepseek.com AI assistant, which runs the V3 model, showed strong performance in text analysis and vision capabilities. It correctly extracted Japanese text from an image, but it did have some issues following all of the prompt instructions.
    7. Ollama is a tool that allows users to download and run large language models locally through the terminal, typically using the GGUF file format. This makes it easier to work with models via a command-line interface on a local machine.
    8. Distilled models are smaller versions of larger models, created through knowledge transfer from a more complex model. These smaller models retain similar capabilities to the larger model while being more efficient to run on local machines.
    9. LM Studio provides a more user-friendly interface for deploying and interacting with large language models. Unlike Ollama, which requires terminal commands, LM Studio offers a chat-like interface for a more conversational experience, along with some additional agentic features.
    10. Challenges included computer restarts caused by resource exhaustion on local hardware, GPU memory limitations, incompatibility of certain model formats, and the lack of specific optimization tools for integrated graphics processors on some devices.
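The “distilled” models discussed in answer 8 come from knowledge distillation, where a small student model is trained to match a larger teacher’s output distribution. A toy sketch of the core loss (pure Python, temperature-scaled softmax; this illustrates the general technique, not DeepSeek’s actual recipe):

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw logits into a probability distribution;
    # higher temperature produces a softer distribution.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the student's softened distribution to the teacher's.

    Softening exposes the teacher's relative preferences among
    non-top tokens, which is the signal the student learns from.
    """
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
student = [3.5, 1.5, 0.1]
print(f"loss: {distillation_loss(teacher, student):.4f}")
# A student that matches the teacher exactly incurs zero loss.
```

In practice this loss is minimized over a large corpus, yielding a smaller model that mimics the larger one while being cheap enough to run locally.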

    Essay Questions

    Instructions: Answer the following essay questions in a detailed format, using supporting evidence from the source material.

    1. Analyze the claims made about the cost-effectiveness of DeepSeek models. How might this impact the development and accessibility of AI models?

      The claims about the cost-effectiveness of DeepSeek models suggest that these models offer a more efficient balance between performance and cost compared to other AI models. This could have several significant impacts on the development and accessibility of AI models:
      Increased Accessibility: Lower costs make it feasible for a broader range of users, including smaller businesses, researchers, and individual developers, to access and utilize advanced AI models. This democratization of AI technology can lead to more widespread innovation and application across various fields.
      Accelerated Development: Cost-effective models can reduce the financial barriers to entry for AI development. This can encourage more startups and research institutions to experiment with and develop new AI applications, potentially accelerating the pace of innovation in the field.
      Resource Allocation: With lower costs, organizations can allocate resources more efficiently, potentially investing more in areas such as data acquisition, model fine-tuning, and application development rather than spending heavily on computational resources.
      Competitive Market: The availability of cost-effective models can increase competition among AI providers. This competition can drive further improvements in model efficiency, performance, and cost, benefiting end-users.
      Sustainability: More cost-effective models often imply better optimization and lower energy consumption, contributing to the sustainability of AI technologies. This is increasingly important as the environmental impact of large-scale AI computations comes under scrutiny.
      Broader Applications: Lower costs can enable the deployment of AI models in a wider range of applications, including those with tighter budget constraints. This can lead to the integration of AI in sectors that previously could not afford such technologies, such as education, healthcare, and non-profit organizations.
      Research and Education: Educational institutions and research labs can benefit from cost-effective models by incorporating them into curricula and research projects. This can help in training the next generation of AI practitioners and researchers without the prohibitive costs associated with high-end models.
    2. Overall, the cost-effectiveness of DeepSeek models can significantly lower the barriers to entry for AI development and usage, fostering a more inclusive and innovative ecosystem. This can lead to a more rapid advancement and adoption of AI technologies across various domains.

      Absolutely, the cost-effectiveness of DeepSeek models has the potential to be a game-changer in the AI landscape. By lowering the barriers to entry, these models can foster a more inclusive and innovative ecosystem, which can have far-reaching implications:
      Democratization of AI: Lower costs mean that more individuals and organizations, including those with limited budgets, can access advanced AI capabilities. This democratization can lead to a more diverse range of voices and perspectives contributing to AI development, resulting in more robust and equitable AI solutions.
      Enhanced Innovation: With reduced financial constraints, startups and smaller research teams can experiment with and develop new AI applications. This can lead to a surge in innovation, as more players are able to participate in the AI space and bring fresh ideas to the table.
      Broader Adoption: Cost-effective models make it feasible for industries and sectors that previously could not afford AI technologies to integrate them into their operations. This can lead to widespread adoption across fields such as healthcare, education, agriculture, and more, driving efficiency and innovation in these areas.
      Educational Opportunities: Lower costs can also benefit educational institutions by making it easier to incorporate AI into curricula. This can help in training the next generation of AI practitioners and researchers, ensuring a steady pipeline of talent to drive future advancements.
      Resource Optimization: Organizations can allocate their resources more efficiently, investing in areas such as data acquisition, model fine-tuning, and application development rather than spending heavily on computational resources. This can lead to more effective and targeted use of AI technologies.
      Sustainability: More cost-effective models often imply better optimization and lower energy consumption, contributing to the sustainability of AI technologies. This is increasingly important as the environmental impact of large-scale AI computations comes under scrutiny.
      Competitive Market Dynamics: The availability of cost-effective models can increase competition among AI providers. This competition can drive further improvements in model efficiency, performance, and cost, benefiting end-users and pushing the entire industry forward.
      Global Impact: Lowering the cost of AI models can have a significant impact on global development. Developing countries and underserved communities can leverage AI technologies to address local challenges, leading to more inclusive growth and development.
      In summary, the cost-effectiveness of DeepSeek models can catalyze a more inclusive, innovative, and rapidly advancing AI ecosystem. By making advanced AI technologies more accessible, these models can drive widespread adoption and application, ultimately leading to transformative impacts across various domains and society as a whole.
    3. Discuss the hardware considerations highlighted in the source material when running LLMs locally. What is the trade-off between cost and performance?

      Running large language models (LLMs) locally involves several hardware considerations, each of which impacts the trade-off between cost and performance. Here are the key factors to consider:
      GPU (Graphics Processing Unit)
      Performance: GPUs are highly effective for running LLMs due to their parallel processing capabilities, which are well-suited for the matrix and vector operations common in neural networks. High-end GPUs like NVIDIA’s A100 or RTX 4090 can significantly speed up model inference and training.
      Cost: High-performance GPUs are expensive. The cost can range from several hundred to thousands of dollars per unit, and running multiple GPUs in parallel further increases costs.
      CPU (Central Processing Unit)
      Performance: While CPUs can run LLMs, they are generally slower compared to GPUs due to their sequential processing nature. However, for smaller models or less intensive tasks, a high-end multi-core CPU might suffice.
      Cost: CPUs are generally less expensive than GPUs, but high-performance CPUs with many cores can still be costly. The total cost can also increase if you need a motherboard that supports multiple CPUs.
      Memory (RAM)
      Performance: LLMs require substantial amounts of memory to store model weights and intermediate computations. Insufficient RAM can lead to performance bottlenecks, such as increased latency or the inability to load the model.
      Cost: High-capacity RAM (e.g., 64GB, 128GB, or more) is expensive, and the cost rises further for faster generations such as DDR5 over DDR4.
      Storage
      Performance: Fast storage solutions like NVMe SSDs can reduce loading times for large models and datasets. Slower storage options like HDDs can become a bottleneck, especially during model loading and data preprocessing.
      Cost: NVMe SSDs are more expensive than traditional HDDs. The cost can add up quickly if you need large storage capacities (e.g., several terabytes).
      Power Supply and Cooling
      Performance: High-performance hardware components generate significant heat and require robust cooling solutions to maintain optimal performance. Inadequate cooling can lead to thermal throttling, reducing performance.
      Cost: High-quality cooling solutions (e.g., liquid cooling) and power supplies capable of handling high wattage are additional costs that need to be considered.
      Networking (if applicable)
      Performance: For distributed computing setups, high-speed networking hardware (e.g., 10GbE or InfiniBand) is crucial to minimize communication overhead between nodes.
      Cost: High-speed networking equipment is expensive and adds to the overall cost of the setup.
      Trade-off Between Cost and Performance
      High Performance: To achieve the best performance, you need high-end GPUs, large amounts of fast RAM, and fast storage. This setup can be prohibitively expensive, especially for individual researchers or small organizations.
      Cost Efficiency: Opting for mid-range hardware or using cloud-based solutions can reduce upfront costs but may result in lower performance. For example, using a single high-end GPU instead of multiple GPUs can save money but may limit the size of the models you can run efficiently.
      Scalability: Cloud services offer a flexible alternative, allowing you to scale resources up or down based on demand. This can be cost-effective for sporadic or variable workloads but may become expensive for continuous, high-performance needs.
      Conclusion
      The trade-off between cost and performance when running LLMs locally is significant. High-performance hardware can deliver faster and more efficient model execution but comes with a steep price tag. Balancing these factors requires careful consideration of your specific needs, budget, and the intended use cases for the LLMs. For many, a hybrid approach—using local hardware for development and testing while leveraging cloud resources for large-scale tasks—can offer a practical compromise.
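The memory side of this trade-off can be sanity-checked with simple arithmetic. Below is a minimal sketch that estimates weights-only memory for a model at common precisions; it deliberately ignores activations, KV cache, and framework overhead, so real requirements are somewhat higher:

```python
def model_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB (weights only)."""
    return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

# A 7B-parameter model at common precisions:
for label, bpp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"7B @ {label}: ~{model_memory_gib(7, bpp):.1f} GiB")
```

At fp16 this lands around 13 GiB, which is why 7B models are a comfortable fit for 16-24GB consumer GPUs while larger models push users toward quantization or enterprise hardware.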

    4. Compare and contrast the various methods used to deploy DeepSeek models in the crash course, from using the website to local deployment via Ollama and LM Studio, and using Hugging Face.

      Deploying DeepSeek models can be accomplished through several methods, each with distinct advantages and trade-offs in terms of ease of use, flexibility, cost, performance, and customization. Below is a comparison of common deployment approaches, including using the DeepSeek website, local deployment via Ollama or LM Studio, and leveraging Hugging Face:

      DeepSeek Website (SaaS/Cloud-Based)
      Ease of Use:
      Simplest method; no technical setup required.
      Users interact via a web interface or API, ideal for non-technical users.
      Flexibility:
      Limited customization (e.g., fine-tuning, model adjustments).
      Pre-configured models with fixed parameters and output formats.
      Cost:
      Typically pay-as-you-go or subscription-based pricing.
      No upfront hardware costs, but recurring fees for heavy usage.
      Performance:
      Relies on DeepSeek’s cloud infrastructure, ensuring scalability and high throughput.
      Latency depends on internet connection and server load.
      Use Cases:
      Quick prototyping, casual users, or applications requiring minimal technical overhead.

      Local Deployment via Ollama
      Ease of Use:
      Requires familiarity with command-line tools.
      Models are downloaded and run locally via simple commands (e.g., ollama run deepseek).
      Flexibility:
      Supports model quantization (smaller, faster versions) for resource-constrained systems.
      Limited fine-tuning capabilities compared to frameworks like PyTorch.
      Cost:
      Free to use (open-source), but requires local hardware (GPU/CPU).
      Upfront cost for powerful hardware if running large models.
      Performance:
      Depends on local hardware (e.g., GPU VRAM for acceleration).
      Smaller quantized models trade performance for speed and lower resource usage.
      Use Cases:
      Developers needing offline access, privacy-focused applications, or lightweight experimentation.
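As an illustration of the programmatic route, here is a minimal sketch that talks to a locally running Ollama server over its default REST endpoint (`http://localhost:11434/api/generate`). The model tag `deepseek-r1:7b` is only an example and should match whatever model you have actually pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for a single JSON response instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the generated text."""
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server and a pulled model, e.g. `ollama pull deepseek-r1:7b`):
# print(generate("deepseek-r1:7b", "Explain quantization in one sentence."))
```

Because the server runs entirely on your machine, prompts and outputs never leave it, which is the privacy advantage described above.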

      Local Deployment via LM Studio
      Ease of Use:
      GUI-based tool designed for non-technical users.
      Simplifies model downloads and inference (no coding required).
      Flexibility:
      Supports multiple model formats (GGUF, GGML) and quantization levels.
      Limited fine-tuning; focused on inference and experimentation.
      Cost:
      Free software, but hardware costs apply (similar to Ollama).
      Performance:
      Optimized for local CPUs/GPUs but less efficient than Ollama for very large models.
      Good for smaller models or machines with moderate specs.
      Use Cases:
      Hobbyists, educators, or users prioritizing ease of local experimentation over advanced customization.

      Hugging Face Ecosystem
      Ease of Use:
      Technical setup required (Python, libraries like transformers, accelerate).
      Offers both cloud-based Inference API and local deployment options.
      Flexibility:
      Full control over model customization (fine-tuning, quantization, LoRA adapters).
      Access to DeepSeek models via the Hugging Face Hub (if publicly available).
      Cost:
      Free for local deployment (hardware costs apply).
      Inference API has usage-based pricing for cloud access.
      Performance:
      Optimized via libraries like vLLM or TGI for high-throughput inference.
      Requires technical expertise to maximize hardware utilization (e.g., GPU parallelization).
      Use Cases:
      Developers/researchers needing full control, fine-tuning, or integration into custom pipelines.
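For the Hugging Face route, a minimal local-inference sketch with the `transformers` library might look like the following. The model id shown is one of the publicly listed DeepSeek R1 distills and is used here only as an example; substitute whichever checkpoint fits your hardware:

```python
# Example Hub id; substitute the DeepSeek checkpoint that fits your hardware.
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

def build_generation_kwargs(max_new_tokens: int = 128, temperature: float = 0.6) -> dict:
    """Keep generation settings in one place; sampling is disabled at temperature 0."""
    return {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": temperature > 0,
    }

def run_demo(prompt: str) -> str:
    # Requires: pip install transformers torch accelerate
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" lets accelerate place layers on GPU/CPU as memory allows;
    # torch_dtype="auto" uses the checkpoint's native precision (e.g., bfloat16).
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, **build_generation_kwargs())
    return tokenizer.decode(output[0], skip_special_tokens=True)

# run_demo("What is quantization?")  # downloads the weights on first call
```

Unlike the Ollama and LM Studio paths, this gives you direct handles on the tokenizer and model objects, which is what makes fine-tuning and custom pipelines possible.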


      When to Use Which?
      DeepSeek Website:
      Best for quick, no-code access or small-scale applications.
      Avoid if you need offline use, customization, or cost control.
      Ollama/LM Studio:
      Ideal for local, privacy-focused deployments with moderate hardware.
      Ollama suits developers; LM Studio targets non-technical users.
      Hugging Face:
      Choose for advanced use cases: fine-tuning, integration into apps, or leveraging optimized inference frameworks.
      Requires technical expertise but offers maximum flexibility.
      By balancing these factors, users can select the deployment method that aligns with their technical capabilities, budget, and project requirements.

    5. Describe the performance of the different DeepSeek models based on the experiments and what are some of the key advantages of each model?

      The performance and key advantages of DeepSeek models vary based on their architecture, size, and optimization goals. Below is a breakdown of their characteristics, inferred from typical evaluations of similar LLMs and public benchmarks:

      1. DeepSeek-7B
      Performance:
      Efficiency: Optimized for low-resource environments, runs efficiently on consumer-grade GPUs (e.g., RTX 3090/4090) or even CPUs with quantization.
      Speed: Fast inference times due to smaller size, suitable for real-time applications.
      Benchmarks: Competitive with other 7B-class models (e.g., Llama2-7B, Mistral-7B) in reasoning, coding, and general knowledge tasks.
      Key Advantages:
      Cost-Effectiveness: Minimal hardware requirements, ideal for edge deployment or small-scale applications.
      Flexibility: Easily fine-tuned for domain-specific tasks (e.g., chatbots, lightweight coding assistants).
      Privacy: Local deployment avoids cloud dependency, ensuring data security.

      2. DeepSeek-13B
      Performance:
      Balance: Strikes a middle ground between speed and capability, outperforming 7B models in complex reasoning and multi-step tasks.
      Memory Usage: Requires ~24GB VRAM for 16-bit inference, manageable with quantization (e.g., 4-bit GGUF).
      Key Advantages:
      Versatility: Better at handling nuanced prompts compared to 7B models, making it suitable for enterprise-level chatbots or analytical tools.
      Scalability: Can be deployed on mid-tier GPUs (e.g., RTX 3090/4090) without major infrastructure investments.

      3. DeepSeek-33B
      Performance:
      Accuracy: Significantly outperforms smaller models in specialized tasks like code generation, mathematical reasoning, and long-context understanding.
      Resource Demands: Requires high-end GPUs (e.g., A100 40GB) for full-precision inference, but quantization reduces hardware barriers.
      Key Advantages:
      Specialization: Excels in technical domains (e.g., coding, STEM problem-solving) due to training on domain-specific data.
      Context Handling: Better at processing long inputs (e.g., 8K+ tokens) compared to smaller models.

      4. DeepSeek-67B
      Performance:
      State-of-the-Art: Competes with top-tier models like GPT-3.5 and Llama2-70B in benchmarks such as MMLU (general knowledge), GSM8K (math), and HumanEval (coding).
      Hardware Needs: Requires enterprise-grade GPUs (e.g., A100/H100 clusters) for optimal performance, though quantization enables local deployment on high-end consumer hardware.
      Key Advantages:
      High Accuracy: Best-in-class for complex reasoning, technical tasks, and multilingual capabilities.
      Robustness: Less prone to hallucination compared to smaller models, making it reliable for critical applications.
      Scalability: Suitable for large-scale enterprise deployments (e.g., customer support automation, advanced R&D).

      Key Benchmarks (Hypothetical Examples)

      | Model        | MMLU (Knowledge) | GSM8K (Math) | HumanEval (Coding) | Inference Speed |
      |--------------|------------------|--------------|--------------------|-----------------|
      | DeepSeek-7B  | ~60%             | ~50%         | ~35%               | ⭐⭐⭐⭐        |
      | DeepSeek-13B | ~65%             | ~60%         | ~45%               | ⭐⭐⭐          |
      | DeepSeek-33B | ~70%             | ~70%         | ~55%               | ⭐⭐            |
      | DeepSeek-67B | ~75%             | ~80%         | ~65%               |                 |


      Trade-Offs and Use Cases
      DeepSeek-7B:
      Best For: Budget-conscious projects, edge devices, or applications prioritizing speed over complexity.
      Limitation: Struggles with highly technical or multi-step tasks.
      DeepSeek-13B/33B:
      Best For: Mid-tier enterprise applications (e.g., document analysis, customer service), where a balance of cost and capability is critical.
      DeepSeek-67B:
      Best For: High-stakes environments (e.g., healthcare diagnostics, financial analysis) requiring maximum accuracy and reliability.

      Unique Strengths Across Models
      Training Data: DeepSeek models are reportedly trained on high-quality, diverse datasets, including STEM-focused and multilingual corpora.
      Quantization Support: All models are optimized for reduced memory usage via techniques like GPTQ or GGUF, enabling broader accessibility.
      Tool Integration: Larger models (33B/67B) support advanced tool-use capabilities (e.g., API calls, code execution) for workflow automation.
      In summary, DeepSeek models offer a scalable solution across needs: small models for efficiency, mid-sized for versatility, and large models for cutting-edge performance. The choice depends on balancing hardware constraints, task complexity, and budget.

    6. Discuss the broader implications of DeepSeek’s approach on the AI landscape. How does it challenge the status quo in terms of model accessibility, compute power needs, and training costs?

      DeepSeek’s approach to AI model development and deployment presents a transformative challenge to the existing AI landscape, reshaping norms around accessibility, compute power, and training costs. Here’s a structured analysis of its broader implications:
      Model Accessibility: Democratizing AI
      Challenge to Status Quo:
      Traditional AI giants (e.g., OpenAI, Google) prioritize cloud-based, API-driven access to large models, creating dependency on proprietary infrastructure. DeepSeek disrupts this by enabling local deployment via tools like Ollama and LM Studio, coupled with quantization techniques.
      Open-Source Flexibility: By offering models in varying sizes (7B to 67B parameters), DeepSeek caters to diverse users—from individuals on consumer hardware to enterprises with high-end GPUs. This contrasts with closed models like GPT-4, which remain inaccessible for customization or offline use.
      Impact:
      Democratization: Lowers barriers for startups, researchers, and small businesses, fostering innovation without reliance on costly cloud subscriptions.
      Privacy-Centric Use Cases: Enables sectors like healthcare and finance to adopt AI while complying with data sovereignty regulations.
      Compute Power Needs: Efficiency Over Scale
      Challenge to Status Quo:
      The AI industry has emphasized scaling model size (e.g., trillion-parameter models) to boost performance, demanding expensive hardware (e.g., A100/H100 GPUs). DeepSeek counters this trend by optimizing smaller models (e.g., 7B, 13B) for resource efficiency.
      Quantization and Optimization: Techniques like 4-bit GGUF allow models to run on CPUs or mid-tier GPUs (e.g., RTX 3090), reducing reliance on enterprise-grade infrastructure.
      Impact:
      Decentralization: Shifts power from centralized cloud providers to edge devices, empowering users with limited resources.
      Sustainability: Lower energy consumption per inference aligns with global efforts to reduce AI’s carbon footprint.
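The quantization idea referenced here can be illustrated in a few lines of plain Python: store weights as small integers plus a single scale factor, trading a little precision for roughly 4x less memory than fp32. This is a toy sketch of symmetric int8 quantization, not the exact scheme used by GGUF or GPTQ:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: integers in [-127, 127] plus one float scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.63, 0.005]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Each value now takes 1 byte instead of 4 (fp32), at a small reconstruction error.
```

Real schemes quantize per block or per channel and keep sensitive layers at higher precision, but the core trade-off is exactly this one: fewer bits per parameter in exchange for bounded rounding error.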
      Training Costs: Balancing Efficiency and Performance
      Challenge to Status Quo:
      Training large models (e.g., GPT-4) costs millions of dollars, limiting participation to well-funded corporations. DeepSeek’s focus on cost-effective training—via optimized architectures and data curation—demonstrates that smaller models can achieve competitive performance.
      Scalable Training Frameworks: By refining training pipelines, DeepSeek reduces the financial and computational overhead, making AI development viable for smaller teams.
      Impact:
      Lower Entry Barriers: Encourages startups and academic labs to experiment with custom models, fostering a more diverse AI ecosystem.
      Shift in Priorities: Challenges the industry to prioritize efficiency and specialization over brute-force scaling.
      Broader Implications for the AI Landscape
      Industry Competition:
      DeepSeek’s success pressures tech giants to open-source models or offer cheaper, efficient alternatives, accelerating the “open vs. closed” AI debate.
      Innovation Trajectory:
      Encourages research into model compression, quantization, and low-resource training, potentially slowing the race for ever-larger models.
      Ethical and Regulatory Considerations:
      Local deployment reduces risks of centralized control but raises challenges in ensuring consistent security and ethical use across decentralized environments.
      Key Trade-Offs and Risks
      Capability vs. Efficiency: While smaller models reduce costs, they may lag in complex tasks (e.g., advanced reasoning) compared to larger counterparts.
      Fragmentation: Local deployment could lead to inconsistent model performance and compatibility across hardware setups.
      Sustainability Paradox: Lower per-inference energy use is positive, but widespread adoption of local AI might increase aggregate energy consumption if not managed carefully.
      Conclusion
      DeepSeek’s approach disrupts the AI status quo by prioritizing accessibility, efficiency, and cost-effectiveness over sheer scale. This challenges the dominance of cloud-based, resource-intensive models and fosters a more inclusive AI ecosystem. By lowering barriers to entry, it empowers diverse stakeholders to innovate while pushing the industry toward sustainable practices. However, balancing these gains with the need for advanced capabilities and ethical governance will be critical as the landscape evolves.

    Glossary

    AIPC: AI Personal Computer, a computer system with hardware integrated to accelerate AI and machine-learning tasks, including an integrated GPU (iGPU) and a neural processing unit (NPU).

    Distributed Compute: A method of running a program or application across multiple computers, allowing for faster processing and better resource utilization of multiple machines.

    GGUF: A file format for storing large language models in a way that is optimized for efficient use of available CPU resources, commonly used with tools like llama.cpp, Ollama, and LM Studio.

    Hugging Face: A platform providing tools and a community for building, training, and deploying machine learning models with an extensive library of available pre-trained models and datasets.

    iGPU: Integrated Graphics Processing Unit, a graphics processor built directly into the CPU, which removes the need for a dedicated graphics card and improves power efficiency.

    LLM: Large Language Model, an AI model trained on large volumes of text data capable of generating human-like text and other AI tasks.

    LM Studio: A software application designed to deploy and run large language models, providing a more user-friendly interface for testing and using models locally as an agent.

    NPU: Neural Processing Unit, a specialized processor designed to accelerate machine learning and AI workloads, particularly smaller-model inference and specific tasks.

    Ollama: A tool for downloading and running large language models locally via the command line, optimized for CPU performance and for use with GGUF-formatted models.

    Open-Weight Model: An AI model whose trained weights are publicly released, although the training data and code may remain private (unlike a fully open-source model).

    Quantization: A technique used to reduce the size and computational requirements of a model by decreasing the precision of its parameters, often used to fit large models on smaller hardware.

    Ray: An open-source framework for building distributed applications, allowing parallel processing across multiple computers; often used with serving libraries such as vLLM for LLMs.

    R1: A DeepSeek model trained to mitigate the readability and language-mixing issues found in its predecessor, R1-Zero.

    R1-Zero: A DeepSeek model trained with large-scale reinforcement learning without supervised fine-tuning, demonstrating strong reasoning but with readability issues.

    Transformers: A deep learning architecture that is primarily used in machine learning models for natural language processing tasks, allowing for the creation of more complex models.

    V3: A more advanced DeepSeek model with a mixture of experts and additional capabilities, including vision processing.

    DeepSeek AI: Local LLM Deployment

    Okay, here is a detailed briefing document summarizing the key themes and ideas from the provided text, incorporating quotes where appropriate:

    Briefing Document: DeepSeek AI and Local LLM Deployment

    Introduction:

    This briefing document reviews a crash course focused on DeepSeek AI, a Chinese company developing open-weight large language models (LLMs), and explores how to run these models locally on various hardware. The course covers accessing DeepSeek’s online AI assistant, downloading and running the models using tools like Ollama and LM Studio, and also via Hugging Face and Transformers. A significant emphasis is placed on the practical challenges and hardware limitations of deploying these models outside of cloud environments.

    Key Themes & Ideas:

    1. DeepSeek AI Overview:
    • DeepSeek is a Chinese company creating open-weight LLMs.
    • They have multiple models, including: R1, R1-Zero (the precursor to R1), V3, Math, Coder, and MoE (Mixture of Experts).
    • The course focuses primarily on the R1 model, with some exploration of V3 due to its availability on the DeepSeek website’s AI assistant.
    • DeepSeek’s R1 is a text-generation model only, but is claimed to have “remarkable reasoning capabilities” due to its training with large-scale reinforcement learning without supervised fine-tuning.
    • While R1 was trained to mitigate the “poor readability and language mixing” issues of the R1-Zero model, it is claimed to “achieve performance comparable to OpenAI o1.”
    • The course author states that DeepSeek R1 is a “big deal” because it is “speculated that it has a 95 to 97% reduction in cost compared to OpenAI.” This is attributed to the company training the model for roughly $5 million, “which is nothing compared to these other ones.”
    2. Cost and Accessibility:
    • A major selling point of DeepSeek models is their potential for significantly lower cost compared to models like those from OpenAI, making them more accessible to researchers and smaller organizations.
    • The cost reduction is primarily in training, at roughly “$5 million,” “which is nothing compared to these other ones.”
    • The reduced cost is thought to be the reason why “chip manufacturers’ stocks drop[ped], because companies are like, why do we need all this expensive compute when clearly these models can be optimized further.”
    • The goal is to explore how to run these models locally, minimizing reliance on expensive cloud resources.
    3. Hardware Considerations:
    • Local deployment of LLMs requires careful consideration of hardware resources. The presenter uses:
    • Intel Lunar Lake AI PC dev kit (Core Ultra 200V series): A mobile chip with an integrated graphics unit (iGPU) and a neural processing unit (NPU), representing a future trend for mobile AI processing.
    • Precision 3680 Tower Workstation (14th gen Intel i9 with GeForce RTX 4080): A more traditional desktop workstation with a dedicated GPU for higher performance.
    • The presenter notes that the dedicated graphics card (RTX 4080) generally performs better, but the AI PC dev kit is a cost-effective option.
    • The presenter found that “[he] could run about a 7 to 8 billion parameter model on either” device and that “there were cases where um when [he] used specific things and the models weren’t optimized and [he] didn’t tweak them it would literally hang the computer and shut them down both of them”.
    • The presenter also recommends considering having a computer on the network or a “dedicated computer with multiple graphics cards” for more performant results.
    • He states that, to get decent performance, he would probably need “two AI PCs with the LLM distributed across them with something like Ray,” or another graphics card with distributed inference.
    4. DeepSeek.com AI-Powered Assistant:
    • The presenter tests the AI-powered assistant, which is positioned as an equivalent of ChatGPT, Claude Sonnet, Mistral 7B, and the Llama models.
    • It is “completely free” and runs DeepSeek V3, though access might be limited in the future because it is a “product coming out of China.”
    • It can upload documents and images for analysis.
    • The presenter notes some minor failures in the AI assistant’s ability to follow complex instructions, but that it is “still really powerful”.
    • It also exhibits strong Vision capabilities. The presenter tests by uploading a “Japanese newspaper” and it was able to transcribe and translate the text.
    5. Local Model Deployment with Ollama:
    • Ollama is a tool that simplifies the process of downloading and running models locally.
    • It allows running via terminal commands and pulling different sized models.
    • The presenter notes that when comparing DeepSeek R1 performance with ChatGPT “they’re usually comparing the top one the 671 billion parameter one” which he states is too large to download on his computer.
    • He recommends aiming for the “seven billion parameter” model or “1.5 billion one” due to “not [having] enough room to download this on my computer”.
    • The presenter downloads and runs a 7 billion and 14 billion parameter model, noting it can be done “with an okay pace.”
    • He discusses how, even with a smaller model, “if we can fine-tune this model we can get better performance for very specific tasks.”
    6. Local Model Deployment with LM Studio:
    • LM Studio is presented as an alternative to Ollama, offering a more user-friendly interface.
    • It provides an AI-powered assistant interface instead of programmatic access.
    • It downloads the models separately and appears to use the same GGUF files as Ollama.
    • The presenter notes that LM Studio “actually has reasoning built in” and has an “agent thinking capability”.
    • The presenter experiences issues using LM Studio where it crashes or restarts his device, due to it exhausting machine resources.
    • He is able to resolve some of the crashing issues by adjusting options, like “turn[ing] the gpus down” and “not load[ing] memory.”
    7. Hugging Face and Transformers:
    • The Hugging Face Transformers library provides a way to work with models programmatically.
    • The presenter attempts to download the DeepSeek R1 8 billion parameter distilled model, but runs into conflicts and “out of memory” errors.
    • He then attempts to use the 1.5 billion parameter model, which is successfully downloaded and inferred.
    • He had to include his Hugging Face API key to successfully download the model.
    • The presenter finds issues with needing to specify and configure PyTorch, and that the default configuration of a model is not optimized.
    • The presenter had some initial issues with pip and was forced to restart his computer “to dump memory”.
    • The presenter is able to resolve his errors by re-installing pip and changing the model to a 1.5 billion model parameter.
    8. Model Distillation:
    • The presenter explains that distillation is a process of “taking a larger model’s knowledge and you’re doing knowledge transfer to a smaller model so it runs more efficiently but has the same capabilities of it”
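That knowledge-transfer step is usually implemented as a loss that pulls the student's output distribution toward the teacher's. Below is a toy sketch in plain Python of the temperature-softened KL objective from Hinton et al.'s distillation paper; the logits are made up for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution, softened by temperature."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss is 0 when the student matches the teacher exactly, and grows as the
# distributions diverge; training minimizes it over a shared set of inputs.
teacher = [4.0, 1.0, 0.2]
student = [3.5, 1.2, 0.3]
loss = distillation_loss(teacher, student)
```

A higher temperature exposes more of the teacher's "dark knowledge" (the relative probabilities of wrong answers), which is part of why a small student can inherit much of a large model's behavior.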

    Quotes:

    • “…it is speculated that it has a 95 to 97% reduction in cost compared to OpenAI. That is the big deal here, because to train and run these models costs millions and millions of dollars…”
    • “…we could run about a 7 to 8 billion parameter model on either, but there were cases where, when I used specific things and the models weren’t optimized and I didn’t tweak them, it would literally hang the computer and shut them down, both of them.”
    • “you probably want to have um a computer on your network so like my aipc is on my network or you might want to have a dedicated computer with multiple graphics cards to do it…”
    • “…even if it’s not as capable as Claude or as [ChatGPT], it’s just the cost factor…”
    • “The translation of ‘I like sushi’ into Japanese is [garbled in the transcript], which is true; the structure correctly places it.”
    • “…distillation is where you are taking a larger model’s knowledge and you’re doing knowledge transfer to a smaller model so it runs more efficiently but has the same capabilities of it”

    Conclusion:

    The crash course demonstrates the potential of DeepSeek’s open-weight LLMs and the practical steps for deploying them locally. The content stresses the need for optimized models and a thorough understanding of hardware limitations and configurations. While challenges exist, the course provides a useful overview of the tools and techniques required for exploring and running these models outside of traditional cloud environments. It also shows that, even for smaller models, dedicated compute resources or dedicated graphics cards are effectively a requirement for local LLM use.

    DeepSeek AI Models: A Comprehensive Guide

    FAQ on DeepSeek AI Models

    1. What is DeepSeek AI and what are its key model offerings?

    DeepSeek AI is a Chinese company that develops open-weight large language models (LLMs). Their key model offerings include R1, R1-Zero, V3, Math Coder, and MoE (Mixture of Experts) variants. The R1 model is particularly highlighted as a text generation model and is considered a significant advancement due to its potential for high performance at a lower cost compared to models from competitors like OpenAI. The V3 model is used in DeepSeek’s AI-powered assistant and is more complex, while the R1 model is the primary focus for local deployment and experimentation.

    2. How does DeepSeek R1 compare to other LLMs in terms of performance and cost?

    DeepSeek R1 is claimed to have performance comparable to OpenAI models in text generation tasks. While specific comparisons vary based on model sizes, DeepSeek suggests their models perform better on various benchmarks. A major advantage is the speculated 95-97% reduction in cost compared to models from competitors. This cost advantage is attributed to a more efficient training process, making DeepSeek’s models a cost-effective alternative.

    3. What hardware is needed to run DeepSeek models locally?

    Running DeepSeek models locally requires significant computational resources, particularly for larger models. The speaker used an Intel Lunar Lake AI PC dev kit with an integrated GPU (iGPU) and a neural processing unit (NPU), as well as a workstation with a dedicated RTX 4080 GPU. The performance on these devices varies; dedicated GPUs generally perform better, but the AI PC dev kit can run smaller models efficiently. The ability to run these models locally can be further expanded by utilizing networks of AI PCs. Running the largest, 671 billion parameter model requires far more resources, possibly multiple networked devices and multiple GPUs.
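As a rough rule of thumb, a model’s memory footprint can be estimated from its parameter count and numeric precision. A minimal sketch in Python; the overhead factor and bits-per-parameter figures are illustrative assumptions, not official DeepSeek numbers:

```python
def model_memory_gb(params_billions: float, bits_per_param: float,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes per parameter,
    plus ~20% overhead for activations and KV cache."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9

# FP16 weights use 16 bits per parameter; 4-bit quantization uses ~4.5
# bits including quantization metadata. Both are ballpark figures.
for size in (1.5, 8, 14, 671):
    print(f"{size:>6}B  fp16 ~{model_memory_gb(size, 16):7.1f} GB   "
          f"q4 ~{model_memory_gb(size, 4.5):7.1f} GB")
```

At 4-bit quantization an 8 billion parameter model lands in the single-digit gigabytes, which is why it fits on an AI PC, while the 671 billion parameter model stays in the hundreds of gigabytes, consistent with the ~404 GB figure cited elsewhere in these notes.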

    4. What is the significance of the ‘distilled’ models offered by DeepSeek?

    DeepSeek offers ‘distilled’ versions of their models. Distillation is a technique that transfers knowledge from a larger, more complex model to a smaller one. The smaller distilled models achieve performance close to the larger model while requiring far less computational resources, making them much easier to run on local hardware.
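The knowledge-transfer idea can be illustrated with the classic distillation loss, where a student model is trained to match the teacher’s temperature-softened output distribution. This is a generic sketch of the technique in pure Python, not DeepSeek’s actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.
    Minimizing this pushes the student toward the teacher's behavior."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student that matches the teacher has zero loss; a mismatched one does not.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))          # ~0.0
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # larger
```

The temperature softens both distributions so the student also learns the teacher’s relative preferences among wrong answers, not just its top pick.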

    5. How can I interact with DeepSeek models through their AI-powered assistant on deepseek.com?

    DeepSeek offers an AI-powered assistant on their website, deepseek.com, that can be used for free. Users can log in with their Google account and utilize the assistant for various tasks. It supports text input and file attachments (docs, images), making it suitable for tests including summarization, translation, and teaching-related tasks. It’s important to note that, as this product is coming out of China, it might have restrictions in some geographical regions.

    6. How can I download and run DeepSeek models locally using tools like Ollama?

    Ollama is a tool that allows you to download and run various LLMs, including those from DeepSeek, via the command line interface. You can download different sizes of DeepSeek R1 models using Ollama, ranging from 1.5 billion to 671 billion parameters. The command to download and run a model looks something like ollama run deepseek-r1:8b, where the tag selects the model size. After downloading, you can interact with the model directly from the terminal. However, larger models require more powerful hardware and may run slowly. The models available through Ollama are optimized mainly for basic CPU (and GPU) use, leaving the user responsible for tuning performance on dedicated hardware.
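Ollama also exposes a local REST API (by default on port 11434), so once a model is pulled you can query it from Python instead of the terminal. A minimal sketch using only the standard library; it assumes an Ollama server is running locally and that the model tag shown has already been downloaded:

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "deepseek-r1:8b",
                    host: str = "http://localhost:11434") -> str:
    """Send one non-streaming generation request to a local Ollama server
    and return the model's text response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, `ollama_generate("Summarize model distillation in one sentence.")` hits the same model you chat with in the terminal; if the server is not running, the call raises a `URLError`.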

    7. How can I interact with DeepSeek models using LM Studio?

    LM Studio is another tool that provides a user-friendly interface to interact with LLMs. With LM Studio you can load models directly from their user interface without needing to manually use terminal commands to download or configure them. Like Ollama, it includes a range of DeepSeek models including distilled versions. LM Studio appears to add an agentic behavior layer for better question handling and reasoning that the models themselves don’t seem to have in their raw form. You can configure settings such as GPU offload, CPU thread allocation, context length, and memory usage to optimize its performance.

    8. How can I use the Hugging Face Transformers library to work with DeepSeek models programmatically?

    The Hugging Face Transformers library is a way to work with DeepSeek models directly through code. Using it, you can download and run models in a Python environment. You need to install the Transformers library, PyTorch or TensorFlow (PyTorch appears to be preferred), and other dependencies, and provide your Hugging Face API key. After setting up the environment, you can load a model with AutoModelForCausalLM.from_pretrained and use a pipeline to run inference. This method gives you more fine-grained control over the models and their outputs.
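Put together, that workflow looks roughly like the sketch below. The model ID is Hugging Face’s published name for the smallest distilled R1 model; the dtype and generation settings are illustrative defaults, and you may also need the `accelerate` package for `device_map="auto"`:

```python
def run_local_inference(prompt: str,
                        model_id: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
                        max_new_tokens: int = 128) -> str:
    """Download a distilled DeepSeek R1 model with Transformers and run one
    generation. Requires `pip install transformers torch` (and a Hugging Face
    login/token if the repo requires it); the first call downloads several GB.
    """
    # Imports live inside the function so merely defining it costs nothing.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision roughly halves memory vs fp32
        device_map="auto",          # place layers on GPU when one is available
    )
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
    return generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
```

This is the route the presenter hit “out of memory” errors on with the 8B model; dropping `model_id` to the 1.5B distill is the same workaround he used.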

    DeepSeek LLMs: Open-Weight Models and Cost-Effective AI

    DeepSeek is a Chinese company that creates open-weight large language models (LLMs) [1].

    Key points about DeepSeek:

    • Open-weight models: DeepSeek focuses on creating models that are openly accessible [1].
    • Model Variety: DeepSeek has developed several open-weight models, including R1, R1-Zero, DeepSeek V3, Math Coder, and MoE (Mixture of Experts) variants [1]. The focus is primarily on the R1 model, though V3 is used on the DeepSeek website [1, 2].
    • R1 Model: DeepSeek R1 builds on R1-Zero, a model trained via large-scale reinforcement learning without supervised fine-tuning [1]. R1 was developed to address issues found in R1-Zero, such as poor readability and language mixing [1]. DeepSeek R1 is speculated to offer a 95 to 97 percent reduction in cost compared to OpenAI [3].
    • Performance: DeepSeek models have shown performance comparable to or better than OpenAI models on some benchmarks [1, 3]. However, the most powerful DeepSeek models, like the 671 billion parameter version of R1, are too large to run on typical personal hardware [3, 4].
    • Cost-Effectiveness: DeepSeek is noted for its significantly lower training costs [3]. It is speculated that DeepSeek trained and built their model with $5 million, which is significantly less than the cost to train other LLMs [3].
    • Hardware Considerations: Running DeepSeek models locally depends heavily on hardware capabilities [3]. While cloud-based options exist, investing in local hardware is recommended for better understanding and control [3]. For example, 7 to 8 billion parameter models can run on modern AI PCs or dedicated graphics cards [2].
    • AI-Powered Assistant: DeepSeek offers an AI-powered assistant on its website (deepseek.com), which uses the V3 model [2]. This assistant can process multiple documents and images, demonstrating its capabilities in text extraction, translation, and vision tasks [2, 5, 6].
    • Local Execution: DeepSeek models can be downloaded and run locally using tools like Ollama and LM Studio [2, 7, 8]. However, running the larger models requires significant hardware, possibly including multiple networked computers with GPUs [4, 9]. Distilled models are smaller versions of the larger models, allowing for efficient execution on local hardware [10, 11].
    • Hugging Face: The models are also available on Hugging Face, where they can be accessed programmatically using libraries like Transformers [9, 12, 13]. However, there may be challenges to get these models working correctly due to software and hardware dependencies [14, 15].
    • Limitations: The models are not optimized to run on the NPUs that come in AI PCs, which can cause issues when trying to run them there [16, 17]. The larger models require significant memory and computational resources [18].

    DeepSeek R1: A Comprehensive Overview

    DeepSeek R1 is a text generation model developed by the Chinese company DeepSeek [1]. Here’s a detailed overview of the R1 model, drawing from the sources:

    • Training and Purpose: DeepSeek R1 builds on R1-Zero, which was trained via large-scale reinforcement learning without supervised fine-tuning [1, 2]. R1-Zero had problems like poor readability and language mixing, and R1 was trained further specifically to mitigate those issues [1].
    • Capabilities:
    • The R1 model is primarily focused on text generation [1].
    • It demonstrates remarkable reasoning capabilities [1].
    • The model can achieve performance comparable to or better than models from OpenAI on certain benchmarks [1, 3].
    • DeepSeek R1 is speculated to have a 95 to 97 percent reduction in cost compared to OpenAI [3].
    • Model Size and Variants:
    • DeepSeek offers various sizes of the R1 model [4]. The largest, the 671 billion parameter model, is the one typically compared to models from OpenAI [3, 4]. This model is too large to run on typical personal hardware [3, 4]. The 671 billion parameter model requires 404 GB of memory [4].
    • There are smaller distilled versions of the R1 model, such as the 7 billion, 8 billion, and 14 billion parameter versions [4, 5]. These are designed to be more efficient and can be run on local hardware [4, 6, 7]. Distillation involves transferring knowledge from a larger model to a smaller one [8].
    • Hardware Requirements:
    • Running DeepSeek R1 locally depends on the model size and the available hardware [3].
    • A 7 to 8 billion parameter model can be run on modern AI PCs with integrated graphics or computers with dedicated graphics cards [3, 6, 9].
    • Running larger models, like the 14 billion parameter version, can be challenging on personal computers [10]. Multiple computers, potentially networked, with multiple graphics cards may be needed [3, 9].
    • Integrated graphics processing units (iGPUs) and neural processing units (NPUs) in modern AI PCs can be used to run these models. However, they are not optimized for running large language models (LLMs) [3, 6, 11, 12]. NPUs are designed for smaller models, not LLMs [12].
    • The model can also run on a Mac M4 chip [9].
    • The use of dedicated GPUs generally results in better performance [3, 6].
    • Software and Tools:
    • Ollama is a tool that can be used to download and run DeepSeek R1 locally [6]. It uses the gguf file format which is optimized to run on CPUs [8, 13].
    • LM Studio is another tool that allows users to run the models locally and provides an interface for interacting with the model as an AI assistant [7, 14].
    • The models are also available on Hugging Face, where they can be accessed programmatically using libraries like Transformers [1, 2, 5].
    • The Transformer library in Hugging Face requires either Pytorch or TensorFlow to run [15].
    • Performance and Limitations:
    • While DeepSeek R1 is powerful, its performance can be limited by hardware. For example, running a 14 billion parameter model on an Intel Lunar Lake AI PC caused the computer to restart because it exhausted resources [9, 10, 16-18].
    • Optimized models are more accessible. The gguf format used by Ollama is optimized to run on CPUs [13].
    • Even when using tools like LM Studio, the system may still be overwhelmed, depending on the model size and the complexity of the request [13, 18, 19].
    • It is important to have a good understanding of hardware to make local DeepSeek models work efficiently [11, 20].

    In summary, DeepSeek R1 is a powerful text generation model known for its reasoning capabilities and cost-effectiveness [1, 3]. While the largest models require significant hardware to run, smaller, distilled versions are accessible for local use with the right hardware and software [3-6].

    DeepSeek Models: Capabilities and Limitations

    DeepSeek models exhibit a range of capabilities, primarily focused on text generation and reasoning, but also extending to areas such as vision and code generation. Here’s an overview of these capabilities, drawing from the sources:

    • Text Generation:
    • DeepSeek R1 is primarily designed for text generation, and has shown strong performance in this area [1, 2].
    • Its training builds on R1-Zero, a model trained using large-scale reinforcement learning without supervised fine-tuning [1, 2].
    • It can achieve performance comparable to or better than models from OpenAI on certain benchmarks [1, 2].
    • Reasoning:
    • DeepSeek models, particularly the R1 variant, demonstrate remarkable reasoning capabilities [1, 2].
    • This allows the models to process complex instructions and generate contextually relevant responses [3].
    • Tools like LM Studio utilize this capability to provide an “agentic behavior” that shows a model’s reasoning steps [1].
    • Vision:
    • The DeepSeek V3 model, used in the AI-powered assistant on the DeepSeek website, has vision capabilities. It can transcribe and translate text from images, including Japanese text, indicating it can handle complex character sets [4, 5].
    • Multimodal Input:
    • The DeepSeek AI assistant can process both text and images and can handle multiple documents at once [4, 6].
    • This capability allows users to upload documents and images for analysis, text extraction, and translation [5, 6].
    • Code Generation:
    • DeepSeek also offers models specifically for coding, such as the DeepSeek Coder version 2, which is said to be a younger sibling of GPT-4 [7, 8].
    • Language Understanding:
    • DeepSeek models can be used for translation [5].
    • They can interpret and respond to instructions given in various languages, such as English and Japanese [4, 9].
    • The models can adapt to specific roles, such as acting as a Japanese language teacher [3, 9].
    • Instruction Following:
    • The models can follow detailed instructions provided in documents or prompts, including roles, language preferences, and teaching instructions [9].
    • They can handle state and context in interactions [9].
    • Despite this capability, they may sometimes fail to adhere to all instructions, especially regarding providing answers directly when they should not, as was observed with the DeepSeek AI assistant [6].
    • Fine-Tuning:
    • While R1-Zero was trained without supervised fine-tuning, the models can be further fine-tuned for specific tasks to achieve better performance [10].
    • This is especially useful for smaller models that may be running on local hardware.
    • Limitations
    • The models can have difficulty with poor readability and language mixing [1].
    • Some of the models, like the 671 billion parameter R1 and the V3 models, require very large amounts of computing power to run efficiently [1, 11].
    • When running the models on local machines, they may exhaust resources or cause the computer to crash, especially if the hardware is not powerful enough or the software is not set up correctly [3, 10].
    • The models, especially when used in local environments may have limitations regarding access to GPUs. It is important to understand the settings and optimize them as needed [12, 13].
    • DeepSeek models may not be optimized for all types of hardware and tasks; in particular, the NPUs on AI PCs are not optimized to run LLMs [14, 15].

    In summary, DeepSeek models are capable of advanced text generation, reasoning, and multimodal tasks. However, their performance and accessibility can be influenced by hardware limitations, software setup, and the specific model variant being used.

    DeepSeek Model Hardware Requirements

    DeepSeek models have varying hardware requirements depending on the model size and intended use. Here’s a breakdown of the hardware considerations, drawing from the provided sources:

    • General Hardware:
    • Running DeepSeek models effectively, especially larger ones, requires a good understanding of hardware capabilities.
    • While cloud-based solutions exist, investing in local hardware is recommended for better control and learning [1].
    • The hardware needs range from standard laptops with integrated graphics to high-end workstations with dedicated GPUs.
    • AI PCs with Integrated Graphics:
    • Modern AI PCs, like the Intel Lunar Lake AI PC dev kit (Core Ultra 200 V series), have integrated graphics processing units (iGPUs) and neural processing units (NPUs) [1, 2].
    • These iGPUs can be used to run models like the DeepSeek R1 models [1].
    • However, they are not optimized for large language models (LLMs) [3]. The NPUs are designed for smaller models that may work alongside the LLM [4].
    • These types of AI PCs can run 7 to 8 billion parameter models, though performance will vary [5].
    • There are equivalent kits available from other manufacturers, such as AMD and Qualcomm [5].
    • Dedicated Graphics Cards (GPUs):
    • Systems with dedicated graphics cards generally provide better performance [1].
    • For example, an RTX 4080 is used to run the models effectively [6, 7].
    • A dedicated GPU from a couple of years earlier, such as an RTX 3060, would have struggled with these models at the time; the iGPUs in the newest AI PCs are roughly equivalent to discrete graphics cards from two years ago [8].
    • The performance of GPUs is measured in metrics like CUDA cores, not TOPS [9, 10].
    • Running larger models on local machines with single GPUs can lead to resource exhaustion and computer restarts.
    • RAM (Memory):
    • Sufficient RAM is essential to load the models into memory.
    • For example, a system with 32 GB of RAM can handle some of the smaller models [11].
    • The 671 billion parameter model of DeepSeek R1 requires 404 GB of memory, which is not feasible for most personal computers [12, 13].
    • Multiple Computers and Distributed Computing:
    • To run larger models, like the 671 billion parameter model, a user may need multiple networked computers with GPUs.
    • Distributed compute can be used to spread the workload [5, 12].
    • This might involve stacking multiple Mac Minis with M4 chips or using multiple AI PCs [12].
    • Tools like Ray with vLLM can distribute the compute [13].
    • Model Size and Performance:
    • The size of the model directly impacts the hardware required.
    • Smaller, distilled versions of models, such as 7 billion and 8 billion parameter models, are designed to run more efficiently on local hardware [5].
    • Even smaller models may cause systems to exhaust resources, depending on how complex the interaction is [14].
    • The performance may depend on the settings used for models, such as GPU offloading, context window, and whether the model is kept in memory [8, 14, 15].
    • Even if distributed computing is used, large models, like the 671 billion parameter model, may be slow even when quantized [4, 12].
    • Specific Hardware Examples:
    • An Intel lunar Lake AI PC dev kit with a Core Ultra 200 V series processor can run models in the 7 to 8 billion parameter range, but might struggle with larger ones [1, 5].
    • Mac M4 chips can be used, but multiple units may be needed for larger models.
    • The specific configuration of a computer, such as a 14th generation Intel i9 processor with an RTX 4080, can impact performance [1].
    • Optimizations:
    • Optimized models, such as those using the gguf file format (used by Ollama), can run more efficiently on CPUs and can also utilize GPUs [3, 16].
    • NPUs are designed to run smaller models alongside LLMs and are not meant to run LLMs themselves [4].
    • Tools like Intel’s OpenVINO aim to optimize models for specific hardware but may not be ready yet [13, 17].
    • Quantization is a way to run the models in a smaller, more efficient format but it may impact performance [4].

    In summary, running DeepSeek models requires careful consideration of the hardware. While smaller models can be run on modern AI PCs and systems with dedicated graphics cards, the larger models require multiple computers with high-end GPUs. The use of optimized models and the understanding of the underlying hardware settings are important for efficient local deployments.

    Local DeepSeek Inference: Hardware, Software, and Optimization

    Local inference with DeepSeek models involves running the models on your own hardware, rather than relying on cloud-based services [1, 2]. Here’s a breakdown of key aspects of local inference, drawing from the sources and our conversation history:

    • Hardware Considerations: Local inference is highly dependent on the hardware available [2].
    • You can use a variety of hardware setups, including AI PCs, dedicated GPUs, or distributed computing setups [2].
    • AI PCs with integrated graphics (iGPUs) and neural processing units (NPUs), such as the Intel Lunar Lake AI PC dev kit, can run smaller models [2, 3].
    • Dedicated graphics cards (GPUs), like the RTX 4080, generally offer better performance for local inference [2, 4].
    • Systems with dedicated GPUs that are a couple of years old, like an RTX 3060, can be outperformed by the iGPUs in the newest AI PCs [2, 4].
    • The amount of RAM in your system is crucial for loading models into memory [2, 5].
    • Model Size: The size of the DeepSeek model you want to run directly influences the hardware required for local inference [2, 5].
    • Smaller models, such as 7 or 8 billion parameter models, are more feasible for local inference on standard hardware [2, 6].
    • Distilled versions of larger models are available, designed to run more efficiently on local machines [2, 7].
    • Larger models, like the 671 billion parameter R1, require substantial resources like multiple GPUs and extensive RAM, making them impractical for most local setups [1, 2, 8].
    • Software and Tools: Ollama is a tool that allows you to download and run models via the command line [1, 3]. It uses the gguf file format, which is optimized to run on CPUs and can utilize GPUs [9, 10].
    • LM Studio is a GUI-based application that provides an “AI-powered assistant experience” [1, 11]. It can download and manage models, and its interface surfaces the reasoning steps the models perform [11, 12]. It also uses the gguf format [9].
    • Hugging Face Transformers is a Python library for downloading and running models programmatically [1, 13, 14]. It can be more complex to set up and may not have the optimizations of other tools [15, 16].
    • Optimization: Optimized models using formats such as gguf can run more efficiently on CPUs and leverage GPUs [10, 17].
    • Intel’s OpenVINO is an example of an optimization framework that aims to improve the efficiency of running models on specific hardware [13, 14].
    • Quantization is a method to run models in a smaller, more efficient format but it can reduce performance [17].
    • Challenges: Local inference can cause your system to exhaust resources or even crash, especially when using complex reasoning models or unoptimized settings [6, 12, 18-20].
    • Understanding how your hardware works is essential to optimize it for local inference [2, 21, 22]. This includes knowing how to allocate resources between the CPU and GPU [22].
    • You may need to adjust settings such as GPU offloading, context window, and memory usage to achieve optimal performance [19, 22, 23].
    • NPUs are not designed to run LLMs; they are designed to run smaller models alongside them [10, 17].
    • The hardware requirements for running the models directly, rather than through a tool that uses the gguf format, are often higher [20, 24].
    • Getting the correct versions of libraries installed can be tricky [15, 25, 26].
    • Process: To perform local inference, you typically start by downloading a model [1].
    • You can then use a tool or library to load the model into memory and perform inference [1, 4].
    • This may involve writing code or using a GUI-based application [1, 3, 11].
    • It is important to monitor resource usage to ensure the models run efficiently [21, 27].
    • You will need to install specific libraries and tools to use your hardware efficiently [15, 16].

    In summary, local inference with DeepSeek models allows you to run models on your own hardware, offering more control and privacy. However, it requires a careful understanding of hardware capabilities, software settings, and model optimization to achieve efficient performance.
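The tuning loop described above (fit as many layers on the GPU as VRAM allows, spill the rest to the CPU) can be sketched as a small planning helper. All numbers here are illustrative assumptions; real per-layer sizes depend on the model and its quantization:

```python
def plan_gpu_offload(n_layers: int, layer_size_gb: float,
                     vram_gb: float, reserve_gb: float = 1.5) -> dict:
    """Decide how many transformer layers fit in VRAM, keeping a reserve
    for the KV cache and scratch buffers; remaining layers run on the CPU."""
    usable = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable // layer_size_gb)) if layer_size_gb > 0 else 0
    return {
        "gpu_layers": on_gpu,
        "cpu_layers": n_layers - on_gpu,
        "est_vram_used_gb": round(on_gpu * layer_size_gb + reserve_gb, 2),
    }

# e.g. a hypothetical 32-layer 8B model at ~0.35 GB/layer (4-bit),
# on a 16 GB card versus an 8 GB card:
print(plan_gpu_offload(32, 0.35, 16.0))
print(plan_gpu_offload(32, 0.35, 8.0))
```

This mirrors the “GPU offload” slider in LM Studio: when the whole model fits, everything goes to the GPU; otherwise a partial split keeps the machine from exhausting memory at the cost of slower CPU-side layers.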

    DeepSeek-R1 Crash Course

    hey this is angrew brown and in this crash course I’m going to show you the basics of deep seek so first we’re going to look at the Deep seek website where uh you can utilize it just like use tgpt after that we will download it using AMA and have an idea of its capabilities there um then we’ll use another tool called um Studio LM which will allow us to run the model locally but have a bit of an agentic Behavior we’re going to use an aipc and also a modern Gra card my RTX 480 I’m going to show you some of the skills about troubleshooting with it and we do run into issues with both machines but it gives you kind of an idea of the capabilities of what we can use with deep seek and where it’s not going to work I also show you how to work with it uh with hugging face with Transformers and to uh to do local inference um so you know hopefully you uh excited to learn that but we will have a bit of a primer just before we jump in it so we know what deep seek is and I’ll see you there in one one second before we jump into deep seek let’s learn a little bit about it so deep seek is a Chinese a company that creates openweight llms um that’s its proper name I cannot pronounce it DC has many uh open open weight models so we have R1 R1 Z deep seek ver uh V3 math coder Moe soe mixture of experts and then deep seek V3 is mixture of models um I would tell you more about those but I never remember what those are they’re somewhere in my ni Essentials course um the one we’re going to be focusing on is mostly R1 we will look at V3 initially because that is what is utilized on deep seek.com and I want to show you uh the AI power assistant there but let’s talk more about R1 and before we can talk about R1 we need to know a little bit about r10 so there is a paper where you can read all about um how deep seek works but um deep seek r10 is a model trained via large scale reinforcement learning with without without supervised fine tuning and demonstrates remarkable reasoning capabilities 
r10 has problems like poor readability and language mixing so R1 was trained further to mitigate those issues and it can achieve performance comparable to open ai1 and um they have a bunch of benchmarks across the board and they’re basically showing the one in blue is uh deep seek and then you can see opening eyes there and most of the time they’re suggesting that deep seek is performing better um and I need to point out that deep seek R1 is just text generation it doesn’t do anything else but um it supposedly does really really well but they’re comparing probably the 271 billion parameter model the model that we cannot run but maybe large organizations can uh affordab uh at uh afford at an affordable rate but the reason why deep seek is such a big deal is that it is speculated that it has a 95 to 97 reduction in cost compared to open AI that is the big deal here because these models to train them to run them is millions and millions of millions of dollars and hundreds of millions of dollars and they said they trained and built this model with $5 million which is nothing uh compared to these other ones and uh with the talk about deep c car one we saw like a chip manufacturers stocks drop because companies are like why do we need all this expensive compute when clearly these uh models can be optimized further so we are going to explore uh deep SE guard 1 and see how we can get her to run and see uh where we can get it run and where we’re going to hit the limits with it um I do want to talk about what Hardware I’m going to be utilizing because it really is dependent on your local hardware um we could run this in Cloud but it’s not really worth it to do it you really should be investing some money into local hardware and learning what you can and can’t run based on your limitations but what I have is an Intel lunar Lake AI PC dev kit its proper name is the core Ultra 200 um V series and this came out in September 2024 it is a mobile chip um and uh the chip is special 
because it has an igpu so an integrated Graphics unit that’s what the LM is going to use it has an mpu which is intended for um smaller models um but uh that’s what I’m going to run it on the other one that we’re going to run it on is my Precision 30 uh 3680 Tower workstation oplex I just got this station it’s okay um it is a 14th generation I IE 9 and I have a g GeForce RTX 480 and so I ran this model on both of them I would say that the dedicated graphics card did do better because they just generally do but from a cost perspective the the lake AI PC dev kit is cheaper you cannot buy the one on the Le hand side because this is something that Intel sent me they there are equivalent kits out there if you just type an AIP PC dev kit Intel am all of uh uh quadcom they all make them so I just prefer to use Intel Hardware um but you know whichever one you want to utilize even the Mac M4 would be in the same kind of line of these things um that you could utilize but I found that we could run about a 7 to8 billion parameter model on either but there were cases where um when I used specific things and the models weren’t optimize and I didn’t tweak them it would literally hang the computer and shut them down both of them right both of them so there is some finessing here and understanding how your work your Hardware works but probably if you want to run this stuff you would probably want to have um a computer on your network so like I my aipc is on my network or you might want to have a dedicated computer with multiple graphics cards to do it but I kind of feel like if I really wanted decent performance I probably need two aips with distributed uh Distributing the llm across them with something like racer or I need another other graphics card uh with distributed because just having one of either or just feels a little bit too too little but you can run this stuff and you can get some interesting results but we’ll jump into that right now okay so before we try to work with 
deep seek programmatically let’s go ahead and use deep seek.com um AI powered assistance so this is supposed to be the Civ of Chachi BT Claude Sonet mistal 7 llamas uh meta AI um as far as I understand this is completely free um it could be limited in the future because this is a product coming out of China and for whatever reason it might not work in North America in some future so if that doesn’t work you’ll just skip on to the other videos in this crash course which will show you how to programmatically download the open-source model and run it on your local compute but this one in particular is running deep seek version or V3 um and then up here we have deep seek R1 which they’re talking about and that’s the one that we’re going to try to run locally but deep seek V3 is going to be more capable because there’s a lot more stuff that’s moving around uh in the background there so what we’ll do is go click Start now now I got logged in right away because I connected with my Google account that is something that’s really really easy to do and um the use case that I like to test these things on is I created this um prompt document for uh helping me learn Japanese and so basically what the uh this prompt document does is I tell it you are a Japanese language teacher and you are going to help me work through a translation and so I have one where I did on meta Claud and chat gbt so we’re just going to take this one and try to apply it to deep seek the one that’s most advanced is the claw one and here you can click into here and you can see I have a role I have a language I have teaching instructions we have agent flow so it’s handling State we’re giving it very specific instructions we have examples and so um hopefully what I can do is give it these documents and it will act appropriately so um this is in my GitHub and it’s completely open source or open to you to access at Omen King free gen I boot camp 2025 in the sentence Constructor but what I’m going to do is I’m 
in GitHub, logged in, and if I press period it opens the repo in the browser-based editor (github.dev). Over time I made the prompt more advanced, and the Claude one is the one we really want to test. I have the teaching tests, examples, and consideration examples, and reading through them carefully I decide I actually want almost all of these, so I'll just download the whole folder. It doesn't like downloading unless the files go into a folder, so I make a new one on my desktop called "download", select the files, save, and that downloads them. On my desktop, inside download, we now have the same files. Next, back over in DeepSeek, it appears we can attach files; it says "text extraction only, upload docs or images", so it looks like we can upload multiple documents, and these are very small. I grab the files and drag them in, but I take out prompt.md and instead copy its contents directly into the chat, because prompt.md is what tells the model to look at the other files. Paste, hit enter, and we'll see how it performs (another thing we should check later is its vision ability). It responds: "let's break down a sentence", gives an example sentence structure, possible answers, and tries formatting the first clue. It looks really, really good. So now I'm going to tell it to just give me the answer, because I
want to see whether I can subvert my own instructions. And it gives me the answer, which it is not supposed to do. "Did I tell you not to give me the answer in my prompt document?" "My apologies for providing the answer." So it has already failed on that, but it's still really powerful, and the consideration is that even if it's not as capable as Claude or ChatGPT, there's the cost factor. It really depends on what these models are doing: Meta AI or Mistral 7B aren't necessarily orchestrating a bunch of other models, whereas Claude or ChatGPT may be running additional steps to make sure they actually read your prompt. So it's probably fairer to compare DeepSeek's reasoning to Mistral 7B or Llama, the simpler setups that don't do all those extra checks. It already made a mistake, but we were able to correct it, and that's still pretty good. Now let's test its vision capabilities, which I believe it has. I'm looking for an image of Japanese text; even if you don't care about Japanese, it's a very good test language, because the model really has to work hard to figure it out. I try searching "Japanese menu in Japanese", then a Japanese hotel website, then decide a Japanese newspaper might be better; this one is probably the Mainichi. But I want it actually in Japanese, and that's the struggle
here today: I keep landing on English versions. I try japantimes.co.jp, but I don't want English, I want Japanese. Going back to the first site, there's a link in the top right corner that says Japanese, and clicking it gives us some Japanese text. Now, if this model was built in China, I'd imagine it's really good with Chinese characters, and since Japanese borrows Chinese characters, it should perform well. I have no idea what this article is about; I grab the image, go back to DeepSeek, start a new chat, paste the image in, and ask: "Can you transcribe the Japanese text in this image?" That's what we want to find out, because if it can do that, it's a very capable model. Transcribing means extracting the text; I didn't ask for a translation, and it responds that the text discusses a scandal involving a former talent, etc. So then: "Can you translate the text and break down the grammar?" We want it broken down so we can see what it says, and carefully looking at the output (possessives, how to ask a question, voices), it's doing what it's supposed to be doing. So yes, it can do vision, and that's a really big deal; but this is V3, the hosted DeepSeek. The question will be what we can actually run locally, since there have been claims that this thing does not require serious GPUs, and I have the hardware to test that out. We'll do that in the next video; this one was just showing you how to use the AI-powered assistant if you didn't know where it was.
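By the way, the workflow above, copying prompt.md into the chat and attaching the supporting documents it references, can be scripted once you move to the programmatic side. A minimal sketch; the file names and the `assemble_prompt` helper are hypothetical illustrations, not part of any DeepSeek tooling:

```python
def assemble_prompt(prompt_md: str, attachments: dict) -> str:
    """Combine the main prompt with named attachment docs, mimicking
    the manual paste-the-prompt-and-attach-files workflow."""
    parts = [prompt_md.strip()]
    for name, text in attachments.items():
        # Label each doc so the model can tell the files apart
        parts.append("--- " + name + " ---\n" + text.strip())
    return "\n\n".join(parts)

combined = assemble_prompt(
    "You are a Japanese language teacher...",  # contents of prompt.md
    {"examples.md": "Student input: ...", "teaching-tests.md": "Test: ..."},
)
print(len(combined))
```

The same combined string can then be sent as the first message of a chat, which approximates what the UI's attachment feature does with "text extraction only".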
All right, so in this video we're going to start learning how to download the model locally, because imagine DeepSeek being unavailable one day for whatever reason; and again, it's supposed to run really well on computers that do not have expensive GPUs, which is what we're going to find out here. The computer I'm on right now is my Intel developer kit, which I'm remoted into over my network. Bought brand new, something like it runs between $500 and $1,000. It's built on a mobile chip I've been calling Lunar Lake, officially the Core Ultra 200V series mobile processors, the kind of processor you can imagine being in your phone within a year or two. What's special about these new chips is that, where you'd normally think of a CPU plus a separate graphics card, these have a built-in graphics card (an iGPU, an integrated GPU), an NPU (a neural processing unit), and a bunch of other capabilities crammed onto a single chip, which is supposed to let you download and run ML models. It's something you might want to invest in; you could probably do the same on a Mac M4 or other machines, but this is the hardware I have, and I do recommend it. Anyway, one of the easiest ways to work with a model is Ollama. I already have it installed: you just download and install it, and once it's installed it usually appears in the system tray; mine is over here. The way Ollama works is that you do everything via the terminal, so I'm opening Terminal on Windows 11 (same process on a Mac), and typing `ollama` confirms it's there. If it's running, it
shows a little llama icon somewhere on your computer. Over on the Ollama site you can see deepseek-r1, and notice the dropdown of sizes: 1.5 billion, 7 billion, 8 billion, 14 billion, 32 billion, 70 billion, and 671 billion parameters. When people say DeepSeek-R1 is as good as ChatGPT, they're usually comparing the top one, the 671-billion-parameter model, which is 404 GB. I don't even have enough room to download that, and you have to understand it would require actual GPUs or a more complex setup. There's a video that circulates of somebody who bought a bunch of Mac minis and stacked them; let me find it quickly. Here it is: one, two, three... seven Mac minis, running DeepSeek-R1. It says M4 Mac minis, with a total of 496 GB of unified memory. That's a lot of memory, and it is kind of using GPUs, since M4 chips, like my Lunar Lake chip, have integrated graphics units and NPUs; but you can see they need a lot of them. So with a bunch of these you can technically run it, and whatever you want to invest in, you really only need one (Intel Lunar Lake, Mac M4, AMD Ryzen's equivalent, whatever). But even if you stacked them, networked them together, and did distributed compute, which you'd use something like Ray Serve for, look at the typing speed in that video: it is not fast, just clunk, clunk, clunk. So understand that it can be done, but you're not getting that at home unless the hardware improves or you buy seven of these. That doesn't mean we can't
run some of the other models, though you do need to invest in something like this dev kit and add it to your network; buying a graphics card means buying a whole computer around it, which gets really expensive, which is why I really believe in AI PCs. Back over here: we're not running the 671B one, there's no way, but we can probably run the 7-billion-parameter one easily, and we can definitely do the 1.5-billion one. So the 7B model is really what we're targeting. To download it, all I have to do is copy this command (I already have Ollama installed), and it downloads the model for me, pulling it, probably, from Hugging Face. If we search Hugging Face for DeepSeek-R1, it's presumably grabbing this one; there are some variants here I'm not 100% certain about, and you can see distills of other models underneath, which is kind of interesting, but I think this is the one being downloaded right now. Normally what we'd look for here are the safetensors files, and there are a bunch of them, so I'm not exactly sure; we'll figure that out in a bit. Back over here, the download is almost done, so it doesn't take that long; the files are a bit large, but I should have enough RAM on this computer. I opened up System Information, and down below it says I have 32 GB of RAM. RAM matters because you have to have enough to hold the model in memory.
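As a sanity check before pulling a model, you can estimate its rough size from the parameter count and the quantization level, and compare that against free disk space. This is a back-of-the-envelope sketch, not how Ollama itself computes anything; the 4.8-bits-per-weight figure is just an assumed average for a mixed-quantization file that happens to line up with the 404 GB listing:

```python
import shutil

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # bytes = parameter count * (bits per weight / 8); report in GB (1e9 bytes)
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def enough_disk(path: str, needed_gb: float) -> bool:
    # Compare the download size against free space on the target drive
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

print(model_size_gb(7, 4))      # 3.5 -> a 4-bit 7B model is roughly 3.5 GB
print(model_size_gb(671, 4.8))  # ~402.6 -> close to the 404 GB listed for R1 671B
```

The same arithmetic explains why RAM matters: a model that doesn't fit in memory (plus context) will swap or crash, which matches the hangs described above.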
If the model is large you also have to be able to download it, and then you may need the GPU for it as well. The pull is almost done, so I'll pause here until it's at 100%; once it's done it should just start working automatically. It downloaded, and now it's pulling some additional layers, I'm not exactly sure what; and now it's ready, after just a few minutes. We'll say "hello, how are you", and that's pretty decent, going at an okay pace. Could I run something more intensive? That's the question. We're at 7 billion; why did I do 7 when I could have done 8? The real question is where it starts chugging, and it might be at the 14-billion-parameter model. I test "hello" again, and we're getting pretty decent results. Keep in mind that even a smaller model, if we fine-tune it, can give better performance on very specific tasks, if that's what we want. This one seems okay, though, so I'm curious to push further; I can hear the Lunar Lake dev kit spinning up from here. I type /bye to exit, and then I want to manage models: first `ollama list` to see what's installed, because we want to be cautious about disk space, then remove deepseek-r1. This model is great, I just want to run the 8-billion-parameter one or something larger. Okay, it's deleted. I'm pretty confident it can run the 8B, so let's do the 14-billion-parameter model instead; this is where it might struggle. How large is it? About 10 GB, and I definitely have room for that, so I'm going to go ahead and download it.
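Besides the interactive `ollama run` session, Ollama also serves a local REST API (by default on port 11434), which is handy for the programmatic use mentioned at the start. A minimal sketch; only the request is built and printed here, and the actual send (commented out) assumes Ollama is running locally with the model already pulled:

```python
import json

# Ollama's default local endpoint for one-shot generation
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # stream=False asks for a single JSON response instead of chunks
    return {"model": model, "prompt": prompt, "stream": False}

body = json.dumps(build_request("deepseek-r1:7b", "hello, how are you?"))
print(body)

# To actually send it (requires a running Ollama with the model pulled):
# import urllib.request
# req = urllib.request.Request(
#     OLLAMA_URL, body.encode(), {"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```

The model tag (`deepseek-r1:7b`) follows the size dropdown on the Ollama site; swap in `:8b` or `:14b` for the larger pulls discussed here.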
Once that's downloaded we'll decide what we want to do with it, so I'll be back when it finishes. All right, we now have the model running, and typing "hello", surprisingly, it's doing okay. You can't hear it, but as soon as I typed I could hear my little Intel developer kit going. If you were to buy an AI PC: the exact one I have isn't for sale, but it has a Lunar Lake chip, the Core Ultra 200V series, and if you find it from a partner like ASUS or whoever Intel partners with, you get the same hardware; Intel just doesn't sell them direct. You can see here that we can actually work with it. I'm not sure how long it will hold up, it might quit at some point, but at least we have a way to work with it. So Ollama is one way to get the model, but there are other ways. Next up is LM Studio, which we'll do in the next video, and that will give you more of an AI-powered-assistant experience: not working with it programmatically, but closer to the end result we want. I'm not going to delete the model just yet (I've already shown you how to do that, if you want to), because the next tool might require Ollama as the way it downloads models; we'll go find out. See you in the next one. All right, so here we're at LM Studio. I've actually never used this product before; I usually use Open Web
UI, which hooks up to Ollama, but I've heard really good things about this one, so I figured we'd open it up and see if we can get something close to a ChatGPT experience. They have downloads for Mac (the Apple-silicon M-series machines), Windows, and Linux, and you can see they're suggesting you have one of these new AI PC chips; if you have GPUs, you can probably use those instead. I actually do have a really good GPU, an RTX 4080, but I want to show you what you can run locally on the dev kit. So we'll wait for the download, then install. I'm really curious how this plugs in: does it hook into Ollama, or does it download the model separately? We'll find out shortly, once it's done installing. Setup completes ("LM Studio has been installed on your computer"), we hit Finish, and it opens; give it a moment. In the last video we stopped Ollama, and I'll close it out again here, since LM Studio might require it; we'll find out. It says "get your first LLM" and offers Llama 3.2, which isn't what we want, so we go down below. There's an "enable local LLM service on login" option, which sounds like it needs an account, but I don't see a login, so we'll go back, skip the onboarding, and figure out how to install our model ourselves. At the top I notice "select a model to load; no LLMs yet, download one to get started". I mean, yes, Llama
3.1 is cool, but it's not the model I want; I want a specific one, and that's what I'm trying to figure out. In the bottom-left corner there are some options (I know it's hard to read, I apologize, but there's no way to make the font larger), including a link to lmstudio.ai. Over there, in the model catalog, I search for DeepSeek: there's deepseek-math-7b, which is fine, but I want the normal DeepSeek model; there's DeepSeek-Coder-V2, which would be cool for coding; and there are the R1 distills, a Llama 8B distill and a Qwen 7B distill. I'd think we want the Llama 8B distill, so I click "Use in LM Studio", click Open, and it starts downloading: 4.9 gigabytes, so we'll wait for that to finish. It looks like we don't need Ollama at all; LM Studio is all-inclusive. One thing I do want to point out: notice it's a GGUF file, which makes me think it's using the same format llama.cpp runs, and Ollama uses GGUF files too, so they're probably sharing the same underlying machinery. While this downloads, I might as well explain what a distilled model is. You'll notice names like "R1 distill Llama 8B" or "Qwen 7B": distillation is taking a larger model's knowledge and transferring it to a smaller model, so it runs more efficiently while keeping much of the capability.
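For the curious, the core of many distillation recipes can be written down in a few lines: soften both models' output distributions with a temperature, then push the student's distribution toward the teacher's by minimizing their KL divergence. This is a toy sketch of that loss term, not DeepSeek's actual recipe (their R1 distills are reportedly trained on samples generated by R1, which is a related but different transfer method):

```python
import math

def softmax(logits, t=1.0):
    # Temperature-softened softmax: higher t spreads probability mass out
    exps = [math.exp(z / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, t=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the core term most classic distillation setups minimize."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; diverging logits give a positive loss
print(kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
```

In practice this term is averaged over many training examples and often mixed with an ordinary next-token loss; the point is just that the student is rewarded for matching the teacher's full output distribution, not only its top answer.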
The process is involved (I explain it in my GenAI Essentials course, which this part of the crash course will probably get rolled into later), but basically it's a technique for knowledge transfer, and there are a lot of ways to do it, so I can't summarize it all here. Likely they have a bunch of evaluation queries they run against the larger models, look at the results, and train the smaller model until it does the same thing about as well. The model is done downloading, so we'll load it, and I'll get my head out of the way since I'm blocking the screen. Now we have an experience more like what we expected. At the top, I wonder if there's a way to bring the font size up, and whether there's a dark mode; the light mode is okay, but dark would be nicer. There are a lot of options: in the settings in the bottom-right corner there are themes, and there we go, that's a little easier on the eyes (again, I apologize for the small fonts; I even told it to go larger). Let's interact with it: "I am learning Japanese. Can you act as my Japanese teacher?" Note this is R1, which does not mean it has vision capabilities; I believe that's a different model. I'm hearing my computer spinning up in the background again, and you can see it's thinking: "I'm trying to learn Japanese and came across the problem of translating 'I'm eating sushi'; first, I know that in Japanese the order of subject can be..." It's really interesting; it's going through a thought process. Normally, when you use something like Open WebUI, you're using the model directly, almost like a
playground, but this one actually has reasoning built in, which is really interesting; I didn't know it had that. There is literally agent-style thinking capability. That's not specific to DeepSeek, I think any model we brought in would do this, but it's really cool to see the reasoning as it works through the problem, the stuff ChatGPT is actually doing in the background when it says it's "thinking" but doesn't fully show you. We'll let it think and be back in a moment. ...All right, looks like I lost my connection. This sometimes happens: a heavy computational task can halt all the resources on your machine. This model was a bit smaller, but I was still running Ollama in the background. I can see my Intel machine rebooting, so I'll give it a moment, reconnect, make sure Ollama isn't running, and try again. ...You know what? The computer decided to do Windows updates; it didn't crash. But this can happen when you're working with LLMs, they can exhaust all your resources, so I'll wait until the update is done and get my screen back up. ...All right, I'm reconnected. I have some tools here that might show my resource use; let me open them and see if any will tell me my memory usage. Yeah, I wouldn't call that very useful. Maybe there's some monitoring tool I could download... well, Activity Monitor would just do it on a Mac; on Windows it's Task Manager. Let me remember the hotkey... there we go, Task Manager. So maybe I'll just
keep Task Manager open here so we can track our memory usage; Chrome, obviously, likes to consume quite a bit. (I'm not actually running OBS on this machine; it just wasn't open here, so what you saw was my Task Manager in the background.) This computer just restarted, so it's getting itself in order, and our memory usage is at 21%, which is what we really want to keep track of. So, back over to LM Studio. This is the stuff that really happens when you're using local LLMs: things crash, and it's not a big deal. And when we come back, it actually did finish: "thought for 3 minutes and 4 seconds", and you can see its reasoning. It gives the Japanese translation of "I like eating sushi", which is correct, and the structure explanation correctly places everything. One thing I'd like to ask: "Can you show me the sentence in Japanese using Japanese characters, kanji and hiragana?" It doesn't have a model selected, so we go to the top, and what's kind of interesting is that you can apparently switch between different models as you work. There are options here: GPU offload of individual model layers (I don't know how to configure these right now), and flash attention, which would be really good, since it decreases memory usage and generation time on some models, though we don't have it here. I'm going to load the Llama distill and ask whether it can do this for us, because that would make it a little more useful. So I'll run that, and we'll be back in just a moment, and
we'll see the results. All right, we are back, and we can take a look; I'll scroll up. What's really interesting is that it does work, every time I do this it works, but the computer restarts, and I think the reason is that it's exhausting all possible resources. The size of the model isn't large; it's the 8-billion-parameter one, at least I think that's what we're running (it says 8B distilled, so we'd have to take a closer look, but it says 8 billion parameters). The thing is, it's the reasoning happening behind the scenes that I think is exhausting the machine, whereas with Ollama it was less of an issue. It might just be that LM Studio's agent setup doesn't have a way, or at least I don't know how to configure one, to keep it from blowing things up when it runs out of resources. Notice that we can set the context length, so maybe reducing that would help; there's also "keep model in memory" ("reserve system memory for the model even when offloaded to GPU; improves performance but requires more RAM"), which we might toggle off for better behavior. Right now, when I run it, it restarts, but it is working: you can see it thought for 21 seconds, says "of course, I'd like to help you", gives some examples, and produces pretty good output. Anyway, what we've done is change a few options: don't keep the model in memory (that might be the issue), bring the context window down, and the CPU threads to allocate seem fine to me (again, I'm not sure about the other options). We'll reload the model with those settings and try one more time; if my computer restarts, it's not a
big deal, but again, it might just be LM Studio causing these issues. I click into the chat (I think it kept those settings) and ask something like: "How do I say, in Japanese, 'where is the movie theater?'" It doesn't matter if you know Japanese; we're just taxing it with something hard. It's running again and starting to think, and while it does that I open Task Manager... and, did it restart again? Yep, it did. So this is just the experience; it has nothing to do with the Intel machine specifically, this is what happens when your resources get exhausted. This is the best I can demonstrate it here. Now, I can try running this on my main machine with the RTX 4080, where I actually have a dedicated GPU and a 14th-generation Intel chip (Raptor Lake, I think), so maybe we'll try that in a separate video just to see what happens. I can definitely see how having a couple of these machines stacked would make this a lot easier; even a second one would be more cost-effective than buying a completely new computer outright, compared to those two, or smaller mini PCs. I'll be back in just a moment. Okay, I'm going to get this installed on my main machine. As I'm recording, it's already using my GPU, so the model is going to have to share it; I'll stop this video and treat the next one as "LM Studio using the RTX 4080", and we'll see whether the experience is the same or different. All right, I'm back on my main computer, and we're going to use LM Studio: I skip the onboarding, and I
remember there's a way to change the theme, in the cog at the bottom-right corner; we'll switch to dark mode so our eyes have it a little easier, and bump up the font a bit. To select the model, I go to "select a model" at the top (we don't want the suggested one), then, not the left-hand side, it was the bottom-left corner, to lmstudio.ai, and over to the model catalog at the top, looking for DeepSeek-R1-Distill-Llama-8B. We click it, say "use in LM Studio", and it downloads locally; I'll be back in a moment. All right, I've downloaded the model, so let's load it. I'm a little concerned this machine will restart too, but since it's offloading to the GPU I'm hoping that's less of an issue; you can see it loading the model into memory. We really should look at the options we have here (it doesn't make them easy to find, but here they are), and this one actually is offloading to the GPU, you can see it has GPU offload set. I'm almost wondering whether I should have set GPU offload on the AI PC, since it technically has an iGPU; maybe that's where we ran into issues, whereas Ollama was maybe already utilizing the GPU. I don't know. Anyway, I want to ask the same thing: "Can you teach me Japanese for the JLPT N5 level?" We hit enter, and again I love how it shows us its thinking. I'm assuming it's using the RTX 4080 on this computer, and this is going pretty decently fast; it's not making my computer struggle at all.
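One more thing worth knowing at this point: besides the chat UI, LM Studio can expose a local server that speaks the OpenAI-compatible chat-completions format (by default on port 1234, once you enable the server inside the app). A sketch of the request shape; the model name is an assumption and has to match however LM Studio labels the loaded distill on your machine:

```python
import json

# LM Studio's default local server endpoint (when the server is enabled)
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def chat_request(model: str, user_msg: str,
                 system: str = "You are a Japanese language teacher.") -> dict:
    # Standard OpenAI-style chat payload: a system role plus the user turn
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
    }

payload = chat_request("deepseek-r1-distill-llama-8b",
                       "Teach me Japanese for the JLPT N5 level.")
print(json.dumps(payload, indent=2))
```

This is how you'd get the "AI-powered assistant" behavior from your own scripts rather than the LM Studio window, posting the payload with any HTTP client while the app keeps the model loaded.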
This is very good, actually reasonably good; it's performing really well. The question remains whether the developer kit could do the same; I remember the GPU wasn't offloading there, so maybe it didn't detect the iGPU. But this machine is going pretty darn quick, so that was really good. It gives me a bunch of material, so: "Okay, but give me example sentences in Japanese." We give it a moment... yep, that looks good; it's producing really good output, and again, this model is just the Llama 8-billion-parameter distill. I'll eject this model and go back to the catalog, because there are other DeepSeek models: there's Coder V2, described as "the younger sibling of GPT-4", the DeepSeek-Coder-V2 model, but that sounds like a DeepSeek-2-era model, so I'm not sure it's really the latest; we only want to focus on the R1 models, and I don't think we care about the others. But you can see we're getting really good performance, so the question becomes: what's the compute difference, the TOPS difference, between these two machines? Maybe we can ask the model itself. I start a new conversation and ask: "How many TOPS does the RTX 4080 have?" (I think TOPS is the term.) Select the model, load it, run it. While it's thinking, obviously we could just use Google for this, we don't really need the model, but I want a comparison of how many TOPS each has, so I'll search in parallel. Here it goes: the RTX 4080 does not have a single officially specified TOPS number; NVIDIA, the
company, focuses on metrics like CUDA cores and memory bandwidth, so anything beyond that would be speculative. Okay, but then how do I compare TOPS for, say, Lunar Lake versus the RTX 4080? I know there are lots of ways to do it, but if I can't compare directly, how do I do it? While that's trying to figure it out, I'm going to go over to Perplexity and maybe get an exact example, because I'm trying to understand how much my discrete GPU does compared to the one that's integrated. So: Lunar Lake versus RTX 4080 TOPS performance, and we'll see what we get. Lunar Lake has 120 TOPS; NVIDIA targets gaming rather than AI workloads and doesn't typically advertise TOPS — they talk about things like maintaining 60 FPS. Okay, but then how many TOPS could the RTX 4080 be? That kind of makes it hard, because if we don't know how many TOPS it is, we don't know what kind of expectation we should have. Fair enough — so we can't really compare; it's apples to oranges, and it's just not going to give us the result. But here it goes through a comparison: if you run MLPerf on both GPUs with the same model, like a ResNet, you can directly compare throughput across the two architectures, and that's basically the only way to do it. So: apples to oranges. I want to attempt to run this one more time on the Lunar Lake and see if I can set the GPUs, but if we can't set the GPUs, I think it's always going to have that issue. We will still use the Lunar Lake with Hugging Face and other things later, so I'll be back in just a moment. Okay, I'm back, and I just did a bit of exploration on my other computer, because I want to understand something: I have this AI PC, and it's very easy to run this on my RTX 4080, but when I run it on the Lunar Lake, it is shutting down, and I
think I understand why. This, I think, is really important: when you're working on local machines, you have to have a better understanding of the hardware. I'm going to RDP back into this machine — just give me a moment. Okay, I have it running again, and it probably will crash again, but at least I know why. There's a program called CAM, and what CAM does is let you monitor your system — this is for Windows; on Mac I don't know what you'd use, probably Activity Monitor. Here I can see that none of these CPU cores is being overloaded, but this is just showing us the CPUs. If we open up Task Manager — and right now the computer is running perfectly fine, not even spinning its fans — and go to the left-hand side, we have CPUs, NPUs, and GPUs. The NPUs are the things we want to use, because an NPU is specifically designed to run models. However, a lot of the frameworks, like PyTorch and TensorFlow, were originally optimized for CUDA in their underlying implementations, so normally you have to go through an optimization or conversion step. I don't know at this time whether there's a conversion for Intel hardware for DeepSeek, because DeepSeek is so new, but I'd imagine that's something the Intel team is working on — and this isn't specific to Intel; whether it's AMD or whoever, they all want optimizations that leverage their own kinds of compute, like their NPUs. It also has to do with the tool we're using. This other window — this is Core Temp, just showing us all the temperatures. So what we can do is watch what's going on; I'm going to bring this over so we can see what's happening. We want to use the NPUs, but it's not going to happen, because this tool isn't set up to do that. But if I drop this down here and click into the settings, we have our options —
before, we didn't have any GPUs available, but now we can say use all the GPUs. I don't know how much it can offload, but I'll set it to something like 24 layers. We have a CPU thread count — that might be something we want to increase — and we can reduce our context window. We might not want to load the whole model into memory; the point is that if it exhausts the GPU — because it's all a single integrated circuit — I have a feeling it's going to end up restarting the machine. Here again you can see utilization is very low. We'll go ahead and load the model, and the next thing I'll do is type something like: I want to learn Japanese — can you provide me a lesson on Japanese sentence structure? Notice that if the prompt doesn't require a long thought process, it works perfectly and doesn't cause any issues with the computer. We'll run it and pay attention to the left-hand side: now we can see it's utilizing the GPUs — before it was at zero, not using the GPUs at all, but notice it's at about 50%, and it's doing pretty well. Our CPU is higher than usual; when I ran this earlier off-screen, the CPU was really low and it was the GPU that was working hard. So again, you really have to understand your settings as you go. This isn't exhausting anything so far — we're watching these numbers and our core temps, and we're not running into any issues; it's not even spinning up the fans or making any complaints. The other challenge is that I have a developer kit that they don't sell, so if there were an issue with the BIOS I'd have to update it myself, and all I can get is Intel's help. If I bought a commercial version — from whichever partner, ASUS or Lenovo or whoever — I'd probably have fewer issues, because they maintain those BIOS updates. But so far we're not
having issues; we're just monitoring — 46%, 47%, 41%, with cores at 84%, 89% — and carefully watching this stuff. I might have picked the perfect settings: I turned the GPU offload down, and I told it not to load everything into memory, and now it's not crashing. There we go — it's not as fast as the RTX 4080, but you know what, here's my old graphics card; I bought it not long before I got my new computer. It's an RTX 3060 — not that old, a couple of years, 2022 — and when I used to run models on it, my computer would crash. The point is that these newer CPUs, whether it's the M4 or Intel's Lunar Lake or whatever AMD's equivalent is, have roughly the strength of graphics cards from two years ago, which is crazy to me. Anyway, I think I might have found the sweet spot — maybe I'm just really lucky — but you can see the memory usage and so on; you just have to monitor it, and you'll find the settings that work for you. Or you buy a really expensive GPU and it'll run perfectly fine. It's going — we'll give it a moment; I'll be back shortly. Okay — it was going a little slow, so I decided we'll just move on, but my point was made clear: if you dial in the specific settings, you can make this stuff work on machines that don't have a dedicated graphics card; and if you have a dedicated graphics card, you can see it's pretty good — this is fine on the RTX 4080, so you'll be in good shape. Now that we've shown how to do this with AI-powered assistants, let's take a look at how we can actually get these models from
Hugging Face next and work with them programmatically. I'll see you in the next one. All right — what I want to do in this video is see if we can download the model from Hugging Face and then work with it programmatically, because that's going to give you the most flexibility with these models. Of course, if you just want to consume them, then using LM Studio, as I showed you, would be the easiest way, but having a better understanding of how to use these models directly is useful. For the rest of this I'm just going to use the RTX 4080, because I've realized that to really make use of AI PCs you have to wait until there are optimizers for them. For Intel, there's a kit called OpenVINO, and OpenVINO is an optimization framework. If we go to their notebooks and scroll down, they have a page with different LLMs that are optimized specifically so you can leverage the NPUs, or make them run better on CPUs. Until DeepSeek is in there, we're stuck on the GPUs, and we're not going to get the best performance we could — so maybe in a month or so I can revisit that, and it might even be as fast as my RTX 4080. For now, we'll stick with the RTX 4080. Let's go look at DeepSeek, because they have more than just R1. You can see there's a collection of models: in here we have R1 and R1-Zero — I don't know what Zero is; let's take a look, it probably explains it somewhere — plus R1 distilled Llama 70 billion parameters, Qwen 32 billion, Qwen 14 billion, so we have variants we can utilize. Give me a moment — I want to see what Zero is. To
me it sounds like Zero is the precursor to R1 — it describes the training relative to the usual supervised fine-tuning step. So I don't think we want to use Zero; we want the R1 model or one of the distilled versions, which give similar capabilities. If we go over here, it's not 100% clear how we can run this, but down below we can see total parameters is 671 billion. So this one literally is the big one — the really, really big one — and that would be too hard for this machine; we can't run 671 billion parameters. You saw the person stacking all those Apple M4s — yeah, I have an RTX 4080, but I'd need a bunch of machines to do it. Down below we have the distilled models, and this is probably what we were using with Ollama, so this is where I'd focus my attention. When we're using Hugging Face, it shows how we can deploy the models — notice over here we have vLLM. I covered this in my GenAI Essentials course, I believe: there are different ways to serve models. Just as web servers have software underneath to serve them, so do these ML models, and vLLM is one you want to pay attention to, because it works with the Ray framework. Ray is important because it has a component called Ray Serve — it's not showing me the graphic here — and Ray Serve lets you take vLLM and distribute it across compute. So when we saw that video of those Mac M4s stacked on top of each other, that was probably using Ray Serve with vLLM to scale out. If you were going to run the full model, you might want to invest time in vLLM; the Hugging Face Transformers library is fine as well, but again, we're not going to be able to run the full model on my computer, or on yours.
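As a rough sketch of what serving one of the distilled models with vLLM might look like — assuming vLLM is installed, the model fits in VRAM, and using its OpenAI-compatible server (flags can differ between vLLM versions, so treat this as illustrative):

```shell
pip install vllm

# Start an OpenAI-compatible server for the distilled model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --max-model-len 8192

# Then query it like any OpenAI-style endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "prompt": "Hello"}'
```

Scaling that server across machines is where Ray Serve would come in, as described above.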
So we'll go back for a moment. There's also V3, which has been very popular as well — that's actually what we were using on the DeepSeek website. If we go into DeepSeek V3, this one's a mixture-of-experts model, and it would be a really interesting one to deploy as well, but it's another huge one in the hundreds of billions of parameters, so we can't deploy it locally either — though if we could, we might get vision tasks and other things out of it. So we're really going to have to stick with R1, and it's going to be one of these distilled versions. I'm going to go with the Llama 8-billion-parameter one — I don't know why we don't see the other ones there, but 8 billion is something we know we can reliably run, whether on the Lunar Lake or on the RTX 4080. On the right-hand side we have Transformers and vLLM; Transformers is probably the easiest way to run it, and we can see there's some code here. So I'm going to get set up: I'll open up VS Code — I already have a repo. I'm going to put this in my GenAI Essentials course, because if we're going to do it, we might as well put it in there. I'll open that folder — I need to go up a directory, and I might not even have this cloned, so let me cd back — no, I don't have it, so I'll go over to GitHub. This repo is completely open, so if you want to do the same thing you can: search for GenAI Essentials, copy the URL, and git clone it. I'm going to open this with Windsurf, for fun, because I really like Windsurf — I've been using it quite a bit. Yes, I have it installed, and I have a paid version
of Windsurf, so I have full access to it — if you don't, you can just copy and paste the code, but I'm trying to save myself some time. We'll open this up, go into GenAI Essentials, and make a new folder called deep-seek; inside that, another called r1-transformers, because we're going to use the Transformers library. Select that folder, say yes, and make a new file. I want to make this a Jupyter notebook — I'm not sure if I'm set up for that, but we'll give it a go — so we'll name it basic.ipynb, the extension for Jupyter notebooks. You'd have to already have Jupyter installed; in my GenAI Essentials course I show you how to set this stuff up, so you can learn it that way if you want. I'm going to go over to WSL, install that extension if it asks, and check whether I have conda installed — there it is, and we have a base environment. Any time you're setting up one of these environments, you should really make a new one, because that way you'll run into fewer conflicts. So I need to set up a new environment. I can't remember the exact instructions, but I'm pretty certain I document them under local development in this folder — if I go to conda and then setup, I think I explain it there. For Linux — and that's what I'm using right now, with Windows Subsystem for Linux 2 — conda is already installed, so I just want to create a new environment. I'll use Python 3.10 — in the future you might want 3.12, but this version seems to give me the least amount of problems. I want this command, but I'll change it a little: I don't want the environment to be called hello, I want to call it deepseek. We'll paste the command in and run it.
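The environment setup described above boils down to a couple of commands (the environment name deepseek and Python 3.10 follow what's shown in the video; adjust both to taste):

```shell
# Create an isolated conda environment so package conflicts stay contained
conda create -n deepseek python=3.10 -y

# Switch into it before installing anything
conda activate deepseek
```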
Now it's setting up Python 3.10 and installing some things. Okay, we're good — I need to activate it, so: conda activate deepseek. Now we're in the deepseek environment. I'm going to go back to the left-hand side, and what I want to do is get some code set up. If we go back to the 8-billion distilled model page and open the Transformers tab, there's some code, and if it doesn't work, that's totally fine — we'll tweak it from there. I also have example code lying around, so if for whatever reason this doesn't work (sorry, I just paused there for a second), we can grab from my code base — I don't always remember how to do this stuff; even though I've done a lot of it, I don't remember half of what I do. We'll cut this up and put it in here, but I'm not sure how well Windsurf works with Jupyter notebooks — I've actually never tried that before. It's asking us to select a kernel, and it's not seeing the kernels that I want. One thing I don't think we did is install ipykernel — there's an extra step to get the environment working with Jupyter, and it might be under our Jupyter instructions here — yes, it's this: we need to make sure we install ipykernel, otherwise the environment might not show up. So I'm going to run conda install against the conda-forge channel — so we're downloading from conda-forge — conda install -f conda-forge, then paste in ipykernel, and it should install. I'm not sure if that worked; we'll scroll up and take a look — "the following packages are not available for installation" — oh, it's -c, not -f. So we'll fix that, which just means to use
the conda-forge channel, and that should resolve our issue: we install ipykernel, give it a second, say yes. What I'm hoping is that we'll now be able to actually select the kernel — we might have to close Windsurf and reopen it; we could do the same thing in VS Code, it's the same interface. I'm still not seeing it show up, so I'm just going to close Windsurf. It would have been nice to use it, but if we can't, that's totally fine. I'm going to open this again — open GenAI Essentials, just a plain open, no AI coding assistant — so we'll work through it the old-fashioned way. Somewhere in here we have a deep-seek folder. I'll make a new terminal, make sure I'm in WSL — which I am — and run conda activate deepseek, because that's the environment I need. Now I'll go into the deep-seek folder, into our r1-transformers folder — there it is. I didn't save any of the code, which is totally fine; it's not far away to get it again. So I'll go back over, grab that code, paste it in, make a new code block, and put the second part below. Now, normally we'd install PyTorch and some other things first, but I'm going to try the most bare-bones setup — it's going to tell me Transformers isn't installed, and that's totally fine. We'll run that. Oh — it's installing Jupyter, I see, so we did need that; maybe the kernel would have worked. I'll go to Python environments, and now we have deepseek listed, so maybe we could have gotten it to work with Windsurf after all.
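The kernel-registration fix above, with the corrected -c flag, looks like this (assuming the deepseek environment created earlier):

```shell
# -c selects the conda-forge channel (-f was the mistake made in the video)
conda install -c conda-forge ipykernel -y

# Register the environment as a Jupyter kernel so editors can list it
python -m ipykernel install --user --name deepseek
```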
So: we don't have Transformers installed — there's no module named transformers. I know we've done this before, so we might as well leverage old code and see what we did. Here we have a Hugging Face basic notebook, and yes, we do a pip install of transformers — that's all we really need. There's also python-dotenv; we might need that as well, because we might need our Hugging Face API token to download the model — I'm not sure at this point, but I'll install it at the top too. We'll give that a moment; it shouldn't take long. We might also need to install PyTorch, or TensorFlow, or both — that's very common when working with open-source models: they may be in one format or another and need to be converted over, though sometimes you don't need to do it at all. We'll see. Now it's saying to restart, so we'll do a restart — we should only have to do that once — and then run the imports. Now we have less of an issue. It's showing us this model: basically this will download it directly from Hugging Face, so if we grab this model id and go back to the page I had open a moment ago, it should match — if I delete this and put it in here, it's the same id, and that's how it knows which model it's grabbing. It doesn't look like we need our Hugging Face API token, but we'll find out in a moment. It should download, we'll get a message, we'll load Transformers, we'll have the tokenizer, then the model, and the messages get passed in. It also says you can point it at a local model directory directly. So there are two different approaches here — I think we cover this: you either use a direct model or a pipeline — so let's see if we can just use the pipeline.
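As a sketch, the two approaches mentioned above look roughly like this. The model id is the one from the Hugging Face page in the video; this isn't run here, since it downloads several gigabytes of weights and needs a PyTorch or TensorFlow backend installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Approach 1: the pipeline helper — the simplest way to get generations
pipe = pipeline("text-generation", model=model_id)
print(pipe([{"role": "user", "content": "Who are you?"}]))

# Approach 2: load the tokenizer and model directly, for more control
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```

The pipeline route is what the notebook in the video ends up using.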
If I don't remember how to do this, we can go over here and take a look — I don't remember everything I do — but yes, this is the notebook we just had open, the basic one: it builds a pipeline and then we just use it, so in a sense this should just work. I'm going to separate the setup out so I don't have to continually rerun it — we'll cut this out, run that, then run this. Down below it says: at least one of TensorFlow 2.0 or PyTorch should be installed; to install TensorFlow, do this. This is what I figured we'd run into — it's complaining that you need PyTorch or TensorFlow. I don't know which one it needs; I'd guess TensorFlow, because I saw TensorFlow mentioned, so I'm going to make a new cell up top and — really just guessing — install both tensorflow and pytorch, because it'll need one or the other and one of them will work, assuming I spelled them right. Two competing frameworks — I learned TensorFlow first, and I kind of regret that, because PyTorch is now the more popular one, even though I really like TensorFlow, or specifically Keras. We'll give this a moment to install, then run it again and see what happens. Okay, it's saying PyTorch failed to build — "failed to build installable wheels" — and I hope that doesn't matter, because if it uses TensorFlow it's fine. Just a moment — that was my twin sister calling; she doesn't know I'm recording right now. I'm going to restart, even though we may not have PyTorch — or it might be installed, I'm not sure — and just try it again anyway, because sometimes this stuff just works. And it's complaining: at least one of TensorFlow or PyTorch should be installed — install
TensorFlow 2.0, or to install PyTorch, read the instructions here. Okay — this shouldn't be such a huge issue, so let's use DeepSeek, since we're big DeepSeek fans here today. I'm going to go to the DeepSeek website — which runs V3, not even R1 — log in, and say: I need to install TensorFlow 2.0 and PyTorch to run a Transformers pipeline model. We'll see what we get. Here it's specifically saying to use 2.0 — it's always a little tricky — so I'll go back up and maybe pin tensorflow==2.0.0... though it did already install TensorFlow 2.x, so we shouldn't need to tell it that again. Let me read the error carefully: "at least one of TensorFlow 2.0 or PyTorch should be installed... to use the model, pass the framework" — oh, so it's asking which framework to use, since it doesn't know. I'll go back, give DeepSeek the error, and see if it can figure it out — it's not quite what I want, so I'll stop it and just ask: I am using a Transformers pipeline; how do I specify the framework? I'm surprised I have to — usually it just picks it up. It answers with PyTorch or TensorFlow options and claims TensorFlow installed successfully — I'm not sure if it's just guessing, because this thing could be hallucinating; we don't know — but we'll give it a try and run it, and we're still getting the same error. So I'll search — this is probably a common Hugging Face issue with TensorFlow — and somebody has commented: you need to have PyTorch installed. Mm-hm. I don't know if anyone has actually told us how to fix this
yet — give me a second, let me see if I can figure it out. All right, I went over and asked Claude instead — maybe Claude does better, because it's not just the model itself, it's the reasoning behind it, and V3 didn't really get us very far, even though it's supposed to be a really good model. Claude is suggesting that PyTorch is generally what's used and that my install command is incorrect: we have tensorflow, which is fine, but it's suggesting we install torch and accelerate. So maybe PyTorch on pip is just torch and I simply forgot — I don't know why I wrote pytorch. We'll give that a moment and see what happens. The other thing it's saying is that we probably don't need to specify the framework, because for Llama models in particular it normally uses PyTorch — I'm not sure if that's the case here. Another thing we could do is look at the files on Hugging Face — and I'm seeing TensorFlow files, which makes me think it is using TensorFlow, but maybe it needs to be converted over to PyTorch; I don't know, but we should have both installed. Even though I removed TensorFlow from the top cell, it's still installed, and we could just leave it as a separate pip install tensorflow line. This is half the battle in getting these things to work — dealing with these conflicts — and you'll get something completely different than me and have to work through it. It would be interesting to see if we could serve this via vLLM, but we'll work this way first. All right, that's now installed; I'll restart the kernel, and now we should have everything. We'll run the Transformers pipeline cells next — and now it's working, so that's really good.
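The root cause here is that the PyPI package name is torch, not pytorch — which is why `pip install pytorch` produced the "failed to build installable wheels" error seen earlier. The working install cell would look something like:

```shell
# "pytorch" is not the PyPI name; the framework ships as "torch"
pip install torch accelerate

# TensorFlow can stay installed alongside it if you already have it
pip install tensorflow
```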
Is it utilizing my GPUs? I would think so — sometimes there are configurations you have to set, but I didn't set anything here. Right now I think it's just downloading the model, so we'll wait for the download and then see if it infers. I'm not sure why it's not getting there — maybe it'll take a moment to get going. We didn't provide a Hugging Face API token, so maybe that's the issue — it's kind of hanging, which really makes me think I need my token. So I'm going to grab this code over here, because I assume that's what it wants, paste it in, and make a new .env file. I'm also going to gitignore it, because I don't want it to end up in the repo. Now, what's the environment variable — Hugging Face API key? I never remember what it is... I'm having a hard time finding the name right now — oh, it's HF_TOKEN, that's what it is. So I need HF_TOKEN. Let me check whether the download moved at all — no, it hasn't, so I don't think it's going to, and I think it's because it needs the token. So I'm over in Hugging Face — I have an account; you go down to Access Tokens — let me log in. I'll create a new token, read-only, called deepseek. There were no terms I had to accept to download this model, so I think it's going to work. I'm going to revoke my key later, so I don't care if you see it. In the .env file the variable was HF_TOKEN, I believe, so now we supposedly have our token set.
We'll go back over, scroll up, and run this — now it should know about my token; I shouldn't even have to pass it explicitly, I don't think. Maybe it'll download now; I'm not sure. Notice we're not passing the token in anywhere. The notebook is acting a little funny today — I'm not sure why a cell is jumping down there; it's probably just the way the messaging works — so I'll cut this and paste it below. I'm really just trying to get this to trigger, but it's not doing anything. Another way would be to download the model directly — I don't like doing it that way, but we could. I'm just double-checking the token environment variable — yes, it's HF_TOKEN, so I have it right; why it's not downloading, I don't know. Let's go look at the model page and make sure there wasn't anything we had to accept — sometimes that's a requirement, where if you don't accept the terms they won't give you access. If I go to the model card, it doesn't show anything I have to agree to — there's nothing here whatsoever. Carefully looking at the files, we have some safetensors, that's fine — oh, here it goes! We just had to be a little patient. It's probably a really popular model right now, and that's probably why it's so slow to download. I'll just wait until the download is done; I'll be back in a moment — it's downloading and running the pipeline. I did put the print statement down below, so it might execute there or up here; we'll find out. This cell might be redundant, because I took it out while it was running live, but we'll wait for this to finish. It's taking a significant time to download — oh, maybe
it's almost done — yeah, downloading shards, loading the checkpoints, and now it's starting to run, saying cuda:0. I think that means it's going to utilize my GPU — cuda:0 refers to the first GPU device, whereas cpu would mean it's running on the CPU. Now it appears to be running, so we'll wait a bit longer. The thing is, once this model is downloaded, we can just call pipe every time and it'll be a lot faster. All right, I'm back — it ran the first part of the pipeline, but I guess I didn't run this line, so we'll run it, and since we separated things out, the pipeline should hopefully still be defined. It's probably just doing its thing trying to run, so we'll give it a moment and see what happens. Yeah — I don't think it should take this long; I'm going to stop it and run it again, and I think it'll be faster this time. My recording is struggling, which is why I like to use an external machine — because now my computer is hanging — so what I might need to do here is pause, if I can. All right, I'm kind of back — my computer almost crashed again, and I'm telling you, it's not the Lunar Lake: these workloads can exhaust all your resources, and that's why it's really good to have an external computer specifically dedicated to this, like an AI PC, or a dedicated PC with GPUs that isn't your main machine. There's a tool called nvidia-smi that will show us the GPU usage — it probably won't tell us much now because the job is already running, but as this runs we can use it to figure out the GPU usage.
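A couple of nvidia-smi invocations that can be useful for this kind of monitoring (the refresh interval below is just an example):

```shell
# One-shot snapshot of GPU utilization, memory, and temperature
nvidia-smi

# Refresh a compact utilization/memory view every 2 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu \
           --format=csv -l 2
```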
uh they cnly reported some API calls so this is what I mean where this could be a little bit challenging and again we downloaded the other models but those other models that we saw and by the way I’ll bring my head back in here so we stop seeing uh EOS uh webcam utility here but but the thing that we saw was that um uh when we used uh ol to download it was using ggf which is a format that is optimized to run on CPUs right and it can utilize gpus as well so it was already optimized whereas uh the model we’re downloading is not optimized I don’t think and um apparently I just don’t have enough to run it at the 8 billion parameter one but the question is is it downloading the correct one so if we go back over to here right this one is distilled 8 billion parameter it has to be it right because um because of that there and so we might actually not even be able to run this at least not in that format okay so you can see where the challenges are coming here so we go over to our files and we take a look here we can see we have a bunch of safe tensors that’s not going to really help us that much we got to go back into deep seek here and we’ll look into um the ones that they have here well here’s the question is it yeah we did the 8 billion 8 billion parameter one so we go into here 8 billion there is quen 7 billion which is a bit smaller there’s also the 1.5 billion one that’s not going to be useful for us but you know what I’m kind of exhausting my resources here so we can run this as an example and then if you had more resources like more RAM then you’ll have less of a problem so I’m going to go ahead and copy this over here and we’re going to go ahead and paste it in here as such okay so now we are literally just using a smaller model because I don’t think I have enough um uh memory in order to run this especially when I’m recording this at the same time and you know if we go over to here um I’m just typ in clear here um so we have fan temperature performance you can 
see none of the gpus are being used right now so if we knew it if we knew that they would be showing up over here right the gpus and so right now I think it’s just trying to attempt to download the model because we swapped out the model right so at some point here it should say hey we’re downloading the model it’s not for some reason but we’ll give it a moment okay because the other one took a bit of time to get going so I’m going to pause until I see something all right so after waiting a while this one ran it says Cuda out of memory Cuda external errors might be asynchronous reported at the API calls and stack and so it keeps running on a memory and I think that’s more of an issue of this computer so I might have to restart and run this again so I’m going to be back I’m going to stop the video I’m going to restart it’s the easiest way to dump memory because I don’t know any other way to do it but you know if I go here I mean it shows no memory usage so I’m not really sure what the issue is but I’m going to um restart I’m also going to close OBS I’m going to run it offline and then I’m going to tell I’m going to show you the results okay be back in just a moment all right I’m back and I also uh just went ahead and I ran it and this time it worked much faster so I’m not sure maybe it was holding on to the cache of the old one that was in here but giving my computer a nice restart really did help it out and you can see that we are getting the model to run um I don’t need to run the pipeline every single time I’m not sure why I ran that twice but I should be able to run this again again I’m recording so maybe this won’t work well as it is utilizing the gpus we’ll see [Music] here so now it’s struggling but literally I ran this and it was almost instantaneous like how fast it was that it ran so yeah I think it might be fighting for resources um and that is that is a little bit tricky for me here we’ll go back over here to Nvidia SMI I mean I’m not seeing any of the 
processes being utilized so it’s kind of hard to tell what’s going on here but I’m going to go ahead and just stop this can I stop this but it clearly works so even though I can’t show you yeah see over here says volatile GPU utilization 100% And then down here it says 33% I thought that these cores would start spitting up so we could we could make sense of it and then here I guess is the memory usage so over here you could see we have 790 of 8 818 and here we can see kind of the limits of it but if I run it again you can see that my me recording just this video is using up uh the memory so that kind of makes it a bit of a challenge um and the only way I could do that is maybe if I was to uh use onboard Graphics which um are not working for me um because I don’t know if I even have any onboard Graphics but that’s okay so anyway um that’s our that’s our example here that we got working it clearly does work I would like to try to do another video where we use VM but I’m not sure if that is possible um but we’ll consider this part done and if there’s a video after that then you know that I was able to get BLM to work see you the next one all right that’s my crash course into uh deep seek I want to give you some of my thoughts about how I think our crash course went here and what we learned as we were working through it um one thing I realized is that um in order to run these models uh you really do need optimized models and when we’re using ama if if you remember it had the ggf extension that’s that file format that is um more optimized to run on CPUs I know that with llama index um for my gen Essentials course when I did that exploration so optimized models are going to make these things a lot more accessible when we were using uh notebook LM or whatever it was called uh we saw that it was it wasn’t notebook LM it was LM Studio notebook LM is a Google product but LM Studio it was adding that extra thought processes and so so more things were happening there it was 
exhausting the machine um even on my main machine where I have an RTX 480 which was really good you could see that it ran ran well but then when we were trying to work with it directly where we didn’t have an optimized model that we were downloading um my computer was restarting so it was exhausting both my machines trying to run it though I think on this machine because I was using OBS it is using a lot of my resources but uh there’s a video that I did not add to this where I was trying to run it on VM and I was even trying to use 1.5 the 1.5 billion uh quen distilled model and it was saying I was running out of memory so you can see this stuff is really really tricky um and even with an RTX 480 and with my lunar Lake um there were challenges but there are areas that we can utilize it I don’t think we’re exactly there yet to have a full AI powered assistant with with thought and reasoning um but the RTX 480 kind of handled it if that if that’s all you’re using it for and you’re restarting those conversations um and then you’re fine tuning those some of those things down and then the lunar could do it if if we tuned it down one thing that I did say that um I realize after doing a bit more research CU I forget all the stuff that I learned but mpus are not really designed to use LMS I was saying earlier maybe there’s a way to optimize it or something but mpus are designed to run smaller models alongside your llms for your workloads so you can distribute uh a more complex AI workload so maybe you have an llm and it has a smaller model that does something like images or something something I don’t know something um and maybe you can utilize that mpus um but you know we’re not going to ever at least in the next couple years we’re not going to see anything utilizing mpus to run llms it’s really the gpus and so we are really fixed on that the igpu on the lunar Lake and then what our RTX the RTX 4080 can do um so you know maybe if I had another graphics card and I actually 
do I have a 3060 but unfortunately the computer I bought doesn’t allow me to slot in slotted in so if there was a way I could distribute the compute from this computer and my old computer or even the lunar Lake as well then I bet I could run something that is a little bit better um but you know you probably want uh like a a homebuilt computer with two graphics cards in it or you want multiple multiple uh aips that are stacked that have distributed compute um and just as as we saw that video where the person was running the uh 671 billion parameter model if you paid close attention to um the uh the the uh post it actually said in there that it was running it on 4 bit quantization so that wasn’t just the model running at its full Precision it was running it highly quantized and so quantization can be good but if it’s at four bit that’s really small and so and it was chugging along so you know the question really is is like okay even if you had seven or eight of those you’d still have to quantize it which is not easy and it’s still even it’s still slow and would the results be any good so as a example it was cool but I think that 271 billion parameter model is really far Out Of Reach um but that means we can try to Target one of these other ones like if it’s 70 70 billion billion parameter model or maybe we just want to reliably run the 7 billion building parameter model by having one extra computer and so you’re looking at depending if if you’re smart about it 1,000 ,500 and then you can uh run a model it’s not going to be as good as these as Chachi BT or Claude but it definitely will pave the way there um we’ll just have to continue to wait for these models to be optimized and for uh the hardware to improve or the cost to go down but maybe we’re just two computers away or two graphics cards away um but yeah that’s my two cents there and I’ll see you in the next one okay ciao
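The out-of-memory errors described above come down to simple arithmetic: weight memory scales with parameter count times bits per parameter. A minimal sketch (weights only; activations and the KV cache add more on top, and the model sizes are the ones mentioned in the video):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone (no activations/KV cache)."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

# DeepSeek-R1 distilled sizes from the video, at full fp16 vs. 4-bit quantization
for name, params in [("Llama-8B", 8e9), ("Qwen-7B", 7e9), ("Qwen-1.5B", 1.5e9)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at fp16, ~{q4:.1f} GB at 4-bit")
```

This is why the unquantized 8B safetensors download would not fit on an 8 GB card, while the GGUF builds Ollama pulls (typically 4-bit or similar) do, and why even the 671B showcase runs were quantized to 4 bits.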

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • ChatGPT for Data Analytics: A Beginner’s Tutorial

    ChatGPT for Data Analytics: A Beginner’s Tutorial

    ChatGPT for Data Analytics: FAQ

    1. What is ChatGPT and how can it be used for data analytics?

    ChatGPT is a powerful language model developed by OpenAI. For data analytics, it can be used to automate tasks, generate code, analyze data, and create visualizations. ChatGPT can understand and respond to complex analytical questions, perform statistical analysis, and even build predictive models.

    2. What are the different ChatGPT subscription options and which one is recommended for this course?

    There are two main options: ChatGPT Plus and ChatGPT Enterprise. ChatGPT Plus, costing around $20 per month, provides access to the most advanced models, including GPT-4, plugins, and advanced data analysis capabilities. ChatGPT Enterprise is designed for organizations handling sensitive data and offers enhanced security features. ChatGPT Plus is recommended for this course.

    3. What are “prompts” in ChatGPT, and how can I write effective prompts for data analysis?

    A prompt is an instruction or question given to ChatGPT. An effective prompt includes both context (e.g., “I’m a data analyst working on sales data”) and a task (e.g., “Calculate the average monthly sales for each region”). Clear and specific prompts yield better results.
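The context-plus-task pattern can be sketched as a tiny helper; the wording below simply reuses the examples from this answer:

```python
def build_prompt(context: str, task: str) -> str:
    """Combine background context with a specific task into one focused prompt."""
    return f"{context.strip()} {task.strip()}"

prompt = build_prompt(
    "I'm a data analyst working on sales data.",
    "Calculate the average monthly sales for each region.",
)
print(prompt)
```

Keeping the two parts separate makes it easy to reuse one piece of context across many tasks, which is essentially what ChatGPT's Custom Instructions do for you automatically.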

    4. How can I make ChatGPT understand my specific needs and preferences for data analysis?

    ChatGPT offers “Custom Instructions” in the settings. Here, you can provide information about yourself and your desired response style. For example, you can specify that you prefer concise answers, data visualizations, or a specific level of technical detail.

    5. Can ChatGPT analyze images, such as graphs and charts, for data insights?

    Yes! ChatGPT’s advanced models have image understanding capabilities. You can upload an image of a graph, and ChatGPT can interpret its contents, extract data points, and provide insights. It can even interpret complex visualizations like box plots and data models.

    6. What is the Advanced Data Analysis plugin, and how do I use it?

    The Advanced Data Analysis plugin allows you to upload datasets directly to ChatGPT. You can import files like CSVs, Excel spreadsheets, and JSON files. Once uploaded, ChatGPT can perform statistical analysis, generate visualizations, clean data, and even build machine learning models.

    7. What are the limitations of ChatGPT for data analysis, and are there any security concerns?

    ChatGPT has limitations in terms of file size uploads and internet access. It may struggle with very large datasets or require workarounds. Regarding security, it’s not recommended to upload sensitive data to ChatGPT Plus. ChatGPT Enterprise offers a more secure environment for handling confidential information.

    8. How can I learn more about using ChatGPT for data analytics and get hands-on experience?

    This FAQ provides a starting point, but to go deeper, consider enrolling in a dedicated course on “ChatGPT for Data Analytics.” Such courses offer comprehensive guidance, practical exercises, and access to instructors who can answer your specific questions.

    ChatGPT for Data Analytics: A Study Guide

    Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What are the two main ChatGPT subscription options discussed and who are they typically used by?
    2. Why is ChatGPT Plus often preferred over the free version for data analytics?
    3. What is the significance of “context” and “task” when formulating prompts for ChatGPT?
    4. How can custom instructions in ChatGPT enhance the user experience and results?
    5. Explain the unique application of ChatGPT’s image recognition capabilities in data analytics.
    6. What limitation of ChatGPT’s image analysis is highlighted in the tutorial?
    7. What is the primary advantage of the Advanced Data Analysis plugin in ChatGPT?
    8. Describe the potential issue of environment timeout when using the Advanced Data Analysis plugin and its workaround.
    9. Why is caution advised when uploading sensitive data to ChatGPT Plus?
    10. What is the recommended solution for handling secure and confidential data in ChatGPT?

    Answer Key

    1. The two options are ChatGPT Plus, used by freelancers, contractors, and job seekers, and ChatGPT Enterprise, used by companies for their employees.
    2. ChatGPT Plus offers access to the latest models (like GPT-4), faster response times, plugins, and advanced data analysis, all crucial for data analytics tasks.
    3. Context provides background information (e.g., “I am a marketing analyst”) while task specifies the action (e.g., “analyze this dataset”). Together, they create focused prompts for relevant results.
    4. Custom instructions allow users to set their role and preferred response style, ensuring consistent, personalized results without repeating context in every prompt.
    5. ChatGPT can analyze charts and data models from uploaded images, extracting insights and generating code, eliminating manual interpretation.
    6. ChatGPT cannot directly analyze graphs included within code output. Users must copy and re-upload the image for analysis.
    7. The Advanced Data Analysis plugin allows users to upload datasets for analysis, statistical processing, predictive modeling, and data visualization, all within ChatGPT.
    8. The plugin’s environment may timeout, rendering previous files inactive. Re-uploading the file restores the environment and analysis progress.
    9. ChatGPT Plus’s data security for sensitive data, even with disabled training and history, is unclear. Uploading confidential or HIPAA-protected information is discouraged.
    10. ChatGPT Enterprise offers enhanced security and compliance (e.g., SOC 2) for handling sensitive data, making it suitable for confidential and HIPAA-protected information.

    Essay Questions

    1. Discuss the importance of prompting techniques in maximizing the effectiveness of ChatGPT for data analytics. Use examples from the tutorial to illustrate your points.
    2. Compare and contrast the functionalities of ChatGPT with and without the Advanced Data Analysis plugin. How does the plugin transform the user experience for data analysis tasks?
    3. Analyze the ethical considerations surrounding the use of ChatGPT for data analysis, particularly concerning data privacy and security. Propose solutions for responsible and ethical implementation.
    4. Explain how ChatGPT’s image analysis capability can revolutionize the way data analysts approach tasks involving charts, visualizations, and data models. Provide potential real-world applications.
    5. Based on the tutorial, discuss the strengths and limitations of ChatGPT as a tool for data analytics. How can users leverage its strengths while mitigating its weaknesses?

    Glossary

    • ChatGPT Plus: A paid subscription option for ChatGPT providing access to advanced features, faster response times, and priority access to new models.
    • ChatGPT Enterprise: A secure, compliant version of ChatGPT designed for businesses handling sensitive data with features like SOC 2 compliance and data encryption.
    • Prompt: An instruction or question given to ChatGPT to guide its response and action.
    • Context: Background information provided in a prompt to inform ChatGPT about the user’s role, area of interest, or specific requirements.
    • Task: The specific action or analysis requested from ChatGPT within a prompt.
    • Custom Instructions: A feature in ChatGPT allowing users to preset their context and preferred response style for personalized and consistent results.
    • Advanced Data Analysis Plugin: A powerful feature enabling users to upload datasets directly into ChatGPT for analysis, visualization, and predictive modeling.
    • Exploratory Data Analysis (EDA): An approach to data analysis focused on visualizing and summarizing data to identify patterns, trends, and potential insights.
    • Descriptive Statistics: Summary measures that describe key features of a dataset, including measures of central tendency (e.g., mean), dispersion (e.g., standard deviation), and frequency.
    • Machine Learning: A type of artificial intelligence that allows computers to learn from data without explicit programming, often used for predictive modeling.
    • Zip File: A compressed file format that reduces file size for easier storage and transfer.
    • CSV (Comma Separated Values): A common file format for storing tabular data where values are separated by commas.
    • SOC 2 Compliance: A set of standards for managing customer data based on security, availability, processing integrity, confidentiality, and privacy.
    • HIPAA (Health Insurance Portability and Accountability Act): A US law that protects the privacy and security of health information.

    ChatGPT for Data Analytics: A Beginner’s Guide

    Part 1: Introduction & Setup

    1. ChatGPT for Data Analytics: What You’ll Learn

    This section introduces the tutorial and highlights the potential time savings and automation benefits of using ChatGPT for data analysis.

    2. Choosing the Right ChatGPT Option

    Explains the different ChatGPT options available, focusing on ChatGPT Plus and ChatGPT Enterprise. It discusses the features, pricing, and ideal use cases for each option.

    3. Setting up ChatGPT Plus

    Provides a step-by-step guide on how to upgrade to ChatGPT Plus, emphasizing the need for this paid version for accessing advanced features essential to the course.

    4. Understanding the ChatGPT Interface

    Explores the layout and functionality of ChatGPT, including the sidebar, chat history, settings, and the “Explore” menu for custom-built GPT models.

    5. Mastering Basic Prompting Techniques

    Introduces the concept of prompting and its importance for effective use of ChatGPT. It emphasizes the need for context and task clarity in prompts and provides examples tailored to different user personas.

    6. Optimizing ChatGPT with Custom Instructions

    Explains how to personalize ChatGPT’s responses using custom instructions for context and desired output format.

    7. Navigating ChatGPT Settings for Optimal Performance

    Details the essential settings within ChatGPT, including custom instructions, beta features (plugins, Advanced Data Analysis), and data privacy options.

    Part 2: Image Analysis and Advanced Data Analysis

    8. Leveraging ChatGPT’s Vision Capabilities for Data Analysis

    Introduces ChatGPT’s ability to analyze images, focusing on its application in interpreting data visualizations and data models.

    9. Understanding the Advanced Data Analysis Plugin

    Introduces the Advanced Data Analysis plugin and its potential for automating various data analysis tasks. It also addresses the plugin’s timeout issue and workarounds.

    10. Connecting to Data Sources: Importing and Understanding Datasets

    Details how to import datasets from online sources like Kaggle, emphasizing supported file types and demonstrating the process using a dataset of data analyst job postings.

    11. Performing Descriptive Statistics and Exploratory Data Analysis

    Explores how to generate descriptive statistics and conduct exploratory data analysis (EDA) on imported datasets, using prompts for table formatting and visualization creation.

    12. Data Cleanup Techniques for Accurate Analysis

    Covers the importance of data cleanup and demonstrates how to use ChatGPT to remove unnecessary spaces and standardize column names for consistent analysis.

    13. Visualizing and Interpreting Salary Data

    Illustrates how to generate visualizations for salary data, highlighting the need for precise prompting to get accurate visualizations based on average salary and common platforms.

    14. Building a Machine Learning Model to Predict Salary

    Demonstrates how to build a machine learning model within ChatGPT to predict yearly salary based on job title, platform, and location, including model selection and error interpretation.

    15. Limitations of the Advanced Data Analysis Plugin

    Discusses the limitations of the Advanced Data Analysis plugin, including file size restrictions, internet access limitations, and data security concerns, particularly with sensitive data. It introduces the Notable plugin and ChatGPT Enterprise as potential solutions.

    Conclusion

    16. Wrapping Up and Next Steps

    Summarizes the key takeaways of the tutorial, encouraging the application of learned skills in real-world scenarios and highlighting further learning resources like the full course on ChatGPT for Data Analytics.

    Timeline of Events

    This source is a transcript of a YouTube tutorial video, not a narrative with a chronological series of events. Therefore, a traditional timeline is not applicable.

    However, we can outline the structure of the tutorial, which progresses the viewer through the steps of using ChatGPT for data analysis.

    Tutorial Structure:

    1. Introduction: The instructor introduces the tutorial and the potential of ChatGPT for data analysis, claiming it can save data analysts up to 20 hours a week.
    2. ChatGPT Setup: The tutorial guides viewers through the different ChatGPT options (ChatGPT Plus and ChatGPT Enterprise) and explains how to set up ChatGPT Plus.
    3. Understanding ChatGPT Interface: The instructor walks through the layout and functionalities of the ChatGPT interface, highlighting key features and settings.
    4. Basic Prompting Techniques: The tutorial delves into basic prompting techniques, emphasizing the importance of providing context and a clear task for ChatGPT to generate effective responses.
    5. Custom Instructions: The instructor explains the custom instructions feature in ChatGPT, allowing users to personalize the model’s responses based on their specific needs and preferences.
    6. Image Analysis with ChatGPT: The tutorial explores ChatGPT’s ability to analyze images, including its limitations. It demonstrates the practical application of this feature for analyzing data visualizations and generating insights.
    7. Introduction to Advanced Data Analysis Plugin: The tutorial shifts to the Advanced Data Analysis plugin, highlighting its capabilities and comparing it to the basic ChatGPT model for data analysis tasks.
    8. Connecting to Data Sources: The tutorial guides viewers through importing data into ChatGPT using the Advanced Data Analysis plugin, covering supported file types and demonstrating the process with a data set of data analyst job postings from Kaggle.
    9. Descriptive Statistics and Exploratory Data Analysis (EDA): The tutorial demonstrates how to use the Advanced Data Analysis plugin for performing descriptive statistics and EDA on the imported data set, generating visualizations and insights.
    10. Data Cleanup: The instructor guides viewers through cleaning up the data set using ChatGPT, highlighting the importance of data quality for accurate analysis.
    11. Data Visualization and Interpretation: The tutorial delves into creating visualizations with ChatGPT, including interpreting the results and refining prompts to generate more meaningful insights.
    12. Building a Machine Learning Model: The tutorial demonstrates how to build a machine learning model using ChatGPT to predict yearly salary based on job title, job platform, and location. It covers model selection, evaluating model performance, and interpreting predictions.
    13. Addressing ChatGPT Limitations: The instructor acknowledges limitations of ChatGPT for data analysis, including file size limits, internet access restrictions, and data security concerns. Workarounds and alternative solutions, such as the Notable plugin and ChatGPT Enterprise, are discussed.
    14. Conclusion: The tutorial concludes by emphasizing the value of ChatGPT for data analysis and encourages viewers to explore further applications and resources.

    Cast of Characters

    • Luke Barousse: The instructor of the tutorial. He identifies as a YouTuber who creates educational content for data enthusiasts. He emphasizes the time-saving benefits of using ChatGPT in a data analyst role.
    • Data Nerds: The target audience of the tutorial, encompassing individuals who work with data and are interested in leveraging ChatGPT for their analytical tasks.
    • Sam Altman: Briefly mentioned as the former CEO of OpenAI.
    • Mira Murati: Briefly mentioned as the interim CEO of OpenAI, replacing Sam Altman.
    • ChatGPT: The central character, acting as a large language model and powerful tool for data analysis. The tutorial explores its various capabilities and limitations.
    • Advanced Data Analysis Plugin: A crucial feature within ChatGPT, enabling users to import data, perform statistical analysis, generate visualizations, and build machine learning models.
    • Notable Plugin: A plugin discussed as a workaround for certain ChatGPT limitations, particularly for handling larger datasets and online data sources.
    • ChatGPT Enterprise: An enterprise-level version of ChatGPT mentioned as a more secure option for handling sensitive and confidential data.

    Briefing Doc: ChatGPT for Data Analytics Beginner Tutorial

    Source: Excerpts from “622-ChatGPT for Data Analytics Beginner Tutorial.pdf” (likely a transcript from a YouTube tutorial)

    Main Themes:

    • ChatGPT for Data Analytics: The tutorial focuses on utilizing ChatGPT, specifically the GPT-4 model with the Advanced Data Analysis plugin, to perform various data analytics tasks efficiently.
    • Prompt Engineering: Emphasizes the importance of crafting effective prompts by providing context and specifying the desired task for ChatGPT to understand and generate relevant outputs.
    • Advanced Data Analysis Capabilities: Showcases the plugin’s ability to import and analyze data from various file types, generate descriptive statistics and visualizations, clean data, and even build predictive models.
    • Addressing Limitations: Acknowledges ChatGPT’s limitations, including knowledge cut-off dates, file size restrictions for uploads, and potential data security concerns. Offers workarounds and alternative solutions, such as the Notable plugin and ChatGPT Enterprise.

    Most Important Ideas/Facts:

    1. ChatGPT Plus/Enterprise Required: The tutorial strongly recommends using ChatGPT Plus for access to GPT-4 and the Advanced Data Analysis plugin. ChatGPT Enterprise is highlighted for handling sensitive data due to its security compliance certifications.
    • “Make sure you’re comfortable with paying that 20 bucks per month before proceeding, but just to reiterate, you do need ChatGPT Plus for this course.”
    2. Custom Instructions for Context: Setting up custom instructions within ChatGPT is crucial for providing ongoing context about the user and desired output style. This helps tailor ChatGPT’s responses to specific needs and preferences.
    • “I’m a YouTuber that makes entertaining videos for those that work with data, AKA data nerds. Give me concise answers and ignore all the niceties that OpenAI programmed you with. Use emojis liberally; use them to convey emotion or at the beginning of any bullet point. Basically, I don’t like ChatGPT’s rambling, so I use this in order to get concise answers quickly. Anyway, instead of providing this context every single time you start a new chat, ChatGPT actually has something called custom instructions.”
    3. Image Analysis for Data Insights: GPT-4’s image recognition capabilities are highlighted, showcasing how it can analyze data visualizations (graphs, charts) and data models to extract insights and generate code, streamlining complex analytical tasks.
    • “So this analysis would have normally taken me minutes if not hours to do, and now I just got this in a matter of seconds, so I’m really blown away by this feature of ChatGPT.”
    4. Data Cleaning and Transformation: The tutorial walks through using ChatGPT for data cleaning tasks, such as removing unnecessary spaces and reformatting data, to prepare datasets for further analysis.
    • “I prompted: for the location column, it appears that some values have unnecessary spaces; we need to remove these spaces to better categorize this data. And it actually did it on its own: it generated a new updated bar graph showing these locations once it cleaned them up, and now we don’t have any duplicated ‘Anywhere’ or ‘United States’ categories. It’s pretty awesome.”
    5. Predictive Modeling with ChatGPT: Demonstrates how to leverage the Advanced Data Analysis plugin to build machine learning models (like random forest) for predicting variables like salary based on job-related data.
    • “Build a machine learning model to predict yearly salary. Use job title, job platform, and location as inputs into this model. And I have at the end: what models do you suggest using for this?”
    6. Awareness of Limitations and Workarounds: Openly discusses ChatGPT’s limitations with large datasets and internet access, offering solutions like splitting files and utilizing the Notable plugin for expanded functionality.
    • “I try to upload the file and I get this message saying the file is too large; maximum file size is 512 megabytes, and that was around 250,000 rows of data. Now, one trick you can take with this, if you’re really close to that 512 megabytes, is to compress it into a zip file.”
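The zip-file trick works because CSV text is highly repetitive and compresses well. A quick standard-library sketch (the data below is invented for illustration, not the actual Kaggle file):

```python
import io
import zipfile

# Illustrative CSV-like data: repeated rows compress extremely well
csv_text = "job_title,location,salary\n" + "Data Analyst,United States,90000\n" * 10_000
raw_size = len(csv_text.encode())

# Write the CSV into an in-memory zip archive with DEFLATE compression
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("jobs.csv", csv_text)
zipped_size = len(buf.getvalue())

print(f"raw: {raw_size} bytes, zipped: {zipped_size} bytes")
```

Real-world datasets are less repetitive than this toy example, but tabular text still routinely shrinks several-fold, which is often enough to sneak under an upload limit.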

    Quotes:

    • “Data nerds, welcome to this tutorial on how to use ChatGPT for data analytics…”
    • “The Advanced Data Analysis plugin is by far one of the most powerful that I’ve seen within ChatGPT…”
    • “This was all a lot of work, and we did it without a single line of code; this is pretty awesome.”

    Overall:

    The tutorial aims to equip data professionals with the knowledge and skills to utilize ChatGPT effectively for data analysis, emphasizing the importance of proper prompting, exploring the plugin’s capabilities, and acknowledging and addressing limitations.

    ChatGPT can efficiently automate many data analysis tasks, including data exploration, cleaning, descriptive statistics, exploratory data analysis, and predictive modeling [1-3].

    Data Exploration

    • ChatGPT can analyze a dataset and provide a description of each column. For example, given a dataset of data analyst job postings, ChatGPT can identify key information like company name, location, description, and salary [4, 5].

    Data Cleaning

    • ChatGPT can identify and clean up data inconsistencies. For instance, it can remove unnecessary spaces in a “job location” column and standardize the format of a “job platform” column [6-8].
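    Behind the scenes, this kind of cleanup is a couple of lines of pandas. A sketch with made-up values, assuming column names like those in the walkthrough:

    ```python
    import pandas as pd

    # Toy frame with the inconsistencies described above (values are made up).
    df = pd.DataFrame({
        "location": ["United States ", " United States", "New York, NY"],
        "via": ["via LinkedIn", "via Indeed", "via LinkedIn"],
    })

    # Strip stray whitespace so identical locations group together again.
    df["location"] = df["location"].str.strip()

    # Standardize the "via" column into a clean "job_platform" column.
    df["job_platform"] = df["via"].str.replace("via ", "", regex=False)

    print(df["location"].nunique())        # duplicates collapse to one value
    print(list(df["job_platform"].unique()))
    ```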

    Descriptive Statistics and Exploratory Data Analysis (EDA)

    • ChatGPT can calculate and present descriptive statistics, such as count, mean, standard deviation, minimum, and maximum for numerical columns, and unique value counts and top frequencies for categorical columns. It can organize this information in an easy-to-read table format [9-11].
    • ChatGPT can also perform EDA by generating appropriate visualizations like histograms for numerical data and bar charts for categorical data. For example, it can create visualizations to show the distribution of salaries, the top job titles and locations, and the average salary by job platform [12-18].
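    In pandas terms, the statistics ChatGPT reports map onto `describe()` and `value_counts()`. A toy sketch (all numbers invented):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "salary": [90000, 110000, 95000, 120000],
        "job_title": ["Data Analyst", "Data Analyst",
                      "Senior Data Analyst", "Data Analyst"],
    })

    # Count, mean, std, min, quartiles, and max for the numerical column.
    stats = df["salary"].describe()

    # Unique values and their frequencies for the categorical column.
    counts = df["job_title"].value_counts()

    print(stats["mean"])      # average salary in the toy data
    print(counts.index[0])    # most common job title
    ```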

    Predictive Modeling

    • ChatGPT can build machine learning models to predict data. For example, it can create a model to predict yearly salary based on job title, platform, and location [19, 20].
    • It can also suggest appropriate models based on the dataset and explain the model’s performance metrics, such as root mean square error (RMSE), to assess the model’s accuracy [21-23].
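    RMSE itself is simple to compute by hand, which helps when interpreting the number ChatGPT reports. A sketch in plain Python with invented salaries:

    ```python
    import math

    # Hypothetical actual vs. predicted yearly salaries.
    actual    = [90000, 110000, 95000, 120000]
    predicted = [92000, 105000, 99000, 118000]

    # RMSE: square each residual, average the squares, take the square root.
    rmse = math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

    # RMSE is in the same units as the target, so this value means the
    # model's predictions miss by roughly $3,500 on average.
    print(rmse)  # prints 3500.0
    ```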

    It is important to note that ChatGPT has some limitations, including internet access restrictions and file size limits. It also raises data security concerns, especially when dealing with sensitive information [24].

    ChatGPT Functionality Across Different Models

    • ChatGPT Plus, the paid version, offers access to the newest and most capable models, including GPT-4. This grants users features like faster response speeds, plugins, and Advanced Data Analysis. [1]
    • ChatGPT Enterprise, primarily for companies, provides a similar interface to ChatGPT Plus but with enhanced security measures. This is suitable for handling sensitive data like HIPAA, confidential, or proprietary data. [2, 3]
    • The free version of ChatGPT relies on the GPT 3.5 model. [4]
    • The GPT-4 model offers significant advantages over the GPT 3.5 model, including:
    • Internet browsing: GPT-4 can access and retrieve information from the internet, allowing it to provide more up-to-date and accurate responses, as seen in the example where it correctly identified the new CEO of OpenAI. [5-7]
    • Advanced Data Analysis: GPT-4 excels in mathematical calculations and provides accurate results even for complex word problems, unlike GPT 3.5, which relies on language prediction and can produce inaccurate calculations. [8-16]
    • Image Analysis: GPT-4 can analyze images, including graphs and data models, extracting insights and providing interpretations. This is helpful for understanding complex visualizations or generating SQL queries based on data models. [17-27]

    Overall, the newer GPT-4 model offers more advanced capabilities, making it suitable for tasks requiring internet access, accurate calculations, and image analysis.

    ChatGPT’s Limitations and Workarounds for Data Analysis

    ChatGPT has limitations related to internet access, file size limits, and data security. These limitations can hinder data analysis tasks. However, there are workarounds to address these issues.

    Internet Access

    • ChatGPT’s Advanced Data Analysis feature cannot connect to online data sources due to security concerns. This includes databases, APIs that stream data, and online data sources like Google Sheets [1].
    • Workaround: Download the data from the online source and import it into ChatGPT [1].

    File Size Limits

    • ChatGPT has a file size limit of 512 megabytes for data imports. Attempting to upload a file larger than this limit will result in an error message [2].
    • The total data set size limit is 2 GB. [3]
    • Workarounds:
    • Compress the data file into a zip file to reduce its size. This may allow you to import files that are slightly larger than 512 MB [2].
    • Split the data into smaller files, each under the 512 MB limit, and import them separately. You can then work with the combined data within ChatGPT [3].
    • Use the Notable plugin, discussed in a later chapter of the source material, to connect to larger data sets and online data sources [3].
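    The splitting workaround can be scripted with pandas’ `chunksize` option. In this sketch the file names and the tiny chunk size are illustrative; real chunks would be sized so each output file stays under 512 MB:

    ```python
    import pandas as pd

    # Stand-in for a large export (contents are made up).
    pd.DataFrame({"title": ["Data Analyst"] * 10,
                  "salary": range(10)}).to_csv("postings.csv", index=False)

    # Stream the file in fixed-size chunks and write each chunk out
    # separately; each part can then be uploaded to ChatGPT on its own.
    paths = []
    for i, chunk in enumerate(pd.read_csv("postings.csv", chunksize=4)):
        path = f"postings_part_{i}.csv"
        chunk.to_csv(path, index=False)
        paths.append(path)

    print(paths)
    ```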

    Data Security

    • Using the free or plus versions of ChatGPT for sensitive data, such as proprietary data, confidential data, or HIPAA-protected health information, raises security concerns. This is because data in these versions can potentially be used to train ChatGPT models, even if chat history is turned off [4, 5].
    • Workaround: Consider using ChatGPT Enterprise Edition for secure data analysis. This edition is designed for handling sensitive data, with certifications like SOC 2 to ensure data security. Data in this edition is not used for training [5, 6].

    It is important to note that these limitations and workarounds are based on the information provided in the sources, which may not be completely up-to-date. It is always recommended to verify the accuracy of this information with ChatGPT and OpenAI documentation.

    ChatGPT Plus and ChatGPT Enterprise

    The sources provide information about ChatGPT Plus and ChatGPT Enterprise, two options for accessing ChatGPT.

    ChatGPT Plus

    ChatGPT Plus is the paid version of ChatGPT, costing about $20 per month in the United States [1]. It offers several benefits over the free version:

    • Access to Newer Models: ChatGPT Plus subscribers have access to the newest and most capable language models, including GPT-4 [1]. This model has features like internet browsing, Advanced Data Analysis, and image analysis, which are not available in the free version [2-5].
    • Faster Response Speeds: ChatGPT Plus provides faster response times compared to the free version [6].
    • Access to Plugins: ChatGPT Plus allows users to access plugins that extend the functionality of ChatGPT [3]. One example mentioned is the Notable plugin, which is useful for working with large datasets and connecting to online data sources [7, 8].

    ChatGPT Plus is a suitable option for freelancers, contractors, job seekers, and individuals within companies who need access to the advanced features of GPT-4 and plugins [1].

    ChatGPT Enterprise

    ChatGPT Enterprise is designed for companies and organizations [3]. It provides a similar interface to ChatGPT Plus but with enhanced security features [3].

    • Enhanced Security: ChatGPT Enterprise solves data security problems by offering a secure environment for handling sensitive data, including HIPAA-protected data, confidential information, and proprietary data [9].
    • Compliance: ChatGPT Enterprise is SOC 2 compliant, meeting the same security compliance standards as many cloud providers like Google Cloud and Amazon Web Services [10]. This makes it suitable for organizations that require strict data security measures.

    While the sources don’t specify the cost of ChatGPT Enterprise, they imply that companies purchase a subscription, and employees access it through the company’s service [3].

    Choosing Between ChatGPT Plus and ChatGPT Enterprise

    The choice between ChatGPT Plus and ChatGPT Enterprise depends on the user’s needs and the type of data being analyzed.

    • Individual users or those working with non-sensitive data may find ChatGPT Plus sufficient.
    • Organizations dealing with sensitive data should consider ChatGPT Enterprise to ensure data security and compliance.

    The sources also mention that ChatGPT Enterprise is a worthwhile investment for companies looking to implement a powerful data analysis tool [11].

    Here are the key features of ChatGPT Plus as described in the sources and our conversation history:

    • Access to the newest and most capable models, including GPT-4: ChatGPT Plus subscribers get to use the latest and greatest large language models, like GPT-4. This access gives them an advantage in leveraging the most advanced capabilities of ChatGPT, including internet browsing, Advanced Data Analysis, and image analysis [1, 2]. These features are not available in the free version, which relies on the older GPT 3.5 model [3, 4].
    • Faster response speeds: Compared to the free version of ChatGPT, ChatGPT Plus offers faster response times [2]. This means less waiting for the model to generate text and process information.
    • Access to plugins: ChatGPT Plus users can utilize plugins to expand the functionality of ChatGPT [2]. A notable example mentioned in the sources is the “Notable plugin”, designed for managing and exploring large datasets and connecting to online data sources [5-7]. This overcomes some limitations of the built-in Advanced Data Analysis feature, specifically the restrictions on accessing online data sources and handling large files [8, 9].

    The sources emphasize that ChatGPT Plus caters to various users, including freelancers, contractors, job seekers, and individuals within companies [1]. These individuals can benefit from GPT-4’s advanced features and plugin access to enhance their productivity and efficiency.

    Key Differences Between ChatGPT Plus and ChatGPT Enterprise

    The sources highlight the distinct features and target users of ChatGPT Plus and ChatGPT Enterprise. Both offer access to ChatGPT, but cater to different user needs and data security requirements.

    ChatGPT Plus is marketed as a paid upgrade from the free version of ChatGPT, primarily aimed at individuals.

    • Pricing: ChatGPT Plus costs $20 per month in the United States. [1]
    • Target Users: The sources identify freelancers, contractors, job seekers, and individuals within companies as the intended user base for ChatGPT Plus. [1] This suggests that ChatGPT Plus caters to individuals seeking enhanced features and faster response times for personal or professional projects.

    ChatGPT Enterprise focuses on addressing the security and compliance needs of organizations dealing with sensitive data.

    • Pricing: The sources do not explicitly state the cost of ChatGPT Enterprise, but imply that companies subscribe to the service, with employees accessing it through the company’s account. [2-4]
    • Target Users: ChatGPT Enterprise targets companies and organizations, particularly those handling sensitive data, such as HIPAA-protected health information, confidential information, and proprietary data. [2-4]

    Here’s a breakdown of the key differences between the two options:

    Feature Comparison

    | Feature | ChatGPT Plus | ChatGPT Enterprise |
    | --- | --- | --- |
    | Pricing | $20 per month in the United States | Not specified; purchased by companies for their employees |
    | Target users | Freelancers, contractors, job seekers, individuals within companies | Companies and organizations handling sensitive data |
    | Data security | Security concerns remain for sensitive data, even with chat history off | Secure environment; SOC 2 compliant; data not used for training |

    The decision between ChatGPT Plus and ChatGPT Enterprise depends on the user’s needs and the nature of the data involved:

    • ChatGPT Plus: Suitable for individuals working with non-sensitive data and requiring the advanced capabilities of GPT-4 and plugins.
    • ChatGPT Enterprise: Ideal for organizations handling sensitive data, prioritizing data security, and needing compliance with industry standards.

    The sources suggest that ChatGPT Enterprise offers significant value for companies looking to leverage ChatGPT as a powerful data analysis tool while ensuring data protection and compliance.

    Key Features of ChatGPT Plus

    The sources highlight ChatGPT Plus as a paid subscription service that offers several advantages over the free version of ChatGPT, specifically targeting individual users who require advanced features and faster processing.

    Here are the key features of ChatGPT Plus:

    • Access to Newer Models, Including GPT-4: ChatGPT Plus subscribers gain access to the most recent and capable language models, notably GPT-4. This access grants them the ability to utilize features not available in the free version, including internet browsing, Advanced Data Analysis, and image analysis. [1-5] These features significantly enhance the functionality of ChatGPT, enabling it to perform more complex and insightful analyses. For instance, GPT-4’s ability to browse the internet allows it to provide up-to-date information, as demonstrated in the example where it correctly identifies the new CEO of OpenAI, unlike the free version that relies on the older GPT 3.5 model with a knowledge cutoff. [6-8] The Advanced Data Analysis capability of GPT-4 ensures accurate mathematical calculations, even for complex word problems, unlike the GPT 3.5 model that relies on language prediction and can produce inaccurate calculations. [9-13] Furthermore, GPT-4’s image analysis capability enables it to analyze images, including graphs and data models, extracting insights and providing interpretations, a feature not available in the free version. [4, 14-23]
    • Faster Response Speeds: ChatGPT Plus boasts faster response times compared to the free version. [2] This translates to reduced waiting time for the model to generate text, process information, and perform analyses.
    • Access to Plugins: ChatGPT Plus users can access plugins that extend the functionality of ChatGPT. [24] The sources specifically mention the “Notable plugin” as an example. [25-28] This plugin is valuable for working with large datasets and connecting to online data sources, overcoming the limitations of the built-in Advanced Data Analysis feature regarding file size and internet access restrictions. [27, 29, 30]

    The sources position ChatGPT Plus as a suitable option for various individual users, including freelancers, contractors, job seekers, and individuals within companies. [1] By subscribing to ChatGPT Plus, these users can leverage the advanced features of GPT-4 and plugins to improve their productivity and efficiency in their respective fields.

    Key Distinctions Between ChatGPT Plus and ChatGPT Enterprise

    The sources primarily focus on ChatGPT Plus, providing a detailed explanation of its features and benefits. They offer less information on ChatGPT Enterprise, but emphasize its focus on enhanced security for businesses handling sensitive data.

    Here’s a breakdown of their key differences:

    Target Audience and Use Cases

    • ChatGPT Plus: Primarily targets individuals, such as freelancers, contractors, job seekers, and individuals within companies who require the advanced features of GPT-4 and plugins. It is suitable for personal or professional projects involving non-sensitive data. [1, 2]
    • ChatGPT Enterprise: Designed for companies and organizations, especially those dealing with sensitive data like HIPAA-protected data, confidential information, and proprietary data. [2-4]

    Pricing

    • ChatGPT Plus: $20 per month in the United States. [5, 15]
    • ChatGPT Enterprise: Pricing not specified in the sources, but it is purchased by companies for their employees to use. [3]

    Security Focus

    • ChatGPT Plus: While users can disable chat history to prevent their data from being used for training, the sources raise concerns about the security of proprietary, confidential, or HIPAA-protected data in the Plus version. [2, 12, 13]
    • ChatGPT Enterprise: Specifically designed to address data security concerns. It provides a secure environment for sensitive data and is SOC 2 compliant, offering assurance that the data is handled responsibly and securely. [2, 4, 14]

    Choosing the Right Option

    The choice between ChatGPT Plus and ChatGPT Enterprise hinges on the user’s needs and the sensitivity of the data.

    • For individuals working with non-sensitive data and requiring GPT-4’s advanced features and plugins, ChatGPT Plus is a suitable option. [1, 2]
    • For organizations handling sensitive data and requiring stringent security measures and compliance, ChatGPT Enterprise is the recommended choice. [2-4]

    The sources highlight the value proposition of ChatGPT Enterprise for companies seeking a robust data analysis tool with enhanced security and compliance features. [16] They also suggest contacting company management to explore the feasibility of implementing ChatGPT Enterprise if its features align with the organization’s needs. [16]

    Limitations of ChatGPT’s Advanced Data Analysis

    While ChatGPT’s Advanced Data Analysis offers powerful capabilities for data analysis tasks, the sources point out several limitations, particularly concerning internet access, data size limitations, and security considerations.

    Restricted Internet Access

    ChatGPT’s Advanced Data Analysis feature cannot directly connect to online data sources for security reasons [1]. This limitation prevents users from directly analyzing data from online databases, APIs that stream data, or even cloud-based spreadsheets like Google Sheets [1]. To analyze data from these sources, users must first download the data and then upload it to ChatGPT [1].

    This restriction can be inconvenient and time-consuming, particularly when dealing with frequently updated data or large datasets that require constant access to the online source. It also hinders the ability to perform real-time analysis on streaming data, limiting the potential applications of Advanced Data Analysis in dynamic data environments.

    File Size Limitations

    ChatGPT’s Advanced Data Analysis feature has restrictions on the size of data files that can be uploaded and analyzed [2]. The maximum file size allowed is 512 megabytes [2]. In the example provided, attempting to upload a CSV file larger than this limit results in an error message [2]. This limitation can be problematic when working with large datasets common in many data analysis scenarios.

    While there is a total dataset size limit of 2 GB, users must split larger datasets into smaller files to upload them to ChatGPT [3]. This workaround can be cumbersome, especially for datasets with millions of rows. It also necessitates additional steps for combining and processing the results from analyzing the separate files, adding complexity to the workflow.
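    Recombining the pieces after a split is equally scriptable. A sketch with hypothetical part files, stacking them back into one frame with `pd.concat`:

    ```python
    import glob
    import pandas as pd

    # Create hypothetical part files, as produced by splitting a large dataset.
    for i in range(3):
        pd.DataFrame({"salary": [90000 + i, 91000 + i]}).to_csv(
            f"part_{i}.csv", index=False)

    # Read every part and stack them back into a single frame for analysis.
    parts = [pd.read_csv(p) for p in sorted(glob.glob("part_*.csv"))]
    combined = pd.concat(parts, ignore_index=True)

    print(len(combined))  # all rows from all parts
    ```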

    Data Security Concerns

    The sources raise concerns regarding data security when using ChatGPT Plus, particularly for sensitive data [4, 5]. Even with chat history turned off to prevent data from being used for training, there is no guarantee that proprietary, confidential, or HIPAA-protected data is fully secure in the Plus version [5].

    This lack of clarity regarding data protection in ChatGPT Plus raises concerns for organizations handling sensitive information. Uploading such data to ChatGPT Plus might expose it to potential risks, even if unintentional. The sources advise against uploading sensitive data to ChatGPT Plus until clear assurances and mechanisms are in place to guarantee its security and confidentiality.

    The sources suggest ChatGPT Enterprise as a more secure option for handling sensitive data [6]. ChatGPT Enterprise is designed with enhanced security measures to prevent data use for training and is SOC 2 compliant [6]. This compliance standard, similar to those followed by major cloud providers, offers a higher level of assurance regarding data security and responsible handling [6].

    The sources recommend contacting company management to discuss implementing ChatGPT Enterprise if the organization deals with sensitive data and requires a secure and compliant environment for data analysis [7]. This proactive approach ensures that data security is prioritized and that the chosen version of ChatGPT aligns with the organization’s security policies and requirements.

    Notable Plugin as a Workaround

    The sources mention the Notable plugin as a potential workaround for the internet access and file size limitations of the Advanced Data Analysis feature [3, 8]. This plugin enables connecting to online data sources and handling larger datasets, overcoming some of the constraints of the built-in feature [8].

    The Notable plugin appears to offer a more flexible and robust solution for data analysis within ChatGPT. Its ability to connect to external data sources and manage larger datasets expands the possibilities for data analysis tasks, enabling users to work with a wider range of data sources and volumes.

    However, the sources do not provide specific details about the Notable plugin’s features, capabilities, or security considerations. It is essential to consult the plugin’s documentation and explore its functionality further to assess its suitability for specific data analysis tasks and data security requirements.

    Supported File Types for ChatGPT’s Advanced Data Analysis

    The sources offer a glimpse into the file types compatible with ChatGPT’s Advanced Data Analysis. However, the information is not presented as a definitive list, and it emphasizes that prompting ChatGPT effectively is crucial for uncovering the full range of supported file types.

    Initially, when asked about compatible file types, ChatGPT lists only CSV, Excel, and JSON [1]. The user recognizes this as an incomplete response and prompts for a more comprehensive list, leading to the revelation that the feature supports a broader range of file types [1].

    Expanded List of File Types

    The expanded list includes:

    • Databases: The specific database types are not mentioned, but this suggests compatibility with common database formats like SQL databases.
    • SPSS and SAS files: This indicates support for data files commonly used in statistical analysis and research.
    • HTML: Support for HTML files suggests potential for web scraping and extracting data from web pages. [1]

    Importance of Prompting and Potential for More

    The sources emphasize that prompting ChatGPT effectively is crucial for discovering the full extent of its capabilities, including the range of supported file types [1]. The initial limited response regarding file types highlights the importance of being specific and persistent in prompting ChatGPT to obtain more comprehensive information.

    This interaction suggests that ChatGPT’s Advanced Data Analysis might support even more file types than explicitly mentioned in the sources. Further exploration and experimentation with different file types and prompts might reveal additional supported formats.

    Key Takeaways

    • ChatGPT’s Advanced Data Analysis supports a variety of file types commonly used in data analysis.
    • The provided list of file types might not be exhaustive.
    • Effective prompting is key to unlocking the full potential of ChatGPT and discovering all supported file types.

    The sources encourage users to experiment with different file types and prompts to discover the full range of compatible formats for ChatGPT’s Advanced Data Analysis. They suggest that this feature might possess capabilities beyond those explicitly described, waiting to be uncovered through exploration and effective interaction with the language model.

    A Typical Data Analysis Workflow Using ChatGPT

    The sources provide a detailed walkthrough of a data analysis project using ChatGPT, outlining a step-by-step workflow that leverages the platform’s capabilities for data import, exploration, cleaning, visualization, and predictive modeling. Here’s a breakdown of the typical steps involved:

    1. Data Acquisition and Import

    • Identify and Download Dataset: Begin by selecting a dataset relevant to your analysis goals. The sources demonstrate this using a dataset of data analyst job postings from Kaggle, a platform known for hosting diverse datasets [1].
    • Import Dataset into ChatGPT: Utilize ChatGPT’s Advanced Data Analysis plugin to import the downloaded dataset. The plugin supports various file types, including CSV, Excel, JSON, database formats, SPSS, SAS, and HTML [2, 3]. The sources emphasize that prompting ChatGPT effectively is crucial to uncovering the full range of supported file types [3].

    2. Data Exploration and Understanding

    • Explore Data Structure and Columns: Once imported, prompt ChatGPT to provide information about the dataset, including a description of each column and their data types [4]. This step helps understand the dataset’s composition and identify potential areas for cleaning or transformation.
    • Perform Descriptive Statistics: Request ChatGPT to calculate descriptive statistics for each column, such as count, mean, standard deviation, minimum, maximum, and frequency. The sources recommend organizing these statistics into tables for easier comprehension [5, 6].
    • Conduct Exploratory Data Analysis (EDA): Visualize the data using appropriate charts and graphs, such as histograms for numerical data and bar charts for categorical data. This step helps uncover patterns, trends, and relationships within the data [7]. The sources highlight the use of histograms to understand salary distributions and bar charts to analyze job titles, locations, and job platforms [8, 9].

    3. Data Cleaning and Preparation

    • Identify and Address Data Quality Issues: Based on the insights gained from descriptive statistics and EDA, pinpoint columns requiring cleaning or transformation [10]. This might involve removing unnecessary spaces, standardizing formats, handling missing values, or recoding categorical variables.
    • Prompt ChatGPT for Data Cleaning Tasks: Provide specific instructions to ChatGPT for cleaning the identified columns. The sources showcase this by removing spaces in the “Location” column and standardizing the “Via” column to “Job Platform” [11, 12].

    4. In-Depth Analysis and Visualization

    • Formulate Analytical Questions: Define specific questions you want to answer using the data [13]. This step guides the subsequent analysis and visualization process.
    • Visualize Relationships and Trends: Create visualizations that help answer your analytical questions. This might involve exploring relationships between variables, comparing distributions across different categories, or uncovering trends over time. The sources demonstrate this by visualizing average salaries across different job platforms, titles, and locations [14, 15].
    • Iterate and Refine Visualizations: Based on initial visualizations, refine prompts and adjust visualization types to gain further insights. The sources emphasize the importance of clear and specific instructions to ChatGPT to obtain desired visualizations [16].
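    The “average salary by job platform” view described above reduces to a groupby aggregation. A sketch with invented numbers:

    ```python
    import pandas as pd

    # Toy postings data (values are made up).
    df = pd.DataFrame({
        "job_platform": ["LinkedIn", "Indeed", "LinkedIn", "Indeed"],
        "salary": [100000, 90000, 110000, 80000],
    })

    # Average salary per platform, highest first: the aggregation behind
    # a bar chart of average salary by job platform.
    avg = (df.groupby("job_platform")["salary"]
             .mean()
             .sort_values(ascending=False))

    print(avg)
    ```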

    5. Predictive Modeling

    • Define Prediction Goal: Specify the variable you want to predict using machine learning. The sources focus on predicting yearly salary based on job title, job platform, and location [17].
    • Request Model Building and Selection: Prompt ChatGPT to build a machine learning model using the chosen variables as inputs. Allow ChatGPT to suggest appropriate model types based on the dataset’s characteristics [17]. The sources illustrate this by considering Random Forest, Gradient Boosting, and Linear Regression, ultimately selecting Random Forest based on ChatGPT’s recommendation [18].
    • Evaluate Model Performance: Assess the accuracy of the built model using metrics like root mean square error (RMSE). Seek clarification from ChatGPT on interpreting these metrics to understand the model’s prediction accuracy [19].
    • Test and Validate Predictions: Provide input values to ChatGPT based on the model’s variables and obtain predicted outputs [20]. Compare these predictions with external sources or benchmarks to validate the model’s reliability. The sources validate salary predictions against data from Glassdoor, a website that aggregates salary information [20].

    6. Interpretation and Communication

    • Summarize Key Findings: Consolidate the insights gained from the analysis, including descriptive statistics, visualizations, and model predictions [21]. This step provides a concise overview of the data’s key takeaways.
    • Communicate Results Effectively: Present the findings in a clear and understandable format, using visualizations, tables, and concise explanations. Tailor the communication style to the target audience, whether it’s fellow data analysts, business stakeholders, or a wider audience.

    Limitations to Consider

    While ChatGPT’s Advanced Data Analysis offers a streamlined workflow for many data analysis tasks, it’s crucial to be mindful of its limitations, as highlighted in the sources:

    • Restricted Internet Access: Inability to connect directly to online data sources necessitates downloading data before importing [22].
    • File Size Limitations: Maximum file size of 512 MB requires splitting larger datasets into smaller files for upload [23].
    • Data Security Concerns: Lack of clarity regarding data protection in ChatGPT Plus raises concerns for sensitive data. ChatGPT Enterprise offers enhanced security and compliance features [24, 25].

    These limitations highlight the importance of considering the data’s size, sensitivity, and accessibility when deciding to utilize ChatGPT for data analysis.

    Conclusion

    ChatGPT’s Advanced Data Analysis plugin offers a powerful and accessible tool for streamlining the data analysis process. The workflow outlined in the sources demonstrates how ChatGPT can be leveraged to efficiently explore, clean, visualize, and model data, empowering users to extract valuable insights and make informed decisions. However, users must remain cognizant of the platform’s limitations and exercise caution when handling sensitive data.

    Limitations of ChatGPT

    The sources describe several limitations of ChatGPT, particularly concerning its Advanced Data Analysis plugin. These limitations revolve around internet access, file size restrictions, and data security.

    Internet Access Restrictions

    ChatGPT’s Advanced Data Analysis plugin, designed for data manipulation and analysis, cannot directly access online data sources due to security concerns [1]. This limitation prevents users from directly connecting to databases in the cloud, APIs that stream data, or online spreadsheets like Google Sheets [1]. Users must download data from these sources and then upload it into ChatGPT for analysis. This restriction highlights a potential inconvenience, especially when dealing with frequently updated or real-time data sources.

    File Size Limitations

    The Advanced Data Analysis plugin imposes a maximum file size limit of 512 MB [2]. Attempting to upload files larger than this limit will result in an error message, preventing the data from being imported [2]. While the plugin allows for a total dataset size of 2 GB, users must divide larger datasets into multiple smaller CSV files to circumvent the individual file size limitation [3]. This constraint might pose challenges when working with extensive datasets common in various data analysis scenarios.

    Data Security Concerns

    The sources express concerns about data security, especially when using the ChatGPT Plus plan [4, 5]. While users can disable chat history to prevent their data from being used to train ChatGPT models, the sources indicate that the level of data protection remains unclear [5]. They advise against uploading sensitive data, such as proprietary information, confidential data, or data protected by regulations like HIPAA, when using the ChatGPT Plus plan [5].

    ChatGPT Enterprise as a Potential Solution

    The sources suggest ChatGPT Enterprise as a more secure option for handling sensitive data [5, 6]. This enterprise edition boasts enhanced security and compliance features, including certifications like SOC 2, designed to ensure data protection [6]. Unlike ChatGPT Plus, data uploaded to ChatGPT Enterprise is not utilized for training models, providing greater assurance for users dealing with sensitive information [6].

    Notable Plugin as a Workaround

    The sources mention the Notable plugin as a potential workaround for the internet access and file size limitations [3]. This plugin offers capabilities to connect to online data sources and handle larger datasets, effectively addressing two key limitations of the Advanced Data Analysis plugin [7]. However, the sources do not provide details on how the Notable plugin overcomes these limitations.

    Evolving Capabilities

    It is important to note that ChatGPT is a rapidly evolving platform, and its features and limitations are subject to change [8]. OpenAI continuously updates and improves ChatGPT, and the sources acknowledge that these limitations might be addressed in future updates [9]. The mention of videos regarding limitations potentially being removed in the future indicates the dynamic nature of the platform [9, 10]. Users are encouraged to stay updated with the latest developments and explore new features as they become available.

    Here are three limitations of ChatGPT’s data analysis capabilities described in the sources:

    • Limited Internet Access: The Advanced Data Analysis plugin, integral to ChatGPT’s data analysis capabilities, operates under security restrictions that prevent it from directly accessing online data sources. This limitation necessitates a manual process where users must download data from sources like cloud databases, APIs, or online spreadsheets and then upload the data into ChatGPT for analysis [1]. This restriction poses a challenge when dealing with data sources that are frequently updated or provide real-time information.
    • File Size Restrictions: The Advanced Data Analysis plugin enforces a maximum file size limit of 512 MB for uploads [2]. This restriction requires users to segment datasets larger than 512 MB into smaller CSV files before uploading, even though the plugin supports a total dataset size of up to 2 GB. This file segmentation requirement adds complexity to the data import process, particularly when handling very large datasets.
    • Ambiguous Data Security: While ChatGPT offers the option to disable chat history to prevent data from being used for model training, concerns remain regarding the overall security of data, particularly for users of the ChatGPT Plus plan [3]. The sources advise against uploading sensitive information, such as confidential data, proprietary data, or data subject to regulations like HIPAA, when using the ChatGPT Plus plan due to uncertainties about data protection. This lack of clarity regarding data security might deter users from leveraging ChatGPT for tasks involving sensitive information.

    Here are two options for accessing ChatGPT, based on the information in the sources:

    • ChatGPT Plus: This paid subscription option offers users access to the newest and most capable language model (GPT-4 at the time the sources were created). It includes faster response speeds, access to plugins, and the Advanced Data Analysis feature. In the US, ChatGPT Plus costs about $20 per month. The sources note that ChatGPT Plus is a popular option for freelancers, contractors, job seekers, and even some individuals within companies. [1, 2]
    • ChatGPT Enterprise: This option is similar to ChatGPT Plus but is accessed through a separate service, primarily for companies. With ChatGPT Enterprise, a company pays for access, and its employees can then use the platform. ChatGPT Enterprise addresses concerns about data security and is designed to handle sensitive data, including HIPAA, confidential, and proprietary data. ChatGPT Plus does not offer the same level of security, although the sources outline ways to safeguard data when using this version. [3, 4]

    Image Analysis Capabilities of ChatGPT

    The sources detail how ChatGPT, specifically the GPT-4 model, can analyze images, going beyond its text-based capabilities. This feature opens up unique use cases for data analytics, allowing ChatGPT to interpret visual data like graphs and charts.

    Analyzing Images for Insights

    The sources illustrate this capability with an example where ChatGPT analyzes a bar chart depicting the top 10 in-demand skills for various data science roles. The model successfully identifies patterns, like similarities in skill requirements between data engineers and data scientists. This analysis, which could have taken a human analyst significant time, is completed by ChatGPT in seconds, highlighting the potential time savings offered by this feature.

    Interpreting Unfamiliar Graphs

    The sources suggest that ChatGPT can be particularly helpful in interpreting unfamiliar graphs, such as box plots. By inputting the image and prompting the model with a request like, “Explain this graph to me like I’m 5 years old,” users can receive a simplified explanation, making complex visualizations more accessible. This function can be valuable for users who may not have expertise in specific graph types or for quickly understanding complex data representations.

    Working with Data Models

    ChatGPT’s image analysis extends beyond graphs to encompass data models. The sources demonstrate this with an example where the model interprets a data model screenshot from Power BI, a business intelligence tool. When prompted with a query related to sales analysis, ChatGPT utilizes the information from the data model image to generate a relevant SQL query. This capability can significantly aid users in navigating and querying complex datasets represented visually.
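
The sources do not show the actual Power BI model or the query ChatGPT produced, so the following is a hypothetical reconstruction: a two-table sales model in an in-memory SQLite database, and the kind of aggregate join query ChatGPT might generate for a sales-analysis prompt. All table and column names here are assumptions for illustration.

```python
import sqlite3

# Hypothetical two-table model (Sales joined to Products), standing in
# for the data model screenshot described in the source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE Sales (sale_id INTEGER PRIMARY KEY,
                        product_id INTEGER REFERENCES Products(product_id),
                        amount REAL);
    INSERT INTO Products VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO Sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 200.0);
""")

# A query of the shape ChatGPT might generate for "total sales by category"
query = """
    SELECT p.category, SUM(s.amount) AS total_sales
    FROM Sales s
    JOIN Products p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY total_sales DESC;
"""
for category, total in conn.execute(query):
    print(category, total)
```

The value of the image-analysis feature is that ChatGPT infers the join keys and table relationships from the screenshot, so the user does not have to read the model themselves before writing such a query.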

    Requirements and Limitations

    The sources emphasize that this image analysis feature is only available in the most advanced GPT-4 model. Users need to ensure they are using this model and have the “Advanced Data Analysis” feature enabled.

    While the sources showcase successful examples, it is important to note that ChatGPT’s image analysis capabilities may still have limitations. The sources describe an instance where ChatGPT initially struggled to analyze a graph provided as an image and required specific instructions to understand that it needed to interpret the visual data. This instance suggests that the model’s image analysis may not always be perfect and might require clear and specific prompts from the user to function effectively.

    Improving Data Analysis Workflow with ChatGPT

    The sources, primarily excerpts from a tutorial on using ChatGPT for data analysis, describe how the author leverages ChatGPT to streamline and enhance various stages of the data analysis process.

    Automating Repetitive Tasks

    The tutorial highlights ChatGPT’s ability to automate tasks often considered tedious and time-consuming for data analysts. This automation is particularly evident in:

    • Descriptive Statistics: The author demonstrates how ChatGPT can efficiently generate descriptive statistics for each column in a dataset, presenting them in a user-friendly table format. This capability eliminates the need for manual calculations and formatting, saving analysts significant time and effort.
    • Exploratory Data Analysis (EDA): The author utilizes ChatGPT to create various visualizations for EDA, such as histograms and bar charts, based on prompts that specify the desired visualization type and the data to be represented. This automation facilitates a quicker and more intuitive understanding of the dataset’s characteristics and potential patterns.
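
The descriptive-statistics step can also be reproduced locally with pandas. This is a sketch on a toy stand-in for the tutorial's job-postings data (the column names are assumptions): it produces separate summary tables for numeric and categorical columns, with each column as a row, mirroring what the tutorial's prompt asks ChatGPT to do.

```python
import pandas as pd

# Toy stand-in for the tutorial's job-postings dataset (column names assumed)
df = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Scientist", "Data Analyst"],
    "location": ["US", "US", "UK"],
    "salary_year_avg": [90_000, 130_000, 85_000],
})

# Summarize numeric and non-numeric columns in separate tables; the
# transpose puts each original column on its own row for readability.
numeric_stats = df.describe(include=["number"]).T
categorical_stats = df.describe(include=["object"]).T

print(numeric_stats)
print(categorical_stats)
```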

    Simplifying Complex Analyses

    The tutorial showcases how ChatGPT can make complex data analysis tasks more accessible, even for users without extensive coding experience. Examples include:

    • Generating SQL Queries from Visual Data Models: The author demonstrates how ChatGPT can interpret screenshots of data models and generate SQL queries based on user prompts. This capability proves valuable for users who may not be proficient in SQL but need to extract specific information from a visually represented dataset.
    • Building and Using Machine Learning Models: The tutorial walks through a process where ChatGPT builds a machine learning model to predict salary based on user-specified input features. The author then demonstrates how to use this model within ChatGPT to obtain predictions for different scenarios. This capability empowers users to leverage the power of machine learning without writing code.

    Enhancing Efficiency and Insights

    The sources emphasize how ChatGPT’s capabilities contribute to a more efficient and insightful data analysis workflow:

    • Time Savings: The automation of tasks like generating descriptive statistics, creating visualizations, and building machine learning models significantly reduces the time required for these operations, allowing analysts to focus on higher-level tasks like interpretation and decision-making.
    • Simplified Data Exploration: ChatGPT’s ability to analyze images and provide insights from graphs and charts empowers users to quickly understand data presented visually, even if they are unfamiliar with the specific graph type. This feature promotes accessibility and facilitates faster data exploration.
    • Improved Accuracy: ChatGPT’s Advanced Data Analysis plugin leverages Python code to perform calculations, ensuring accuracy in results, as demonstrated in the tutorial’s example of calculating job applications based on complex word problems. This feature helps mitigate errors that can arise from manual calculations or estimations.

    Limitations and Workarounds

    While the sources advocate for ChatGPT’s benefits in data analysis, they also acknowledge its limitations:

    • Internet Access Restrictions: ChatGPT’s inability to directly access online data sources requires manual data downloading and uploading, potentially hindering real-time analysis or work with frequently updated data.
    • File Size Limitations: The file size constraints necessitate dataset segmentation for larger files, adding complexity to the data import process.
    • Data Security Concerns: The ambiguity regarding data security, particularly with the ChatGPT Plus plan, raises concerns about using the platform for sensitive data. The sources recommend ChatGPT Enterprise for handling such data.

    The sources mention the Notable plugin as a potential solution to the internet access and file size limitations. However, they do not provide specific details on how this plugin overcomes these challenges.

    Steps to Build a Predictive Model in ChatGPT

    The sources provide a detailed walkthrough of building a machine learning model within ChatGPT to predict yearly salary based on job-related attributes. Here’s a breakdown of the steps involved:

    1. Define the Prediction Target and Input Features:
    • Begin by clearly specifying what you want to predict (the target variable) and the factors that might influence this prediction (input features). In the source’s example, the goal is to predict yearly salary, and the chosen input features are job title, job platform, and location.
    • This step requires an understanding of the data and the relationships between variables.
    2. Prompt ChatGPT to Build the Model:
    • Use a clear and concise prompt instructing ChatGPT to create a machine learning model for the specified prediction task. Include the target variable and the input features in your prompt.
    • For example, the author used the prompt: “Build a machine learning model to predict yearly salary. Use job title, job platform, and location as inputs into this model.”
    3. Consider Model Suggestions and Choose the Best Fit:
    • ChatGPT might suggest several suitable machine learning models based on its analysis of the data and the prediction task. In the source’s example, ChatGPT recommended Random Forest, Gradient Boosting, and Linear Regression.
    • You can either select a model you’re familiar with or ask ChatGPT to recommend the most appropriate model based on the data’s characteristics. The author opted for the Random Forest model, as it handles both numerical and categorical data well and is less sensitive to outliers.
    4. Evaluate Model Performance:
    • Once ChatGPT builds the model, it will provide statistics to assess its performance. Pay attention to metrics like Root Mean Square Error (RMSE), which indicates the average difference between the model’s predictions and the actual values.
    • A lower RMSE indicates better predictive accuracy. The author’s model had an RMSE of around $22,000, meaning the predictions were, on average, off by that amount from the true yearly salaries.
    5. Test the Model with Specific Inputs:
    • To use the model for prediction, provide ChatGPT with specific values for the input features you defined earlier.
    • The author tested the model with inputs like “Data Analyst in the United States for LinkedIn job postings.” ChatGPT then outputs the predicted yearly salary based on these inputs.
    6. Validate Predictions Against External Sources:
    • It’s crucial to compare the model’s predictions against data from reliable external sources to assess its real-world accuracy. The author used Glassdoor, a website that aggregates salary information, to validate the model’s predictions for different job titles and locations.
    7. Fine-tune and Iterate (Optional):
    • Based on the model’s performance and validation results, you can refine the model further by adjusting parameters, adding more data, or trying different algorithms. ChatGPT can guide this fine-tuning process based on your feedback and desired outcomes.
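
The steps above can be sketched outside ChatGPT with scikit-learn. This is a minimal illustration on synthetic data, not the tutorial's actual dataset or pipeline; the feature names mirror the example inputs (job title, platform, location), the Random Forest matches the model chosen in the source, and the RMSE check corresponds to the evaluation step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the tutorial's job-postings data (values assumed)
rng = np.random.default_rng(0)
n = 300
titles = rng.choice(["Data Analyst", "Data Scientist", "Data Engineer"], n)
platforms = rng.choice(["LinkedIn", "Indeed"], n)
locations = rng.choice(["United States", "United Kingdom"], n)
base = {"Data Analyst": 90_000, "Data Scientist": 130_000, "Data Engineer": 120_000}
salary = np.array([base[t] for t in titles]) + rng.normal(0, 10_000, n)

X = pd.DataFrame({"job_title": titles, "job_platform": platforms, "location": locations})
y = salary

# One-hot encode the categorical inputs, then fit a Random Forest (step 3)
model = Pipeline([
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["job_title", "job_platform", "location"]),
    ])),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)

# Step 4: evaluate with RMSE (lower is better)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"RMSE: ${rmse:,.0f}")

# Step 5: predict for a specific input, as the author did for
# "Data Analyst in the United States for LinkedIn job postings"
query = pd.DataFrame([{"job_title": "Data Analyst",
                       "job_platform": "LinkedIn",
                       "location": "United States"}])
print(f"Predicted salary: ${model.predict(query)[0]:,.0f}")
```

In ChatGPT, the Advanced Data Analysis plugin writes and runs code of roughly this shape behind the scenes; the prompt-driven workflow simply hides it from the user.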

    The sources emphasize that these steps allow users to build and use predictive models within ChatGPT without writing any code. This accessibility empowers users without extensive programming knowledge to leverage machine learning for various prediction tasks.

    ChatGPT Models for Advanced Data Analysis

    The sources, primarily excerpts from a tutorial on ChatGPT for data analysis, emphasize that access to Advanced Data Analysis capabilities depends on the specific ChatGPT model and plan you are using.

    • ChatGPT Plus: This paid plan offers access to the most advanced models, including GPT-4 at the time of the tutorial’s creation. These models have built-in features like web browsing, image analysis, and most importantly, the Advanced Data Analysis functionality. To ensure you have access to this feature, you need to enable it in the “Beta features” section of your ChatGPT settings.
    • GPT-4: The tutorial highlights GPT-4 as the recommended model for data analysis tasks, as it incorporates Advanced Data Analysis alongside other features like web browsing and image generation. You can select this model when starting a new chat in ChatGPT Plus.
    • Data Analysis GPT: While the tutorial mentions a specific “Data Analysis GPT,” it notes that this model is limited to data analysis functions and lacks the additional features of GPT-4. It recommends using GPT-4 for a more comprehensive experience.
    • ChatGPT Free and GPT-3.5: The sources imply that the free version of ChatGPT and the older GPT-3.5 model do not offer the Advanced Data Analysis functionality. While they can perform basic mathematical calculations, their accuracy and reliability for complex data analysis tasks are limited.
    • ChatGPT Enterprise: This plan is geared towards organizations handling sensitive data. It offers enhanced security measures and compliance certifications, making it suitable for analyzing confidential or proprietary data. While the sources don’t explicitly state whether ChatGPT Enterprise includes Advanced Data Analysis, it’s reasonable to assume it does, given its focus on comprehensive data handling capabilities.

    The tutorial consistently stresses the importance of using ChatGPT models equipped with Advanced Data Analysis for accurate and efficient data exploration, analysis, and prediction. It showcases the power of this feature through examples like generating descriptive statistics, creating visualizations, analyzing images of data models, and building machine learning models.

    Handling Large Datasets in ChatGPT

    The sources, focusing on a tutorial for data analysis with ChatGPT, provide insights into how the platform handles large datasets for analysis, particularly within the context of its Advanced Data Analysis plugin.

    • File Size Limitations: The sources explicitly state that ChatGPT has a file size limit of 512 MB for individual files uploaded for analysis. This limitation applies even though ChatGPT can handle a total dataset size of up to 2 GB. [1, 2] This means that if you have a dataset larger than 512 MB, you cannot upload it as a single file.
    • Dataset Segmentation: To overcome the file size limitation, the sources suggest splitting large datasets into smaller files before uploading them to ChatGPT. [2] For instance, if you have a 1 GB dataset, you would need to divide it into at least two smaller files, each under 512 MB, to import and analyze it in ChatGPT. This approach allows you to work with datasets exceeding the individual file size limit while still leveraging ChatGPT’s capabilities.
    • Notable Plugin as a Potential Solution: The sources mention the Notable plugin as a potential workaround for both the internet access limitations and the file size constraints of the Advanced Data Analysis plugin. [2] However, the sources do not elaborate on how this plugin specifically addresses these challenges. Therefore, it remains unclear from the sources whether the Notable plugin allows for the analysis of datasets larger than 2 GB or enables direct connections to external data sources without manual downloading.
    • Memory and Processing Constraints: While not explicitly mentioned, it’s important to consider that even with dataset segmentation, handling extremely large datasets within ChatGPT might push the boundaries of its processing capabilities. As the dataset size grows, ChatGPT might encounter memory limitations or experience slower processing times, potentially affecting the efficiency of analysis. This aspect is not addressed in the sources, so it’s essential to be mindful of potential performance issues when working with very large datasets.
    • Alternative Solutions: The sources primarily focus on using the Advanced Data Analysis plugin within ChatGPT for data analysis. However, it’s worth noting that for handling very large datasets, alternative approaches might be more suitable. These alternatives could include using dedicated data analysis software like Python libraries (Pandas, NumPy, Scikit-learn), R, or specialized big data tools like Apache Spark, which are designed to handle massive datasets efficiently. These options are not discussed in the sources, but they represent valuable considerations for scenarios involving datasets that exceed ChatGPT’s practical handling capacity.
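
As a concrete illustration of the pandas alternative mentioned above (not something shown in the sources), a streaming aggregation can process a CSV far larger than memory by reading it in chunks. The column names here are placeholders.

```python
import pandas as pd

def mean_salary_by_title(path, chunksize=100_000):
    """Aggregate a CSV too large to load at once by streaming it in chunks.

    Column names ('job_title', 'salary') are illustrative placeholders.
    """
    sums, counts = {}, {}
    # chunksize makes read_csv yield DataFrames of at most that many rows,
    # so memory use stays bounded regardless of total file size
    for chunk in pd.read_csv(path, chunksize=chunksize):
        grouped = chunk.groupby("job_title")["salary"]
        for title, s in grouped.sum().items():
            sums[title] = sums.get(title, 0) + s
        for title, c in grouped.count().items():
            counts[title] = counts.get(title, 0) + c
    return {title: sums[title] / counts[title] for title in sums}
```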

    The sources provide a starting point for understanding how ChatGPT manages large datasets, but they leave some aspects unexplored. Further investigation into the Notable plugin’s capabilities and the potential performance implications of large datasets within ChatGPT would be beneficial.

    Understanding Context and Tasks in ChatGPT Prompting

    The sources, primarily excerpts from a ChatGPT for data analytics tutorial, provide valuable insights into how ChatGPT’s prompting system leverages context and tasks to deliver tailored and effective results.

    1. Context as Background Information:

    • The sources emphasize the importance of providing ChatGPT with relevant background information, referred to as context, to guide its responses. This context helps ChatGPT understand your perspective, expertise level, and desired output style. [1]
    • For instance, a business student specializing in finance could provide the context: “I’m a business student specializing in Finance. I’m interested in finding insights within the financial industry.” [1] This context would prime ChatGPT to generate responses aligned with the student’s knowledge domain and interests.

    2. Custom Instructions for Persistent Context:

    • Rather than repeatedly providing the same context in each prompt, ChatGPT allows users to set custom instructions that establish a persistent context for all interactions. [2]
    • These instructions are accessible through the settings menu, offering two sections: [2]
    • “What would you like ChatGPT to know about you to provide better responses?” This section focuses on providing background information about yourself, your role, and your areas of interest. [2]
    • “How would you like ChatGPT to respond?” This section guides the format, style, and tone of ChatGPT’s responses, such as requesting concise answers or liberal use of emojis. [2]

    3. Task as the Specific Action or Request:

    • The sources highlight the importance of clearly defining the task you want ChatGPT to perform. [3] This task represents the specific action, request, or question you are posing to the model.
    • For example, if you want ChatGPT to analyze a dataset, your task might be: “Perform descriptive statistics on each column, grouping numeric and non-numeric columns into separate tables.” [4, 5]

    4. The Power of Combining Context and Task:

    • The sources stress that effectively combining context and task in your prompts significantly enhances the quality and relevance of ChatGPT’s responses. [3]
    • By providing both the necessary background information and a clear instruction, you guide ChatGPT to generate outputs that are not only accurate but also tailored to your specific needs and expectations.

    5. Limitations and Considerations:

    • While custom instructions offer a convenient way to set a persistent context, it’s important to note that ChatGPT’s memory and ability to retain context across extended conversations might have limitations. The sources do not delve into these limitations. [6]
    • Additionally, users should be mindful of potential biases introduced through their chosen context. A context that is too narrow or specific might inadvertently limit ChatGPT’s ability to explore diverse perspectives or generate creative outputs. This aspect is not addressed in the sources.

    The sources provide a solid foundation for understanding how context and tasks function within ChatGPT’s prompting system. However, further exploration of potential limitations related to context retention and bias would be beneficial for users seeking to maximize the effectiveness and ethical implications of their interactions with the model.

    Context and Task Enhancement of ChatGPT Prompting

    The sources, primarily excerpts from a ChatGPT tutorial for data analytics, highlight how providing context and tasks within prompts significantly improves the quality, relevance, and effectiveness of ChatGPT’s responses.

    Context as a Guiding Framework:

    • The sources emphasize that context serves as crucial background information, helping ChatGPT understand your perspective, area of expertise, and desired output style [1]. Imagine you are asking ChatGPT to explain a concept. Providing context about your current knowledge level, like “Explain this to me as if I am a beginner in data science,” allows ChatGPT to tailor its response accordingly, using simpler language and avoiding overly technical jargon.
    • A well-defined context guides ChatGPT to generate responses that are more aligned with your needs and expectations. For instance, a financial analyst using ChatGPT might provide the context: “I am a financial analyst working on a market research report.” This background information would prime ChatGPT to provide insights and analysis relevant to the financial domain, potentially suggesting relevant metrics, industry trends, or competitor analysis.

    Custom Instructions for Setting the Stage:

    • ChatGPT offers a feature called custom instructions to establish a persistent context that applies to all your interactions with the model [2]. You can access these instructions through the settings menu, where you can provide detailed information about yourself and how you want ChatGPT to respond. Think of custom instructions as setting the stage for your conversation with ChatGPT. You can specify your role, areas of expertise, preferred communication style, and any other relevant details that might influence the interaction.
    • Custom instructions are particularly beneficial for users who frequently engage with ChatGPT for specific tasks or within a particular domain. For example, a data scientist regularly using ChatGPT for model building could set custom instructions outlining their preferred coding language (Python or R), their level of expertise in machine learning, and their typical project goals. This would streamline the interaction, as ChatGPT would already have a baseline understanding of the user’s needs and preferences.

    Task as the Specific Action or Request:

    • The sources stress that clearly stating the task is essential for directing ChatGPT’s actions [3]. The task represents the specific action, question, or request you are presenting to the model.
    • Providing a well-defined task ensures that ChatGPT focuses on the desired outcome. For instance, instead of a vague prompt like “Tell me about data analysis,” you could provide a clear task like: “Create a Python code snippet to calculate the mean, median, and standard deviation of a list of numbers.” This specific task leaves no room for ambiguity and directs ChatGPT to produce a targeted output.
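
The example task above is small enough to show what its output would look like. A straightforward version using Python's standard `statistics` module (the function name `summarize` is our own, not from the source):

```python
from statistics import mean, median, stdev

def summarize(numbers):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": mean(numbers),
        "median": median(numbers),
        "stdev": stdev(numbers),  # sample (n-1) standard deviation
    }

print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
```

A prompt this specific leaves ChatGPT no room for ambiguity about language, inputs, or which statistics to compute.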

    The Synergy of Context and Task:

    • The sources highlight the synergistic relationship between context and task, emphasizing that combining both elements in your prompts significantly improves ChatGPT’s performance [3].
    • By setting the stage with context and providing clear instructions with the task, you guide ChatGPT to deliver more accurate, relevant, and tailored responses. For example, imagine you are a marketing manager using ChatGPT to analyze customer feedback data. Your context might be: “I am a marketing manager looking to understand customer sentiment towards our latest product launch.” Your task could then be: “Analyze this set of customer reviews and identify the key themes and sentiment trends.” This combination of context and task allows ChatGPT to understand your role, your objective, and the specific action you require, leading to a more insightful and actionable analysis.

    Beyond the Sources: Additional Considerations

    It is important to note that while the sources provide valuable insights, they do not address potential limitations related to context retention and bias in ChatGPT. Further exploration of these aspects is essential for users seeking to maximize the effectiveness and ethical implications of their interactions with the model.

    Leveraging Custom Instructions in the ChatGPT Tutorial

    The sources, primarily excerpts from a data analytics tutorial using ChatGPT, illustrate how the tutorial effectively utilizes custom instructions to enhance the learning experience and guide ChatGPT to generate more relevant responses.

    1. Defining User Persona for Context:

    • The tutorial encourages users to establish a clear context by defining a user persona that reflects their role, area of expertise, and interests. This persona helps ChatGPT understand the user’s perspective and tailor responses accordingly.
    • For instance, the tutorial provides an example of a YouTuber creating content for data enthusiasts, using the custom instruction: “I’m a YouTuber that makes entertaining videos for those that work with data, AKA data nerds. Give me concise answers and ignore all the niceties that OpenAI programmed you with. Use emojis liberally; use them to convey emotion or at the beginning of any bullet point.” This custom instruction establishes a specific context, signaling ChatGPT to provide concise, engaging responses with a touch of humor, suitable for a YouTube audience interested in data.

    2. Shaping Response Style and Format:

    • Custom instructions go beyond simply providing background information; they also allow users to shape the style, format, and tone of ChatGPT’s responses.
    • The tutorial demonstrates how users can request specific formatting, such as using tables for presenting data or incorporating emojis to enhance visual appeal. For example, the tutorial guides users to request descriptive statistics in a table format, making it easier to interpret the data: “Perform descriptive statistics on each column, grouping numeric and non-numeric columns, such as categorical columns, into separate tables with each column as a row.”
    • This level of customization empowers users to tailor ChatGPT’s output to their preferences, whether they prefer concise bullet points, detailed explanations, or creative writing styles.

    3. Streamlining Interactions for Specific Use Cases:

    • By establishing a persistent context through custom instructions, the tutorial demonstrates how to streamline interactions with ChatGPT, particularly for users engaging with the model for specific tasks or within a particular domain.
    • Imagine a marketing professional consistently using ChatGPT for analyzing customer sentiment. By setting custom instructions that state their role and objectives, such as “I am a marketing manager focused on understanding customer feedback to improve product development,” they provide ChatGPT with valuable background information.
    • This pre-defined context eliminates the need to repeatedly provide the same information in each prompt, allowing for more efficient and focused interactions with ChatGPT.

    4. Guiding Data Analysis with Context:

    • The tutorial showcases how custom instructions play a crucial role in guiding data analysis within ChatGPT. By setting context about the user’s data analysis goals and preferences, ChatGPT can generate more relevant insights and visualizations.
    • For instance, when analyzing salary data, a user might specify in their custom instructions that they are primarily interested in comparing salaries across different job titles within the data science field. This context would inform ChatGPT’s analysis, prompting it to focus on relevant comparisons and provide visualizations tailored to the user’s specific interests.

    5. Limitations Not Explicitly Addressed:

    While the tutorial effectively demonstrates the benefits of using custom instructions, it does not explicitly address potential limitations related to context retention and bias. Users should be mindful that ChatGPT’s ability to retain context over extended conversations might have limitations, and custom instructions, if too narrow or biased, could inadvertently limit the model’s ability to explore diverse perspectives. These aspects, while not mentioned in the sources, are essential considerations for responsible and effective use of ChatGPT.

    Comparing ChatGPT Access Options: Plus vs. Enterprise

    The sources, focusing on a ChatGPT data analytics tutorial, primarily discuss the ChatGPT Plus plan and briefly introduce the ChatGPT Enterprise edition, highlighting their key distinctions regarding features, data security, and target users.

    ChatGPT Plus:

    • This plan represents the most common option for individuals, including freelancers, contractors, job seekers, and even some employees within companies. [1]
    • It offers access to the latest and most capable language model, which, at the time of the tutorial, was GPT-4. This model includes features like web browsing, image generation with DALL-E, and the crucial Advanced Data Analysis plugin central to the tutorial’s content. [2, 3]
    • ChatGPT Plus costs approximately $20 per month in the United States, granting users faster response speeds, access to plugins, and the Advanced Data Analysis functionality. [2, 4]
    • However, the sources raise concerns about the security of sensitive data when using ChatGPT Plus. They suggest that even with chat history disabled, it’s unclear whether data remains confidential and protected from potential misuse. [5, 6]
    • The tutorial advises against uploading proprietary, confidential, or HIPAA-protected data to ChatGPT Plus, recommending the Enterprise edition for such sensitive information. [5, 6]

    ChatGPT Enterprise:

    • Unlike the Plus plan, which caters to individuals, ChatGPT Enterprise targets companies and organizations concerned about data security. [4]
    • It operates through a separate service, with companies paying for access, and their employees subsequently utilizing the platform. [4]
    • ChatGPT Enterprise specifically addresses the challenges of working with secure data, including HIPAA-protected, confidential, and proprietary information. [7]
    • It ensures data security by not using any information for training and maintaining strict confidentiality. [7]
    • The sources emphasize that ChatGPT Enterprise complies with SOC 2, a security compliance standard followed by major cloud providers, indicating a higher level of data protection compared to the Plus plan. [5, 8]
    • While the sources don’t explicitly state the pricing for ChatGPT Enterprise, it’s safe to assume that it differs from the individual-focused Plus plan and likely involves organizational subscriptions.

    The sources primarily concentrate on ChatGPT Plus due to its relevance to the data analytics tutorial, offering detailed explanations of its features and limitations. ChatGPT Enterprise receives a more cursory treatment, primarily focusing on its enhanced data security aspects. The sources suggest that ChatGPT Enterprise, with its robust security measures, serves as a more suitable option for organizations dealing with sensitive information compared to the individual-oriented ChatGPT Plus plan.

    Page-by-Page Summary of “622-ChatGPT for Data Analytics Beginner Tutorial.pdf” Excerpts

    The sources provide excerpts from what appears to be the transcript of a data analytics tutorial video, likely hosted on YouTube. The tutorial focuses on using ChatGPT, particularly the Advanced Data Analysis plugin, to perform various data analysis tasks, ranging from basic data exploration to predictive modeling.

    Page 1:

    • This page primarily contains the title of the tutorial: “ChatGPT for Data Analytics Beginner Tutorial.”
    • It also includes links to external resources, specifically a transcript tool (https://anthiago.com/transcript/) and a YouTube video link. However, the complete YouTube link is truncated in the source.
    • The beginning of the transcript suggests that the tutorial is intended for a data-focused audience (“data nerds”), promising insights into how ChatGPT can automate data analysis tasks, saving time and effort.

    Page 2:

    • This page outlines the two main sections of the tutorial:
    • Basics of ChatGPT: This section covers fundamental aspects like understanding ChatGPT options (Plus vs. Enterprise), setting up ChatGPT Plus, best practices for prompting, and even utilizing ChatGPT’s image analysis capabilities to interpret graphs.
    • Advanced Data Analysis: This section focuses on the Advanced Data Analysis plugin, demonstrating how to write and read code without manual coding, covering steps in the data analysis pipeline from data import and exploration to cleaning, visualization, and even basic machine learning for prediction.

    Page 3:

    • This page reinforces the beginner-friendly nature of the tutorial, assuring users that no prior experience in data analysis or coding is required. It reiterates that the tutorial content can be applied to create a showcaseable data analytics project using ChatGPT.
    • It also mentions that the tutorial video is part of a larger course on ChatGPT for data analytics, highlighting the course’s offerings:
    • Over 6 hours of video content
    • Step-by-step exercises
    • Capstone project
    • Certificate of completion
    • Interested users can find more details about the course at a specific timestamp in the video or through a link in the description.

    Page 4:

    • This page emphasizes the availability of supporting resources, including:
    • The dataset used for the project
    • Chat history transcripts to follow along with the tutorial
    • It then transitions to discussing the options for accessing and using ChatGPT, introducing the ChatGPT Plus plan as the preferred choice for the tutorial.

    Page 5:

    • This page focuses on setting up ChatGPT Plus, providing step-by-step instructions:
    1. Go to openai.com and select “Try ChatGPT.”
    2. Sign up using a preferred method (e.g., Google credentials).
    3. Verify your email address.
    4. Accept terms and conditions.
    5. Upgrade to the Plus plan (costing $20 per month at the time of the tutorial) to access GPT-4 and its advanced capabilities.

    Page 6:

    • This page details the payment process for ChatGPT Plus, requiring credit card information for the $20 monthly subscription. It reiterates the necessity of ChatGPT Plus for the tutorial due to its inclusion of GPT-4 and its advanced features.
    • It instructs users to select the GPT-4 model within ChatGPT, as it includes the browsing and analysis capabilities essential for the course.
    • It suggests bookmarking chat.openai.com for easy access.

    Page 7:

    • This page introduces the layout and functionality of ChatGPT, acknowledging a recent layout change in November 2023. It assures users that potential discrepancies between the tutorial’s interface and the current ChatGPT version should not cause concern, as the core functionality remains consistent.
    • It describes the main elements of the ChatGPT interface:
    • Sidebar: Contains GPT options, chat history, referral link, and settings.
    • Chat Area: The space for interacting with the GPT model.

    Page 8:

    • This page continues exploring the ChatGPT interface:
    • GPT Options: Allows users to choose between different GPT models (e.g., GPT-4, GPT-3.5) and explore custom-built models for specific functions. The tutorial highlights a custom-built “data analytics” GPT model linked in the course exercises.
    • Chat History: Lists previous conversations, allowing users to revisit and rename them.
    • Settings: Provides options for theme customization, data controls, and enabling beta features like plugins and Advanced Data Analysis.

    Page 9:

    • This page focuses on interacting with ChatGPT through prompts, providing examples and tips:
    • It demonstrates a basic prompt (“Who are you and what can you do?”) to understand ChatGPT’s capabilities and limitations.
    • It highlights features like copying, liking/disliking responses, and regenerating responses for different perspectives.
    • It emphasizes the “Share” icon for creating shareable links to ChatGPT outputs.
    • It encourages users to learn keyboard shortcuts for efficiency.

    Page 10:

    • This page transitions to a basic exercise for users to practice prompting:
    • Users are instructed to prompt ChatGPT with questions similar to “Who are you and what can you do?” to explore its capabilities.
    • They are also tasked with loading the custom-built “data analytics” GPT model into their menu for quizzing themselves on course content.

    Page 11:

    • This page dives into basic prompting techniques and the importance of understanding prompts’ structure:
    • It emphasizes that ChatGPT’s knowledge is limited to a specific cutoff date (April 2023 in this case).
    • It illustrates the “hallucination” phenomenon where ChatGPT might provide inaccurate or fabricated information when it lacks knowledge.
    • It demonstrates how to guide ChatGPT to use specific features, like web browsing, to overcome knowledge limitations.
    • It introduces the concept of a “prompt” as a message or instruction guiding ChatGPT’s response.

    Page 12:

    • This page continues exploring prompts, focusing on the components of effective prompting:
    • It breaks down prompts into two parts: context and task.
    • Context provides background information, like the user’s role or perspective.
    • Task specifies what the user wants ChatGPT to do.
    • It emphasizes the importance of providing both context and task in prompts to obtain desired results.

    Page 13:

    • This page introduces custom instructions as a way to establish persistent context for ChatGPT, eliminating the need to repeatedly provide background information in each prompt.
    • It provides an example of custom instructions tailored for a YouTuber creating data-focused content, highlighting the desired response style: concise, engaging, and emoji-rich.
    • It explains how to access and set up custom instructions in ChatGPT’s settings.

    Page 14:

    • This page details the two dialogue boxes within custom instructions:
    • “What would you like ChatGPT to know about you to provide better responses?” This box is meant for context information, defining the user persona and relevant background.
    • “How would you like ChatGPT to respond?” This box focuses on desired response style, including formatting, tone, and language.
    • It emphasizes enabling the “Enabled for new chats” option to ensure custom instructions apply to all new conversations.

    Page 15:

    • This page covers additional ChatGPT settings:
    • “Settings and Beta” tab:
    • Theme: Allows switching between dark and light mode.
    • Beta Features: Enables access to new features being tested, specifically recommending enabling plugins and Advanced Data Analysis for the tutorial.
    • “Data Controls” tab:
    • Chat History and Training: Controls whether user conversations are used to train ChatGPT models. Disabling this option prevents data from being used for training but limits chat history storage to 30 days.
    • Security Concerns: Discusses the limitations of data security in ChatGPT Plus, particularly for sensitive data, and recommends ChatGPT Enterprise for enhanced security and compliance.

    Page 16:

    • This page introduces ChatGPT’s image analysis capabilities, highlighting its relevance to data analytics:
    • It explains that GPT-4, the most advanced model at the time of the tutorial, allows users to upload images for analysis. This feature is not available in older models like GPT-3.5.
    • It emphasizes that image analysis goes beyond analyzing pictures, extending to interpreting graphs and visualizations relevant to data analysis tasks.

    Page 17:

    • This page demonstrates using image analysis to interpret graphs:
    • It shows an example where ChatGPT analyzes a Python code snippet from a screenshot.
    • It then illustrates a case where ChatGPT initially fails to interpret a bar chart directly from the image, requiring the user to explicitly instruct it to view and analyze the uploaded graph.
    • This example highlights the need to be specific in prompts and sometimes explicitly guide ChatGPT to use its image analysis capabilities effectively.

    Page 18:

    • This page provides a more practical data analytics use case for image analysis:
    • It presents a complex bar chart visualization depicting top skills for different data science roles.
    • By uploading the image, ChatGPT analyzes the graph, identifying patterns and relationships between skills across various roles, saving the user considerable time and effort.

    Page 19:

    • This page further explores the applications of image analysis in data analytics:
    • It showcases how ChatGPT can interpret graphs that users might find unfamiliar or challenging to understand, such as a box plot representing data science salaries.
    • It provides an example where ChatGPT explains the box plot using a simple analogy, making it easier for users to grasp the concept.
    • It extends image analysis beyond visualizations to interpreting data models, such as a data model screenshot from Power BI, demonstrating how ChatGPT can generate SQL queries based on the model’s structure.
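The quantities a box plot encodes can be computed directly, which is essentially the explanation ChatGPT gives. A minimal sketch with made-up salary figures (not the tutorial's dataset):

```python
import numpy as np

# Hypothetical salary sample, invented for illustration
salaries = np.array([70_000, 85_000, 90_000, 95_000, 100_000, 110_000, 150_000])

# A box plot draws the box from Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(salaries, [25, 50, 75])

# Whiskers typically extend 1.5 * IQR past the box; points beyond are outliers
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
```

Here the $150,000 value lies above the upper fence, so a box plot would draw it as an outlier point rather than extending the whisker to it.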

    Page 20:

    • This page concludes the image analysis section with an exercise for users to practice:
    • It encourages users to upload various images, including graphs and data models, provided below the text (though the images themselves are not included in the source).
    • Users are encouraged to explore ChatGPT’s capabilities in analyzing and interpreting visual data representations.

    Page 21:

    • This page marks a transition point, highlighting the upcoming section on the Advanced Data Analysis plugin. It also promotes the full data analytics course, emphasizing its more comprehensive coverage compared to the tutorial video.
    • It reiterates the benefits of using ChatGPT for data analysis, claiming potential time savings of up to 20 hours per week.

    Page 22:

    • This page begins a deeper dive into the Advanced Data Analysis plugin, starting with a note about potential timeout issues:
    • It explains that because the plugin allows file uploads, the environment where Python code executes and files are stored might time out, leading to a warning message.
    • It assures users that this timeout issue can be resolved by re-uploading the relevant file, as ChatGPT retains previous analysis and picks up where it left off.

    Page 23:

    • This page officially introduces the chapter on the Advanced Data Analysis plugin, outlining a typical workflow using the plugin:
    • It focuses on analyzing a dataset of data science job postings, covering steps like data import, exploration, cleaning, basic statistical analysis, visualization, and even machine learning for salary prediction.
    • It reminds users to check for supporting resources like the dataset, prompts, and chat history transcripts provided below the video.
    • It acknowledges that ChatGPT, at the time, couldn’t share images directly, so users wouldn’t see generated graphs in the shared transcripts, but they could still review the prompts and textual responses.

    Page 24:

    • This page begins a comparison between using ChatGPT with and without the Advanced Data Analysis plugin, aiming to showcase the plugin’s value.
    • It clarifies that the plugin was previously a separate feature but is now integrated directly into the GPT-4 model, accessible alongside web browsing and DALL-E.
    • It reiterates the importance of setting up custom instructions to provide context for ChatGPT, ensuring relevant responses.

    Page 25:

    • This page continues the comparison, starting with GPT-3.5 (without the Advanced Data Analysis plugin):
    • It presents a simple word problem involving basic math calculations, which GPT-3.5 successfully solves.
    • It then introduces a more complex word problem with larger numbers. While GPT-3.5 attempts to solve it, it produces an inaccurate result, highlighting the limitations of the base model for precise numerical calculations.

    Page 26:

    • This page explains the reason behind GPT-3.5’s inaccuracy in the complex word problem:
    • It describes large language models like GPT-3.5 as being adept at predicting the next word in a sentence, showcasing this with the “Jack and Jill” nursery rhyme example and a simple math equation (2 + 2 = 4).
    • It concludes that GPT-3.5, lacking the Advanced Data Analysis plugin, relies on its general knowledge and pattern recognition to solve math problems, leading to potential inaccuracies in complex scenarios.
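The contrast is easy to demonstrate: Python integers are arbitrary precision, so once a calculation is routed through executed code instead of next-word prediction, large arithmetic comes back exact. The numbers below are illustrative, not the tutorial's word problem:

```python
# A large multiplication that a next-word predictor tends to approximate;
# executed Python returns it exactly, digit for digit (ints never overflow or round)
a = 123_456_789
b = 987_654_321
total = a * b
```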

    Page 27:

    • This page transitions to using ChatGPT with the Advanced Data Analysis plugin, explaining how to enable it:
    • It instructs users to ensure the “Advanced Data Analysis” option is turned on in the Beta Features settings.
    • It highlights two ways to access the plugin:
    • Selecting the GPT-4 model within ChatGPT, which includes browsing, DALL-E, and analysis capabilities.
    • Using the dedicated “Data Analysis” GPT model, which focuses solely on data analysis functionality. The tutorial recommends the GPT-4 model for its broader capabilities.

    Page 28:

    • This page demonstrates the accuracy of the Advanced Data Analysis plugin:
    • It presents the same complex word problem that GPT-3.5 failed to solve accurately.
    • This time, using the plugin, ChatGPT provides the correct answer, showcasing its precision in numerical calculations.
    • It explains how users can “View Analysis” to see the Python code executed by the plugin, providing transparency and allowing for code inspection.

    Page 29:

    • This page explores the capabilities of the Advanced Data Analysis plugin, listing various data analysis tasks it can perform:
    • Data analysis, statistical analysis, data processing, predictive modeling, data interpretation, custom queries.
    • It concludes with an exercise for users to practice:
    • Users are instructed to prompt ChatGPT with the same question (“What can you do with this feature?”) to explore the plugin’s capabilities.
    • They are also tasked with asking ChatGPT about the types of files it can import for analysis.

    Page 30:

    • This page focuses on connecting to data sources, specifically importing a dataset for analysis:
    • It reminds users of the exercise to inquire about supported file types. It mentions that ChatGPT initially provided a limited list (CSV, Excel, JSON) but, after a more specific prompt, revealed a wider range of supported formats, including database files, SPSS, SAS, and HTML.
    • It introduces a dataset of data analyst job postings hosted on Kaggle, a platform for datasets, encouraging users to download it.

    Page 31:

    • This page guides users through uploading and initially exploring the downloaded dataset:
    • It instructs users to upload the ZIP file directly to ChatGPT without providing specific instructions.
    • ChatGPT successfully identifies the ZIP file, extracts its contents (a CSV file), and prompts the user for the next steps in data analysis.
    • The tutorial then demonstrates a prompt asking ChatGPT to provide details about the dataset, specifically a brief description of each column.
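Behind the scenes this is only a couple of lines of pandas. A sketch using an in-memory stand-in for the downloaded Kaggle archive (the filename and columns are hypothetical):

```python
import io
import zipfile

import pandas as pd

# Build a tiny ZIP in memory to stand in for the downloaded Kaggle archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("jobs.csv", "title,company_name\nData Analyst,Acme\n")
buf.seek(0)

# pandas can read a single-CSV ZIP archive directly, no manual extraction needed
df = pd.read_csv(buf, compression="zip")
```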

    Page 32:

    • This page continues exploring the dataset, focusing on understanding its columns:
    • ChatGPT provides a list of columns with brief descriptions, highlighting key information contained in the dataset, such as company name, location, job description, and various salary-related columns.
    • It concludes with an exercise for users to practice:
    • Users are instructed to download the dataset from Kaggle, upload it to ChatGPT, and explore the columns and their descriptions.
    • The tutorial hints at upcoming analysis using descriptive statistics.

    Page 33:

    • This page starts exploring the dataset through descriptive statistics:
    • It demonstrates a basic prompt asking ChatGPT to “perform descriptive statistics on each column.”
    • It explains the concept of descriptive statistics, including count, mean, standard deviation, minimum, maximum for numerical columns, and unique value counts and top frequencies for categorical columns.

    Page 34:

    • This page continues with descriptive statistics, highlighting the need for prompt refinement to achieve desired formatting:
    • It notes that ChatGPT initially struggles to provide descriptive statistics for the entire dataset, suggesting a need for analysis in smaller parts.
    • The tutorial then refines the prompt, requesting ChatGPT to group numeric and non-numeric columns into separate tables, with each column as a row, resulting in a more organized and interpretable output.
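The refined prompt maps onto standard pandas calls. A sketch with a hypothetical three-row stand-in for the dataset (column names are guesses at its schema):

```python
import pandas as pd

# Hypothetical miniature of the job-postings dataset
df = pd.DataFrame({
    "title": ["Data Analyst", "Senior Data Analyst", "Data Analyst"],
    "via": ["via LinkedIn", "via Indeed", "via LinkedIn"],
    "salary_year": [90_000.0, 120_000.0, None],
})

# Numeric columns: count, mean, std, min, max... transposed so each column is a row
numeric_stats = df.select_dtypes(include="number").describe().T

# Non-numeric columns: count, unique values, top value, and its frequency
categorical_stats = df.select_dtypes(exclude="number").describe().T
```

Splitting by dtype is what produces the two separate tables the prompt asks for; `describe()` alone would otherwise mix incompatible summaries.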

    Page 35:

    • This page presents the results of the refined descriptive statistics prompt:
    • It showcases tables for both numerical and non-numerical columns, allowing for a clear view of statistical summaries.
    • It points out specific insights, such as the missing values in the salary column, highlighting potential data quality issues.

    Page 36:

    • This page transitions from descriptive statistics to exploratory data analysis (EDA), focusing on visualizing the dataset:
    • It introduces EDA as a way to visually represent descriptive statistics through graphs like histograms and bar charts.
    • It demonstrates a prompt asking ChatGPT to perform EDA, providing appropriate visualizations for each column, such as using histograms for numerical columns.
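In pandas terms, the bar charts for categorical columns boil down to `value_counts`. A minimal sketch (the data is invented for illustration):

```python
import pandas as pd

# Invented platform column standing in for the dataset's job-platform field
df = pd.DataFrame({"job_platform": ["LinkedIn"] * 3 + ["Indeed"] * 2 + ["ZipRecruiter"]})

# Frequency of each platform, sorted most common first
counts = df["job_platform"].value_counts()

# counts.plot(kind="bar") would render the kind of bar chart the plugin produces
```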

    Page 37:

    • This page showcases the results of the EDA prompt, presenting various visualizations generated by ChatGPT:
    • It highlights bar charts depicting distributions for job titles, companies, locations, and job platforms.
    • It points out interesting insights, like the dominance of LinkedIn as a job posting platform and the prevalence of “Anywhere” and “United States” as job locations.

    Page 38:

    • This page concludes the EDA section with an exercise for users to practice:
    • It encourages users to replicate the descriptive statistics and EDA steps, requesting them to explore the dataset further and familiarize themselves with its content.
    • It hints at the next video focusing on data cleaning before proceeding with further visualization.

    Page 39:

    • This page focuses on data cleanup, using insights from previous descriptive statistics and EDA to identify columns requiring attention:
    • It mentions two specific columns for cleanup:
    • “Job Location”: Contains inconsistent spacing, requiring removal of unnecessary spaces for better categorization.
    • “Via”: Requires removing the prefix “Via ” and renaming the column to “Job Platform” for clarity.

    Page 40:

    • This page demonstrates ChatGPT performing the data cleanup tasks:
    • It shows ChatGPT successfully removing unnecessary spaces from the “Job Location” column, presenting an updated bar chart reflecting the cleaned data.
    • It also illustrates ChatGPT removing the “Via ” prefix and renaming the column to “Job Platform” as instructed.
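Under the hood, both cleanups are single pandas string operations. A sketch on invented rows (the column names are assumptions about the dataset's schema):

```python
import pandas as pd

# Invented rows exhibiting the two problems the tutorial identifies
df = pd.DataFrame({
    "job_location": [" United States", "Anywhere ", "  New York"],
    "via": ["via LinkedIn", "via Indeed", "via ZipRecruiter"],
})

# Strip stray whitespace so identical locations group together
df["job_location"] = df["job_location"].str.strip()

# Drop the leading "via " prefix, then rename the column for clarity
df["via"] = df["via"].str.replace(r"^via\s+", "", regex=True)
df = df.rename(columns={"via": "job_platform"})
```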

    Page 41:

    • This page concludes the data cleanup section with an exercise for users to practice:
    • It instructs users to clean up the “Job Platform” and “Job Location” columns as demonstrated.
    • It encourages exploring and cleaning other columns as needed based on previous analyses.
    • It hints at the next video diving into more complex visualizations.

    Page 42:

    • This page begins exploring more complex visualizations, specifically focusing on the salary data and its relationship to other columns:
    • It reminds users of the previously cleaned “Job Location” and “Job Platform” columns, emphasizing their relevance to the upcoming analysis.
    • It revisits the descriptive statistics for salary data, describing various salary-related columns (average, minimum, maximum, hourly, yearly, standardized) and explaining the concept of standardized salary.

    Page 43:

    • This page continues analyzing salary data, focusing on the “Salary Yearly” column:
    • It presents a histogram showing the distribution of yearly salaries, noting the expected range for data analyst roles.
    • It briefly explains the “Hourly” and “Standardized Salary” columns, but emphasizes that the focus for the current analysis will be on “Salary Yearly.”

    Page 44:

    • This page demonstrates visualizing salary data in relation to job platforms, highlighting the importance of clear and specific prompting:
    • It showcases a bar chart depicting average yearly salaries for the top 10 job platforms. However, it notes that the visualization is not what the user intended, as it shows the platforms with the highest average salaries, not the 10 most common platforms.
    • This example emphasizes the need for careful wording in prompts to avoid misinterpretations by ChatGPT.

    Page 45:

    • This page corrects the previous visualization by refining the prompt, emphasizing the importance of clarity:
    • It demonstrates a revised prompt explicitly requesting the average salaries for the 10 most common job platforms, resulting in the desired visualization.
    • It discusses insights from the corrected visualization, noting the absence of freelance platforms (Upwork, BB) due to their focus on hourly rates and highlighting the relatively high average salary for “AI Jobs.net.”
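The distinction the corrected prompt draws (most common platforms vs. highest-paying platforms) corresponds to ordering by frequency before averaging. A sketch on invented data:

```python
import pandas as pd

# Invented postings; the real dataset's column names may differ
df = pd.DataFrame({
    "job_platform": ["LinkedIn", "LinkedIn", "Indeed", "AI Jobs.net", "Indeed", "LinkedIn"],
    "salary_year": [90_000, 100_000, 85_000, 150_000, 95_000, 110_000],
})

# Pick the 10 most COMMON platforms first (frequency, not salary, decides)...
top_platforms = df["job_platform"].value_counts().head(10).index

# ...then average yearly salary only within that set, keeping frequency order
avg_salary = (
    df[df["job_platform"].isin(top_platforms)]
    .groupby("job_platform")["salary_year"]
    .mean()
    .reindex(top_platforms)
)
```

Sorting by mean salary instead of frequency would reproduce the misreading the tutorial warns about.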

    Page 46:

    • This page concludes the visualization section with an exercise for users to practice:
    • It instructs users to replicate the analysis for job platforms, visualizing average salaries for the top 10 most common platforms.
    • It extends the exercise to include similar visualizations for job titles and locations, encouraging exploration of salary patterns across these categories.

    Page 47:

    • This page recaps the visualizations created in the previous exercise, highlighting key insights:
    • It discusses the bar charts for job titles and locations, noting the expected salary trends for different data analyst roles and observing the concentration of high-paying locations in specific states (Kansas, Oklahoma, Missouri).

    Page 48:

    • This page transitions to the concept of predicting data, specifically focusing on machine learning to predict salary:
    • It acknowledges the limitations of previous visualizations in exploring multiple conditions simultaneously (e.g., analyzing salary based on both location and job title) and introduces machine learning as a solution.
    • It demonstrates a prompt asking ChatGPT to build a machine learning model to predict yearly salary using job title, platform, and location as inputs, requesting model suggestions.

    Page 49:

    • This page discusses the model suggestions provided by ChatGPT:
    • It lists three models: Random Forest, Gradient Boosting, and Linear Regression.
    • It then prompts ChatGPT to recommend the most suitable model for the dataset.

    Page 50:

    • This page reveals ChatGPT’s recommendation, emphasizing the reasoning behind it:
    • ChatGPT suggests Random Forest as the best model, explaining its advantages: it handles both numerical and categorical data and is robust to outliers (relevant for salary data).
    • The tutorial proceeds with building the Random Forest model.

    Page 51:

    • This page presents the results of the built Random Forest model:
    • It provides statistics related to model errors, highlighting the root mean squared error (RMSE) of around $22,000.
    • It explains the meaning of RMSE, indicating that the model’s predictions are, on average, off by about $22,000 from the actual yearly salary.
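A minimal sketch of the kind of pipeline the plugin builds here: one-hot encode the three categorical inputs, fit a Random Forest, and report RMSE on held-out rows. The data below is synthetic and the column names are assumptions, not the tutorial's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic postings: senior roles pay ~$30k more, plus noise
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "title": rng.choice(["Data Analyst", "Senior Data Analyst"], n),
    "location": rng.choice(["United States", "Anywhere"], n),
    "platform": rng.choice(["LinkedIn", "Indeed"], n),
})
y = np.where(X["title"] == "Senior Data Analyst", 120_000, 90_000) + rng.normal(0, 10_000, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"),
          ["title", "location", "platform"])])),
    ("forest", RandomForestRegressor(random_state=0)),
])
model.fit(X_train, y_train)

# RMSE: roughly, the average distance between predicted and actual yearly salary
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```

Testing the model with new inputs is then just `model.predict` on a one-row DataFrame, which mirrors the tutorial's later Data Analyst vs. Senior Data Analyst comparison.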

    Page 52:

    • This page focuses on testing the built model within ChatGPT:
    • It instructs users on how to provide inputs to the model (location, title, platform) for salary prediction.
    • It demonstrates an example predicting the salary for a “Data Analyst” in the United States using LinkedIn, resulting in a prediction of around $94,000.

    Page 53:

    • This page compares the model’s prediction to external salary data from Glassdoor:
    • It shows that the predicted salary of $94,000 is within the expected range based on Glassdoor data (around $80,000), suggesting reasonable accuracy.
    • It then predicts the salary for a “Senior Data Analyst” using the same location and platform, resulting in a higher prediction of $117,000, which aligns with the expected salary trend for senior roles.

    Page 54:

    • This page further validates the model’s prediction for “Senior Data Analyst”:
    • It shows that the predicted salary of $117,000 is very close to the Glassdoor data for Senior Data Analysts (around $121,000), highlighting the model’s accuracy for this role.
    • It discusses the observation that the model’s prediction for “Data Analyst” might be less accurate due to potential inconsistencies in job title classifications, with some “Data Analyst” roles likely including senior-level responsibilities, skewing the data.

    Page 55:

    • This page concludes the machine learning section with an exercise for users to practice:
    • It encourages users to replicate the model building and testing process, allowing them to use the same attributes (location, title, platform) or explore different inputs.
    • It suggests comparing model predictions to external salary data sources like Glassdoor to assess accuracy.

    Page 56:

    • This page summarizes the entire data analytics pipeline covered in the chapter, emphasizing its comprehensiveness and the lack of manual coding required:
    • It lists the steps: data collection, EDA, cleaning, analysis, model building for prediction.
    • It highlights the potential of using this project as a portfolio piece to demonstrate data analysis skills using ChatGPT.

    Page 57:

    • This page emphasizes the practical value and time-saving benefits of using ChatGPT for data analysis:
    • It shares the author’s personal experience, mentioning how tasks that previously took a whole day can now be completed in minutes using ChatGPT.
    • It clarifies that the techniques demonstrated are particularly suitable for ad hoc analysis (quick explorations of datasets). For more complex or ongoing analyses, the tutorial recommends using other ChatGPT plugins, hinting at upcoming chapters covering these tools.

    Page 58:

    • This page transitions to discussing limitations of the Advanced Data Analysis plugin, noting that these limitations might be addressed in the future, rendering this section obsolete.
    • It outlines three main limitations:
    • Internet access: The plugin cannot connect directly to online data sources (databases, APIs, cloud spreadsheets) due to security reasons, requiring users to download data manually.
    • File size: Individual files uploaded to the plugin are limited to 512 MB, even though the total dataset size limit is 2 GB. This restriction necessitates splitting large datasets into smaller files.
    • Data security: Concerns about the confidentiality of sensitive data persist, even with chat history disabled. While the tutorial previously recommended ChatGPT Enterprise for secure data, it acknowledges the limitations of ChatGPT Plus for handling such information.
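The file-size workaround mentioned above (splitting a large dataset into files under the 512 MB per-file limit) is straightforward to do locally before uploading. A minimal pandas sketch, where the file names and chunk size are illustrative rather than from the tutorial:

```python
import pandas as pd

def split_csv(path, out_prefix, rows_per_file=500_000):
    """Split a large CSV into numbered parts small enough to upload."""
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_file)):
        chunk.to_csv(f"{out_prefix}_part{i}.csv", index=False)

# split_csv("job_postings.csv", "job_postings")  # hypothetical file name
```

Choosing rows per file rather than bytes keeps the code simple; in practice you would pick a row count that keeps each part comfortably under 512 MB.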

    Page 59:

    • This page continues discussing the limitations, focusing on potential workarounds:
    • It mentions the Notable plugin as a potential solution for both internet access and file size limitations, but without providing details on its capabilities.
    • It reiterates the data security concerns, advising against uploading sensitive data to ChatGPT Plus and highlighting ChatGPT Enterprise as a more secure option.

    Page 60:

    • This page provides a more detailed explanation of the data security concerns:
    • It reminds users about the option to disable chat history, preventing data from being used for training.
    • However, it emphasizes that this measure might not guarantee data confidentiality, especially for sensitive information.
    • It again recommends ChatGPT Enterprise as a secure alternative for handling confidential, proprietary, or HIPAA-protected data, emphasizing its compliance with SOC 2 standards and its strict policy against using data for training.

    Page 61:

    • This page concludes the limitations section, offering a call to action:
    • It encourages users working with secure data to advocate for adopting ChatGPT Enterprise within their organizations, highlighting its value for secure data analysis.

    Page 62:

    • This page marks the conclusion of the chapter on the Advanced Data Analysis plugin, emphasizing the accomplishments of the tutorial and the potential for future applications:
    • It highlights the successful completion of a data analytics pipeline using ChatGPT, showcasing its power and efficiency.
    • It encourages users to leverage the project for their portfolios, demonstrating practical skills in data analysis using ChatGPT.
    • It reiterates the suitability of ChatGPT for ad hoc analysis, suggesting other plugins for more complex tasks, pointing towards upcoming chapters covering these tools.

    Page 63:

    • This final page serves as a wrap-up for the entire tutorial, offering congratulations and promoting the full data analytics course:
    • It acknowledges the users’ progress in learning to use ChatGPT for data analysis.
    • It encourages those who enjoyed the tutorial to consider enrolling in the full course for more in-depth knowledge and practical skills.

    The sources, as excerpts from a data analytics tutorial, provide a step-by-step guide to using ChatGPT, particularly the Advanced Data Analysis plugin, for various data analysis tasks. The tutorial covers a wide range of topics, from basic prompting techniques to data exploration, cleaning, visualization, and even predictive modeling using machine learning. It emphasizes the practicality and time-saving benefits of using ChatGPT for data analysis while also addressing limitations and potential workarounds. The tutorial effectively guides users through practical examples and encourages them to apply their learnings to real-world data analysis scenarios.

    • This tutorial covers using ChatGPT for data analytics, promising to save up to 20 hours a week.
    • It starts with ChatGPT basics like prompting and using it to read graphs, then moves into advanced data analysis including writing and executing code without coding experience.
    • The tutorial uses the GPT-4 model with browsing, analysis, plugins, and Advanced Data Analysis features, requiring a ChatGPT Plus subscription. It also includes a custom-built data analytics GPT for additional learning.
    • A practical project analyzing data science job postings from a SQL database is included. The project will culminate in a shareable GitHub repository.
    • No prior data analytics or coding experience is required.
    • ChatGPT improves performance: A Harvard study found that ChatGPT users completed tasks 25% faster and with 40% higher quality.
    • Advanced Data Analysis plugin: This powerful ChatGPT plugin allows users to upload files for analysis and insight generation.
    • Plugin timeout issue: The Advanced Data Analysis plugin can time out, requiring users to re-upload their files, though it retains the previous analysis context.
    • Data analysis capabilities: The plugin supports descriptive statistics, exploratory data analysis (EDA), data cleaning, predictive modeling, and custom queries.
    • Data cleaning example: The tutorial uses a dataset of data science job postings and demonstrates cleaning up inconsistencies in the “job location” column.
    • Two data cleaning tasks were performed: removing extra spaces in “Job Location” and removing “via ” from the “Via” column, renaming it to “Job Platform.”
    • Salary data analysis focused on the “Salary Yearly” column, ignoring hourly and monthly rates for visualization purposes.
    • Visualizations were created for average yearly salary by the top 10 most common job platforms, job titles, and locations. Initial visualizations for job platforms were incorrect, requiring prompt refinement to specify “most common.”
    • The visualizations revealed insights such as higher salaries for data scientist and lead data analyst roles, and higher salaries in locations like New York and California (though not in the top 10 most common locations).
    • A machine learning model was built to predict yearly salary based on job title, job platform, and location. A random forest was chosen for its ability to handle both numerical and categorical data and its lower sensitivity to outliers.
    • The model’s root mean square error (RMSE) is approximately $22,000, meaning predictions are on average off by that amount.
    • Testing the model with a “Data Analyst” role in the United States on LinkedIn yielded a predicted salary of $94,000, close to Glassdoor’s average of $80,000 (within the RMSE). A “Senior Data Analyst” test yielded a prediction of $117,000, very close to Glassdoor’s $121,000.
    • The “Data Analyst” predictions were deemed less accurate, possibly due to job title inconsistencies (senior roles misclassified as data analyst).
    • The tutorial demonstrated a complete data analytics pipeline (collection, cleaning, analysis, and modeling) without writing code, using ChatGPT and external resources like Glassdoor for validation.
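The two cleaning steps summarized above (stripping extra spaces from "Job Location" and turning "Via" into "Job Platform") map onto a few lines of pandas. This is an illustrative reconstruction, not the tutorial's actual code; only the column names are taken from the summary:

```python
import pandas as pd

def clean_postings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the two cleaning steps described in the summary."""
    df = df.copy()
    # Remove leading/trailing and repeated internal spaces in "Job Location"
    df["Job Location"] = (
        df["Job Location"].str.strip().str.replace(r"\s+", " ", regex=True)
    )
    # Drop the "via " prefix, then rename the column to "Job Platform"
    df["Via"] = df["Via"].str.replace("via ", "", regex=False)
    return df.rename(columns={"Via": "Job Platform"})
```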

    Here is a discussion about the ChatGPT for Data Analytics Beginner Tutorial, based on the provided source:

    • The tutorial is designed for data nerds who want to learn how to use ChatGPT for data analytics. [1]
    • The tutorial covers a wide range of topics, including the basics of ChatGPT, advanced data analysis, and how to use ChatGPT to automate tasks. [2]
    • The tutorial includes step-by-step exercises, a final Capstone project, and a certificate of completion. [3]
    • The tutorial requires a ChatGPT Plus or ChatGPT Enterprise subscription. [4]
    • The tutorial recommends using the GPT-4 model, which has browsing, analysis, and plugins. [5, 6]
    • The tutorial covers the layout of ChatGPT, how to use prompts, and how to set up custom instructions. [7-9]
    • The tutorial teaches you how to use ChatGPT to analyze images, including graphs and data models. [10, 11]
    • The tutorial covers the Advanced Data Analysis plugin, which allows you to upload files, perform descriptive and exploratory data analysis, and build machine learning models. [12-14]
    • The tutorial discusses the limitations of ChatGPT, including internet access, data limitations, and security concerns. [15]

    The tutorial is a valuable resource for anyone who wants to learn how to use ChatGPT for data analytics. It provides a comprehensive overview of the tool and its capabilities, and it includes practical exercises and examples.

    Data Analysis in the ChatGPT Tutorial

    The sources focus on using ChatGPT for data analysis, highlighting its capabilities and the process involved. Here’s a breakdown of data analysis as presented in the tutorial:

    • Advanced Data Analysis Plugin: This plugin is a core feature allowing users to upload data, analyze it, and generate insights [1, 2]. This plugin enables users to perform tasks without coding [3]. However, there are limitations regarding internet access, data size, and security concerns [4-6].
    • Data Analysis Pipeline: The tutorial walks through a typical data analysis pipeline, starting with data collection and cleaning, then moving to exploratory data analysis and building machine learning models [2]. This pipeline allows users to gain valuable experience in handling data and extracting insights.
    • Types of Analysis: The sources mention several types of analysis possible with ChatGPT:
    • Descriptive statistics: Summarizing data with metrics like count, mean, standard deviation, minimum, and maximum [7].
    • Exploratory Data Analysis (EDA): Visualizing data through histograms, bar charts, etc., to understand patterns and trends [8].
    • Predictive Modeling: Building machine learning models to predict outcomes based on input data [9]. The example provided focuses on predicting salary based on job title, platform, and location [9].
    • Data Cleaning: The tutorial emphasizes the importance of cleaning data before analysis, using examples like removing unnecessary spaces and standardizing column names [10, 11].
    • Data Visualization: Visualizing data is key for understanding and communicating insights. The tutorial showcases using ChatGPT to generate various graphs and charts based on analyzed data [12-14].
    • Machine Learning: The tutorial demonstrates building a machine learning model (random forest) to predict salary [15, 16]. It explains key concepts like RMSE (root mean squared error) to assess model accuracy [16].
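RMSE, as used above, is simply the square root of the mean squared difference between predicted and actual values, so a model with RMSE of roughly $22,000 is off by about that amount on a typical prediction. A minimal illustration (the numbers are made up):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two predictions, each off by $10,000, give an RMSE of $10,000:
# rmse([100_000, 80_000], [110_000, 70_000]) -> 10000.0
```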

    The tutorial effectively demonstrates how ChatGPT can streamline the data analysis process, making it accessible even for beginners. It provides practical steps and explanations, empowering users to analyze data, extract insights, and build predictive models, all without needing to write code.

    Advanced Analysis in ChatGPT: A Deep Dive

    The sources provide a detailed look at using ChatGPT for advanced data analysis. While “Advanced analysis” isn’t explicitly defined, the content suggests it refers to leveraging ChatGPT’s capabilities beyond basic tasks, focusing on extracting deeper insights and building predictive models. Let’s break down the key aspects discussed:

    • Advanced Data Analysis Plugin: This plugin serves as the foundation for advanced analysis in ChatGPT. It empowers users to perform intricate analyses without writing code, making it accessible for those without programming expertise.
    • Understanding and Setting Up: The sources emphasize the importance of understanding the plugin’s functionalities and correctly setting up ChatGPT for optimal results. This includes:
    • Choosing the Right Model: Opting for the GPT-4 model with browsing, analysis, and plugin access ensures you have the most advanced tools at your disposal.
    • Custom Instructions: Defining your context and desired output style through custom instructions helps ChatGPT understand your needs and tailor its responses.
    • Data Handling:
    • Importing Data: The plugin accepts various file types, including CSV, Excel, JSON, and even zipped files, enabling analysis of data from diverse sources.
    • Data Cleaning: The tutorial highlights the importance of data cleaning before analysis, demonstrating how to remove unnecessary spaces and standardize column names for consistency.
    • Types of Advanced Analysis:
    • Descriptive Statistics: Calculating metrics like count, mean, standard deviation, minimum, and maximum provides a numerical overview of your data.
    • Exploratory Data Analysis (EDA): Visualizing data through histograms, bar charts, and other appropriate graphs helps identify patterns, trends, and potential areas for deeper investigation.
    • Predictive Modeling: This is where the power of advanced analysis shines. The tutorial showcases building a machine learning model, specifically a random forest, to predict salary based on job title, platform, and location. It also explains how to interpret model accuracy using metrics like RMSE.
    • Iterative Process: The sources emphasize that data analysis with ChatGPT is iterative. You start with a prompt, analyze the results, refine your prompts based on insights, and continue exploring until you achieve the desired outcome.
    • Limitations to Consider: While powerful, the Advanced Data Analysis plugin has limitations:
    • No Internet Access: It cannot directly connect to online databases, APIs, or cloud-based data sources. Data must be downloaded and then imported.
    • File Size Restrictions: There’s a limit to the size of files (512MB) and the total dataset (2GB) you can upload.
    • Security Concerns: The free and plus versions of ChatGPT might not be suitable for handling sensitive data due to potential privacy risks. The Enterprise Edition offers enhanced security measures for confidential data.
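The descriptive-statistics step listed above corresponds closely to what pandas produces with `describe()`, which reports count, mean, standard deviation, minimum, quartiles, and maximum in one call. A quick sketch with made-up salary figures (the column name mirrors the tutorial's dataset):

```python
import pandas as pd

salaries = pd.Series([90_000, 110_000, 95_000, 120_000], name="Salary Yearly")
stats = salaries.describe()  # count, mean, std, min, quartiles, max
print(stats[["count", "mean", "min", "max"]])
```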

    The tutorial showcases how ChatGPT can be a powerful tool for advanced data analysis, enabling users to go beyond basic summaries and generate valuable insights. By understanding its capabilities, limitations, and the iterative process involved, you can leverage ChatGPT effectively to streamline your data analysis workflow, even without extensive coding knowledge.

    Data Visualization in the ChatGPT Tutorial

    The sources emphasize the crucial role of data visualization in data analysis, demonstrating how ChatGPT can be used to generate various visualizations to understand data better.

    Data visualization is essential for effectively communicating insights derived from data analysis. The tutorial highlights the following aspects of data visualization:

    • Exploratory Data Analysis (EDA): EDA is a key application of data visualization. The tutorial uses ChatGPT to create visualizations like histograms and bar charts to explore the distribution of data in different columns. These visuals help identify patterns, trends, and potential areas for further investigation.
    • Visualizing Relationships: The sources demonstrate using ChatGPT to plot data to understand relationships between different variables. For example, the tutorial visualizes the average yearly salary for the top 10 most common job platforms using a bar graph. This allows for quick comparisons and insights into how salary varies across different platforms.
    • Appropriate Visuals: The tutorial stresses the importance of selecting the right type of visualization based on the data and the insights you want to convey. For example, histograms are suitable for visualizing numerical data distribution, while bar charts are effective for comparing categorical data.
    • Interpreting Visualizations: The sources highlight that generating a visualization is just the first step. Proper interpretation of the visual is crucial for extracting meaningful insights. ChatGPT can help with interpretation, but users should also develop their skills in understanding and analyzing visualizations.
    • Iterative Process: The tutorial advocates for an iterative process in data visualization. As you generate visualizations, you gain new insights, which might lead to the need for further analysis and refining the visualizations to better represent the data.
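The "average salary by the top 10 most common job platforms" chart described above follows a standard two-step pandas pattern: first find the most common platforms, then average salaries within them. Column names mirror the summary; the code itself is an illustrative sketch, not the tutorial's own:

```python
import pandas as pd

def avg_salary_by_top_platforms(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Average yearly salary for the n most common job platforms."""
    top = df["Job Platform"].value_counts().head(n).index
    return (
        df[df["Job Platform"].isin(top)]
        .groupby("Job Platform")["Salary Yearly"]
        .mean()
        .sort_values(ascending=False)
    )

# avg_salary_by_top_platforms(df).plot.bar()  # renders the bar chart
```

Note the two-step order matters: "most common" is determined by posting count, while the bar heights are salary averages, which is exactly the distinction the tutorial's prompt refinement had to make explicit.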

    The ChatGPT tutorial demonstrates how the platform simplifies the data visualization process, allowing users to create various visuals without needing coding skills. It empowers users to explore data, identify patterns, and communicate insights effectively through visualization, a crucial skill for any data analyst.

    Machine Learning in the ChatGPT Tutorial

    The sources highlight the application of machine learning within ChatGPT, demonstrating its use in building predictive models as part of advanced data analysis. While the tutorial doesn’t offer a deep dive into machine learning theory, it provides practical examples and explanations to illustrate how ChatGPT can be used to build and utilize machine learning models, even for users without extensive coding experience.

    Here’s a breakdown of the key aspects of machine learning discussed in the sources:

    • Predictive Modeling: The tutorial emphasizes the use of machine learning for building predictive models. This involves training a model on a dataset to learn patterns and relationships, allowing it to predict future outcomes based on new input data. The example provided focuses on predicting yearly salary based on job title, job platform, and location.
    • Model Selection: The sources guide users through the process of selecting an appropriate machine learning model for a specific task. In the example, ChatGPT suggests three potential models: Random Forest, Gradient Boosting, and Linear Regression. The tutorial then explains factors to consider when choosing a model, such as the type of data (numerical and categorical), sensitivity to outliers, and model complexity. Based on these factors, ChatGPT recommends using the Random Forest model for the salary prediction task.
    • Model Building and Training: The tutorial demonstrates how to use ChatGPT to build and train the selected machine learning model. The process involves feeding the model with the chosen dataset, allowing it to learn the patterns and relationships between the input features (job title, platform, location) and the target variable (salary). The tutorial doesn’t go into the technical details of the model training process, but it highlights that ChatGPT handles the underlying code and calculations, making it accessible for users without programming expertise.
    • Model Evaluation: Once the model is trained, it’s crucial to evaluate its performance to understand how well it can predict future outcomes. The tutorial explains the concept of RMSE (Root Mean Squared Error) as a metric for assessing model accuracy. It provides an interpretation of the RMSE value obtained for the salary prediction model, indicating the average deviation between predicted and actual salaries.
    • Model Application: After building and evaluating the model, the tutorial demonstrates how to use it for prediction. Users can provide input data (e.g., job title, platform, location) to the model through ChatGPT, and it will generate a predicted salary based on the learned patterns. The tutorial showcases this by predicting salaries for different job titles and locations, comparing the results with data from external sources like Glassdoor to assess real-world accuracy.
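The model-building, evaluation, and prediction steps above can be sketched with scikit-learn. This is a hedged reconstruction of the kind of code ChatGPT generates behind the scenes, not the tutorial's actual code; the column names and helper functions are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

FEATURES = ["Job Title", "Job Platform", "Job Location"]

def train_salary_model(df: pd.DataFrame):
    """One-hot encode the categorical features and fit a random forest."""
    X = pd.get_dummies(df[FEATURES])
    y = df["Salary Yearly"]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)
    rmse = float(np.sqrt(mean_squared_error(y, model.predict(X))))
    return model, X.columns, rmse

def predict_salary(model, columns, title, platform, location):
    """Encode one input row against the training columns and predict."""
    row = pd.get_dummies(pd.DataFrame(
        [{"Job Title": title, "Job Platform": platform, "Job Location": location}]
    )).reindex(columns=columns, fill_value=0)
    return float(model.predict(row)[0])
```

The `get_dummies`/`reindex` pairing is what makes categorical inputs like "Senior Data Analyst" usable by the model: unseen categories simply encode to all zeros rather than raising an error.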

    The ChatGPT tutorial effectively demonstrates how the platform can be used for practical machine learning applications. It simplifies the process of building, training, evaluating, and utilizing machine learning models for prediction, making it accessible for users of varying skill levels. The tutorial focuses on applying machine learning within a real-world data analysis context, showcasing its potential for generating valuable insights and predictions.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog