Category: SQL

  • SQL for Data Analysis: Pivoting, Stats, Segmentation, Views, and Optimization

    The initial source, “01.pdf,” details an investigation of a database schema, focusing on understanding table relationships and performing revenue calculations. It demonstrates how to join tables to incorporate customer and product information, filter data for recent sales, and categorize customers based on revenue using case statements. The subsequent sections introduce the concept of pivoting data using aggregation and case statements, illustrating techniques for analyzing customer counts by region and calculating net revenue across different categories and years. Further content explores statistical functions within pivoting for median sales analysis and advances into segmentation using multiple conditions within case statements for detailed revenue breakdowns. The final portion of the provided text transitions into date calculations, covering functions like date_trunc and date_part for time series analysis, extracting date components, and using intervals and the age function to analyze processing times. This culminates in an introduction to window functions, explaining their syntax and application in calculations over partitions of data without collapsing rows, including ranking and running aggregates, and finally examining frame clauses for controlling the data within a window.

    Source Material Study Guide

    Quiz

    1. Explain the purpose and importance of the %sql magic command in the provided text. What happens if you try to run a SQL query without it?
    2. Describe what magic commands are in the context of the source material. Give at least two examples of magic commands mentioned besides %sql and explain their function.
    3. What is the Contoso database, and what are some of the key tables within it that are discussed in the excerpts? Briefly describe the purpose of the Sales, Customer, and Date tables.
    4. Explain how net revenue is calculated in the context of the Sales table. What columns are used in this calculation, and why is net price used instead of unit price?
    5. Describe the process of joining tables in SQL as demonstrated in the excerpts. What type of join is frequently used, and on what columns are the tables typically joined in the examples?
    6. What is a CASE WHEN statement, and how is it used in the provided text? Give an example of how it’s used to categorize data within a SQL query.
    7. Explain the concept of pivoting data as introduced in the “pivoting with case statements” section. How is the COUNT(DISTINCT CASE WHEN … END) syntax used to achieve this?
    8. Describe the DATE_TRUNC and EXTRACT functions as explained in the text. What are they used for, and what are some examples of date parts that can be extracted?
    9. Explain the purpose and basic syntax of a Common Table Expression (CTE). How are CTEs used to structure more complex SQL queries in the examples provided?
    10. Briefly describe the functionality of window functions as introduced in the excerpts. How do they differ from aggregate functions with a GROUP BY clause?

    Quiz Answer Key

    1. The %sql magic command is crucial for indicating to the Jupyter Notebook environment that the subsequent lines of code should be interpreted and executed as SQL queries. Without it, the code will be treated as regular Python code, leading to syntax errors and incorrect execution.
    2. Magic commands are special commands in the Jupyter Notebook environment that extend its functionality. Besides %sql, examples include %timeit, which measures the execution time of a line of code. A single % applies a magic command to one line, while a double %% applies it to the entire cell.
    3. The Contoso database is the dataset used throughout the course. Key tables discussed include Sales (containing transaction information like price and quantity), Customer (containing customer details like name and location), and Date (intended for date-based aggregations, though the course later sets it aside in favor of SQL date functions).
    4. Net revenue is calculated by multiplying the quantity of a product by its net price and the exchange rate. The net price is used because it represents the actual price charged to the customer after all discounts and adjustments.
    5. Joining tables combines rows from two or more tables based on a related column. Left joins are frequently used to keep all rows from the left table and matching rows from the right table. Tables are typically joined on key columns like ProductKey, CustomerKey, and date columns.
    6. A CASE WHEN statement allows for conditional logic within a SQL query, enabling the assignment of different values based on specified conditions. For example, it’s used to categorize customers as “high” or “low” value based on their net revenue.
    7. Pivoting data transforms rows into columns. The COUNT(DISTINCT CASE WHEN condition THEN column END) syntax is used to count the distinct occurrences of a specific column based on whether a certain condition is met, effectively creating new columns for each category.
    8. DATE_TRUNC truncates a date to a specified precision (e.g., month, year), setting the less significant parts to the start of that period, while EXTRACT retrieves a specific numeric component of a date. They are used for analyzing data at different time granularities. Examples of extractable parts include 'year', 'month', and 'day of week'.
    9. A Common Table Expression (CTE) is a temporary, named result set defined within the scope of a single query. CTEs are used to break down complex queries into smaller, more manageable, and readable parts, often used before a final SELECT statement or when joining to the same subquery multiple times.
    10. Window functions perform calculations across a set of table rows that are related to the current row, but unlike aggregate functions with GROUP BY, they do not collapse the rows into a single output row. They allow for calculations like running totals, rankings, and averages within partitions of data.
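
    As a sketch of the CTE pattern described in answer 9 (table and column names such as sales, customerkey, orderdate, quantity, and netprice are assumed from the Contoso schema as discussed in the excerpts, not confirmed verbatim):

    ```sql
    -- CTE: compute yearly net revenue per customer once,
    -- then aggregate over that named result set in the main query.
    WITH yearly_revenue AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM orderdate) AS order_year,
            SUM(quantity * netprice)     AS net_revenue
        FROM sales
        GROUP BY customerkey, EXTRACT(YEAR FROM orderdate)
    )
    SELECT
        order_year,
        AVG(net_revenue) AS avg_revenue_per_customer
    FROM yearly_revenue
    GROUP BY order_year
    ORDER BY order_year;
    ```

    Without the CTE, the per-customer aggregation would have to be repeated as a subquery wherever it is needed; naming it once keeps the final SELECT short and readable.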

    Essay Format Questions

    1. Discuss the importance of using magic commands in the context of interactive SQL querying within a Jupyter Notebook environment. How do they facilitate the integration of SQL with other programming languages like Python, as suggested in the excerpts?
    2. Analyze the strategy of exploring the Contoso database by examining individual tables (Sales, Customer, Date) and then combining them through joins. What are the benefits and potential challenges of this approach to understanding a new database schema?
    3. Evaluate the use of CASE WHEN statements in SQL for data categorization and pivoting, as demonstrated in the source material. Provide examples of scenarios where these techniques would be particularly valuable for data analysis and reporting.
    4. Compare and contrast the DATE_TRUNC and EXTRACT functions for manipulating date data in SQL. In what situations might one function be preferred over the other, and how do they contribute to more effective time-based analysis?
    5. Explain the role and advantages of using Common Table Expressions (CTEs) in writing complex SQL queries. How do CTEs improve query readability and maintainability, and can you provide a hypothetical example (based on the source material) where a CTE would significantly simplify a query?

    Glossary of Key Terms

    • Magic Command: Special commands in interactive environments like Jupyter Notebooks that provide extra functionality beyond the standard language syntax (e.g., %sql, %timeit).
    • SQL: Structured Query Language, a standard language for accessing and manipulating databases.
    • Jupyter Notebook: An interactive web-based environment for creating and sharing documents that contain live code, equations, visualizations, and narrative text.
    • Database: An organized collection of structured information, or data, typically stored electronically in a computer system.
    • Table: A structure within a database that organizes data into rows and columns.
    • Column: A vertical attribute or field in a table, containing a specific type of data for each record.
    • Row: A horizontal record in a table, representing a single instance or entry.
    • Query: A request for data or information from a database.
    • Syntax: The set of rules that govern the structure and format of statements in a programming or query language.
    • Autocomplete: A feature where the environment suggests or automatically completes code as the user is typing.
    • Syntax Error: An error in the structure or grammar of a statement that prevents it from being correctly interpreted.
    • Execution Time: The amount of time it takes for a program or query to run and complete.
    • Net Revenue: The actual revenue received after accounting for discounts, returns, and other adjustments.
    • Unit Price: The standard price of a single unit of a product before any discounts or adjustments.
    • Quantity: The number of units of a product.
    • Join: An SQL operation that combines rows from two or more tables based on a related column.
    • Left Join: Returns all rows from the left table and the matching rows from the right table. If there’s no match in the right table, NULLs are used for the columns of the right table.
    • Alias: A temporary name given to a table or column in a query to make it easier to refer to.
    • CASE WHEN Statement: A conditional expression in SQL that allows for different results based on specified conditions.
    • Pivoting: A data transformation technique that rotates rows into columns.
    • Aggregation: The process of summarizing data using functions like COUNT, SUM, AVG, MIN, and MAX.
    • DATE_TRUNC: An SQL function that truncates a timestamp or date value to a specified precision (e.g., day, month, year).
    • EXTRACT: An SQL function that retrieves a specific component (e.g., year, month, day) from a date or timestamp value.
    • CAST: An SQL operator used to change the data type of an expression.
    • Common Table Expression (CTE): A temporary, named result set defined within the execution of a single SQL statement.
    • Window Function: An SQL function that performs a calculation across a set of table rows that are related to the current row, without collapsing the rows.
    • Partition By: A clause used with window functions to divide the rows into partitions within which the function is applied.
    • Order By: A clause used to sort the rows within a result set or within a window function’s partition.

    Jupyter, SQL, and Database Exploration with PostgreSQL

    ## Briefing Document: Analysis of Provided Sources

    This briefing document summarizes the main themes and important ideas presented in the provided excerpts from “01.pdf”. The excerpts cover a range of topics related to using Jupyter Notebooks with SQL, exploring a database (Contoso), performing various SQL operations (including joins, aggregations, window functions, pivoting, date manipulation, and query optimization), and finally, setting up a local PostgreSQL environment with tools like pgAdmin and DBeaver for more robust database interaction and project management.

    **Main Themes:**

    1. **Introduction to Jupyter Notebooks and SQL Integration:** The initial sections focus on using Jupyter Notebooks with SQL through “magic commands” like `%sql`. This integration allows for writing and executing SQL queries directly within a Python environment.

    * **Key Idea:** The `%sql` magic command is crucial for executing SQL within a Jupyter Notebook cell. Without it, SQL syntax will be highlighted as incorrect and will result in errors.

    * **Quote:** “very important that you put these magic commands up at the top now so people don’t think i’m crazy magic commands are the actual official language of this”

    * **Key Idea:** Jupyter Notebook also supports other magic commands like `%timeit` for measuring code execution time, demonstrating the versatility of the environment. Single `%` applies the command to one line, while `%%` can apply to an entire cell (though not explicitly shown in the excerpt).
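
    A minimal sketch of how the two forms look in a notebook cell (assuming an ipython-sql-style setup with a connected database, as in the source):

    ```sql
    %%sql
    -- %%sql marks the whole cell as SQL; a single %sql would mark only one line,
    -- e.g.  %sql SELECT COUNT(*) FROM sales;
    SELECT *
    FROM sales
    LIMIT 10;
    ```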

    2. **Exploring the Contoso Database Schema:** The sources introduce the Contoso database as the primary dataset for the course. The excerpts detail the exploration of key tables like `Sales`, `Customer`, and `Date`.

    * **Key Idea:** Understanding the relationships between tables (e.g., `Date` table related to `Sales` via date columns, `Customer` and `Product` tables linked to `Sales` via keys) is fundamental for querying and analysis.

    * **Quote:** “last table to explore is that date table this is related using that date column here to the sales table order date and delivery date.”

    * **Key Idea:** The `Date` table, while useful for quick filtering in tools like Power BI, will be largely ignored in the course in favor of learning more flexible date functions in SQL.

    3. **Performing Fundamental SQL Operations:** The excerpts illustrate core SQL concepts such as calculating net revenue, joining tables to combine data from different entities, and using aliases for tables and columns.

    * **Key Idea:** Net revenue is calculated as `quantity * net_price` (with an exchange-rate multiplier when converting currencies). The `net_price` already accounts for discounts and promotions.

    * **Quote:** “the net price is the price after all the different discounts promotions or any adjustments so basically it’s what we actually charge to the customer when they pay for the product”

    * **Key Idea:** `LEFT JOIN` is used to combine tables while ensuring all rows from the left table are included. Aliases (e.g., `s` for `Sales`, `c` for `Customer`, `p` for `Product`) improve query readability.
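
    A sketch of the join pattern described above (column names such as customerkey, productkey, givenname, and categoryname are assumptions based on the Contoso schema as described in the excerpts):

    ```sql
    -- LEFT JOIN keeps every sales row even when a customer or product
    -- key has no match; aliases keep the query readable.
    SELECT
        s.orderdate,
        c.givenname,
        c.surname,
        p.categoryname,
        s.quantity * s.netprice AS net_revenue  -- net price already reflects discounts
    FROM sales AS s
    LEFT JOIN customer AS c ON s.customerkey = c.customerkey
    LEFT JOIN product  AS p ON s.productkey  = p.productkey
    LIMIT 10;
    ```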

    4. **Introduction to Pivoting Data with `CASE` Statements:** The sources introduce the concept of pivoting data, transforming rows into columns. This is achieved using `CASE WHEN` statements combined with aggregation functions like `COUNT` and `SUM`.

    * **Key Idea:** `CASE WHEN` allows for conditional logic within SQL queries, enabling the creation of new categories or columns based on existing data.

    * **Quote:** “we’re going to be using statements like case when and aggregation in order to pivot data but what the heck is pivoting data let’s take a look at this simple example”

    * **Key Idea:** Pivoting can be used to create summary tables where values from one column become headers in the output.
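
    A sketch of the customer-count pivot described in the source (continent values and the customerkey column are assumed, not quoted verbatim):

    ```sql
    -- One output column per continent. Each CASE emits the customer key
    -- only for rows in that continent (NULL otherwise), so
    -- COUNT(DISTINCT ...) counts customers per continent.
    SELECT
        COUNT(DISTINCT CASE WHEN continent = 'Europe'        THEN customerkey END) AS europe_customers,
        COUNT(DISTINCT CASE WHEN continent = 'North America' THEN customerkey END) AS north_america_customers,
        COUNT(DISTINCT CASE WHEN continent = 'Australia'     THEN customerkey END) AS australia_customers
    FROM customer;
    ```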

    5. **Date Manipulation with Functions like `DATE_TRUNC` and `TO_CHAR`:** The excerpts demonstrate how to extract specific parts of a date (e.g., month, year) using functions like `DATE_TRUNC` and `EXTRACT`, and how to format dates into desired string representations using `TO_CHAR`. Casting data types (e.g., to `DATE`) is also shown.

    * **Key Idea:** `DATE_TRUNC` allows truncating a date to a specified level of precision (e.g., month).

    * **Quote:** “specifically if you just want to specify one attribute you want to extract out of it such as something like month as we did you could either do quarter year decade century or even millennium”

    * **Key Idea:** `TO_CHAR` provides more flexible date formatting options using various format codes.

    6. **Introduction to Window Functions:** A significant portion of the excerpts is dedicated to introducing window functions. These functions perform calculations across a set of rows that are related to the current row, without collapsing the rows like `GROUP BY`.

    * **Key Idea:** Window functions use the `OVER()` clause to define the “window” of rows for the calculation. `PARTITION BY` divides the data into groups, and `ORDER BY` orders the rows within each partition.

    * **Quote:** “they let you perform calculations across a set of tables related to the current row…and like we showed they don’t group the results into a single output row this is very beneficial as we’re going to demonstrate some future exercises”

    * **Key Idea:** Examples include calculating running totals, ranks (`ROW_NUMBER`, `RANK`, `DENSE_RANK`), and moving averages using frame clauses (`ROWS BETWEEN …`).
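
    A sketch combining a ranking function and a running aggregate over the same window (sales columns assumed from the schema discussed above):

    ```sql
    -- Each customer's purchases stay as individual rows; the window
    -- adds a per-customer purchase number and a running revenue total.
    SELECT
        orderdate,
        customerkey,
        quantity * netprice AS net_revenue,
        ROW_NUMBER() OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
        ) AS purchase_number,
        SUM(quantity * netprice) OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
        ) AS running_revenue
    FROM sales;
    ```

    Note that, unlike a GROUP BY, every input row survives to the output with its window result attached.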

    7. **Lag and Lead Functions:** The excerpts introduce `LAG` and `LEAD` functions, which allow accessing data from previous or subsequent rows within a window partition. This is useful for calculating differences or growth rates over time.

    * **Key Idea:** `LAG(column, offset, default)` retrieves a value from a row `offset` rows before the current row. `LEAD` works similarly for subsequent rows.

    * **Quote:** “these type of things in a window function allow us instead of looking at the current row to allow us to look at things like the row above it or the row below it”
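
    A sketch of a month-over-month comparison using `LAG` (the monthly rollup is a hypothetical CTE built from the sales columns discussed above):

    ```sql
    WITH monthly AS (
        SELECT
            DATE_TRUNC('month', orderdate) AS month,
            SUM(quantity * netprice)       AS revenue
        FROM sales
        GROUP BY 1
    )
    SELECT
        month,
        revenue,
        LAG(revenue, 1) OVER (ORDER BY month)           AS prev_month_revenue,
        revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month_change
    FROM monthly
    ORDER BY month;
    ```

    The first row has no preceding row, so `LAG` returns NULL there unless a third default argument is supplied.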

    8. **Frame Clauses in Window Functions:** The sources explain how frame clauses (`ROWS BETWEEN`) within the `OVER()` clause can further define the set of rows to be considered for a window function calculation (e.g., a moving average over the preceding three months). `CURRENT ROW`, `PRECEDING`, and `FOLLOWING` are key keywords.

    * **Key Idea:** Frame clauses allow for flexible calculations based on a sliding window of rows.

    * **Quote:** “this allows us to specify a physical offset from the current row such as maybe the three preceding rows or maybe the two following rows”
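
    A sketch of a three-month moving average using a frame clause (the monthly_revenue CTE is a hypothetical rollup of the sales table discussed above):

    ```sql
    WITH monthly_revenue AS (
        SELECT
            DATE_TRUNC('month', orderdate) AS month,
            SUM(quantity * netprice)       AS revenue
        FROM sales
        GROUP BY 1
    )
    SELECT
        month,
        revenue,
        AVG(revenue) OVER (
            ORDER BY month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- this month plus the two before it
        ) AS moving_avg_3mo
    FROM monthly_revenue
    ORDER BY month;
    ```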

    9. **Setting Up a Local PostgreSQL Environment:** The later excerpts transition to setting up a local PostgreSQL database, including installing the server and pgAdmin (a GUI administration tool).

    * **Key Idea:** Having a local database environment allows for more hands-on practice and development without relying on remote systems.

    * **Key Idea:** pgAdmin is used for managing the PostgreSQL server, creating databases, running queries, and exploring the database schema.

    10. **Introducing DBeaver as an Alternative Database Tool:** DBeaver is introduced as a more versatile database management tool that can connect to various database systems, unlike pgAdmin which is specific to PostgreSQL.

    * **Key Idea:** DBeaver offers a more comprehensive set of features for database development and administration, including project management, SQL editing enhancements (auto-completion, formatting), and data export/visualization capabilities.

    * **Quote:** “dbeaver now this is a database management tool so [it] can connect to different [types of databases]”

    11. **Project Management in DBeaver:** The excerpts demonstrate how to create projects in DBeaver to organize SQL scripts, bookmarks, and other related files. This helps in structuring database development work.

    12. **Introduction to Views:** Views are introduced as virtual tables that represent the result of a stored query. They simplify complex queries and provide a level of abstraction over the underlying tables.

    * **Key Idea:** Views are created using the `CREATE VIEW` statement and can be queried like regular tables.

    * **Quote:** “it’s a virtual table that allows us to show the results of a stored query in it”
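
    A sketch of the view pattern described above (the view name and the join columns are assumptions based on the Contoso schema as discussed):

    ```sql
    -- Store a frequently used aggregation as a virtual table...
    CREATE VIEW customer_revenue AS
    SELECT
        c.customerkey,
        c.givenname,
        c.surname,
        SUM(s.quantity * s.netprice) AS total_net_revenue
    FROM sales AS s
    LEFT JOIN customer AS c ON s.customerkey = c.customerkey
    GROUP BY c.customerkey, c.givenname, c.surname;

    -- ...then query it like any regular table.
    SELECT *
    FROM customer_revenue
    ORDER BY total_net_revenue DESC
    LIMIT 10;
    ```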

    13. **Introduction to VS Code for Project Development:** VS Code is presented as a powerful code editor, not just for SQL but also for creating and managing project documentation (like README files in Markdown). Its preview capabilities for Markdown are highlighted.

    * **Key Idea:** VS Code, with its extensions, provides a robust environment for both code (SQL) and documentation.

    14. **Query Optimization with `EXPLAIN` and `EXPLAIN ANALYZE`:** The final excerpts touch upon basic query optimization by introducing the `EXPLAIN` and `EXPLAIN ANALYZE` commands, which provide insights into the query execution plan and performance.

    * **Key Idea:** `EXPLAIN` shows the planned steps the database will take to execute a query. `EXPLAIN ANALYZE` actually executes the query and provides timing information.

    * **Quote:** “explain demonstrates the execution plan without actually executing it whereas explain analyze basically means like it’s going to analyze it and it actually does execute it”
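
    A sketch of the two commands side by side (the query itself is a hypothetical aggregation over the sales table discussed above):

    ```sql
    -- Shows the planned execution steps without running the query:
    EXPLAIN
    SELECT customerkey, SUM(quantity * netprice)
    FROM sales
    GROUP BY customerkey;

    -- Actually executes the query and reports real timing per step:
    EXPLAIN ANALYZE
    SELECT customerkey, SUM(quantity * netprice)
    FROM sales
    GROUP BY customerkey;
    ```

    Because EXPLAIN ANALYZE really runs the statement, it should be used with care on writes (e.g., wrapped in a transaction that is rolled back).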

    **Most Important Ideas and Facts:**

    * Jupyter Notebooks can seamlessly integrate with SQL using magic commands like `%sql`.

    * The Contoso database is the central dataset for learning SQL concepts.

    * Understanding table relationships is crucial for effective querying.

    * `CASE WHEN` statements are essential for conditional logic and data pivoting.

    * Window functions provide powerful analytical capabilities without collapsing rows, enabling calculations like running totals, rankings, and moving averages.

    * `LAG` and `LEAD` functions allow for comparisons between rows.

    * Frame clauses in window functions define the scope of rows for calculations.

    * Setting up a local PostgreSQL environment with pgAdmin and DBeaver provides a robust platform for database learning and project development.

    * DBeaver is a versatile database tool supporting multiple database systems.

    * Views simplify queries and provide abstraction.

    * VS Code is a valuable tool for both SQL development and project documentation (using Markdown).

    * `EXPLAIN` and `EXPLAIN ANALYZE` are used to understand and optimize SQL query execution.

    These excerpts lay a comprehensive foundation for learning intermediate SQL concepts, ranging from basic query structures and database exploration to advanced analytical functions and development environment setup. The progression through Jupyter Notebooks to local database tools like PostgreSQL and DBeaver indicates a move towards more practical and real-world database interaction and project management.

    Exploring Data with SQL Magic and the Contoso Database

    1. What is the purpose of the %sql magic command in the provided context?

    The %sql magic command is essential for executing SQL queries within the environment (likely a Jupyter Notebook). When placed at the beginning of a cell or line, it signals to the interpreter that the subsequent text should be treated as a SQL query to be run against the connected database. Without this command, the SQL syntax would be misinterpreted, leading to errors. Using two percent signs (%%sql) applies the command to the entire cell, while a single percent sign (%sql) applies it only to the current line.

    2. Beyond SQL, what other types of “magic commands” are mentioned and what is their general function?

    The text mentions that %sql is not the only magic command available. It specifically highlights the %timeit magic command as an example. The general function of these magic commands is to provide additional functionalities and tools within the coding environment, such as measuring the execution time of code (%timeit) or facilitating interaction with external systems or specific languages (like SQL with %sql).

    3. What is the Contoso database and what are some of the key tables within it that are explored in the excerpts?

    The Contoso database is the primary dataset used throughout the lessons. The excerpts introduce and explore several key tables:

    • Sales: This table contains transactional data, including information about orders, quantities, net prices (prices after discounts), and order dates. It’s central to calculating revenue.
    • Customer: This table holds information about customers, such as their given name, surname, country, continent, and customer key.
    • Product: This table contains details about the products being sold, including product key, product name, category name, and subcategory name.
    • Date: This table contains various date-related attributes that can be used for aggregation and filtering based on dates, although the course later emphasizes using date functions instead of relying solely on this table.

    4. How is “net revenue” calculated within the context of the Contoso database, and why is it considered important?

    Net revenue is calculated by multiplying the quantity of a product sold by its net price (the price after all discounts and adjustments) and the exchange rate. It is considered important because it represents the actual revenue received from customers after accounting for discounts and promotions, reflecting the true value of sales transactions.

    5. What is “pivoting data” as described in the excerpts, and how is it achieved using SQL?

    Pivoting data involves transforming rows into columns. The example provided shows how to take customer counts grouped by continent (originally in rows) and restructure the output to have each continent as a separate column displaying the total customer count for that continent. This is achieved using aggregate functions (like COUNT DISTINCT) combined with CASE WHEN statements to conditionally assign values to the new columns based on the continent.

    6. What is the purpose of the DATE_TRUNC and EXTRACT functions when working with dates, and how do they differ from the TO_CHAR function?

    • DATE_TRUNC is used to truncate a date to a specified level of precision, such as month, quarter, or year. It returns a timestamp or date with the less significant parts set to the beginning of the time period (e.g., the first day of the month).
    • EXTRACT is used to retrieve a specific component from a date or timestamp, such as the year, month, or day. It returns a numeric value representing that part.
    • TO_CHAR is used to format a date or timestamp as a text string according to a specified pattern. This allows for flexible output formats, such as extracting the month name or formatting the date in a particular way.

    While DATE_TRUNC and EXTRACT help in manipulating and retrieving date parts for analysis or grouping, TO_CHAR focuses on presenting date information in a desired textual format.
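
    The three functions can be compared in a single query (column names assumed from the Contoso schema as described above):

    ```sql
    SELECT
        orderdate,
        DATE_TRUNC('month', orderdate)       AS month_start,    -- timestamp at the first of the month
        DATE_TRUNC('month', orderdate)::date AS month_start_dt, -- same value cast to DATE
        EXTRACT(YEAR FROM orderdate)         AS order_year,     -- numeric component
        TO_CHAR(orderdate, 'Mon YYYY')       AS month_label     -- formatted text, e.g. 'Jan 2020'
    FROM sales
    LIMIT 5;
    ```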

    7. What are “window functions” and how do they differ from standard SQL aggregate functions? What are some examples of window functions discussed in the excerpts?

    Window functions perform calculations across a set of rows that are related to the current row, without collapsing the rows into a single output row like standard aggregate functions (e.g., SUM, COUNT, AVG) do. They allow you to access and perform calculations on a “window” of data defined by a PARTITION BY clause (dividing the data into groups) and an ORDER BY clause (specifying the order within each partition).

    Examples of window functions discussed include:

    • Aggregate functions used as window functions (e.g., AVG() OVER (…)).
    • Ranking functions (ROW_NUMBER(), RANK(), DENSE_RANK()).
    • Value functions (FIRST_VALUE(), LAST_VALUE(), NTH_VALUE(), LAG(), LEAD()).
    • Percentile functions (PERCENTILE_CONT()).

    8. What is “cohort analysis” as demonstrated in the excerpts, and what kind of insights can it provide about customer behavior?

    Cohort analysis involves grouping users or customers based on a shared characteristic, typically the time they acquired a product or service (their “cohort year” or “first purchase date”). It then tracks their behavior over time. The excerpts demonstrate cohort analysis by examining how different cohorts of customers contribute to total revenue and customer retention in subsequent years. This can provide insights into customer lifetime value, the effectiveness of acquisition strategies over time, and customer churn patterns by showing how engagement and spending change for different initial groups of customers.
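
    A sketch of the cohort pattern described above, assigning each customer to the year of their first purchase and then tracking revenue by cohort (table and column names are assumptions based on the Contoso schema as discussed):

    ```sql
    -- Step 1: each customer's cohort year is the year of their first order.
    WITH cohorts AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year
        FROM sales
        GROUP BY customerkey
    )
    -- Step 2: break revenue and active-customer counts down by
    -- cohort year and purchase year to see retention over time.
    SELECT
        co.cohort_year,
        EXTRACT(YEAR FROM s.orderdate) AS purchase_year,
        SUM(s.quantity * s.netprice)   AS net_revenue,
        COUNT(DISTINCT s.customerkey)  AS active_customers
    FROM sales AS s
    JOIN cohorts AS co ON s.customerkey = co.customerkey
    GROUP BY co.cohort_year, EXTRACT(YEAR FROM s.orderdate)
    ORDER BY co.cohort_year, purchase_year;
    ```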

    SQL for Data Exploration: A Course Overview

    Based on the sources, data exploration within the context of this SQL course for data analytics appears to be a crucial initial step involving understanding the structure and content of a database using SQL queries and visualization techniques.

    Here’s a breakdown of data exploration as presented in the course:

    • Understanding the Database Structure: The course emphasizes the importance of getting familiar with the database schema, which includes identifying the different tables and understanding how they relate to each other. The Entity Relationship Diagram (ERD) of the Contoso database is introduced as a tool to visualize these relationships, particularly how dimensional tables like store, product, and customer relate to the main fact table, sales.
    • Examining Tables and Columns: Data exploration involves inspecting individual tables to understand the columns they contain and the types of information stored in them. This is demonstrated by using SQL queries like SELECT * FROM sales LIMIT 10 to view the first few rows and identify the available columns such as dates, customer key, store key, product key, quantity, price, cost, and currency information.
    • Exploring Metadata: The course also shows how to query the information schema, a meta-database, to discover the tables within a database and the columns within each table. Specifically, it demonstrates using SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'; to list the tables and SELECT * FROM information_schema.columns WHERE table_name = 'customer'; to see all the column names in the customer table.
    • Using SQL for Initial Analysis: Simple SQL queries are used to get a first look at the data and its characteristics. For example, selecting all columns from a table and limiting the number of rows allows for a quick overview of the data’s format and values.
    • Leveraging Tools for Exploration: The course utilizes Google Colab in the first half, which allows for running SQL queries and provides features like converting query results to interactive tables and generating visualizations. The integration with Gemini AI is also mentioned as a way to assist in generating SQL queries for data exploration. Later in the course, pgAdmin and DBeaver are introduced as more advanced database tools that allow for visual exploration of the schema, tables, and data. DBeaver, in particular, is highlighted for its ability to view table data in a grid format and examine ER diagrams.
    • Understanding Data Relationships through Joins: As part of data exploration, the course demonstrates how to use JOIN clauses (specifically LEFT JOIN) to combine data from multiple related tables, such as sales, customer, and product, to understand how different entities interact and to bring together relevant attributes for analysis.
    • Identifying Key Fields: The initial exploration helps in identifying key fields that will be important for further analysis, such as the different keys used to relate tables and the metrics (e.g., quantity, net price) available for calculations like net revenue.

    In essence, data exploration in this course lays the foundation for more advanced data analytics by ensuring a solid understanding of the available data, its structure, and its basic characteristics through the use of SQL and database exploration tools. This initial phase is crucial for formulating meaningful analytical questions and developing effective SQL queries for deeper insights.

    SQL for Time Series: Date Calculations

    Based on the sources, this SQL course includes a chapter dedicated to date calculations, emphasizing their importance for time series analysis. The course covers several key date and time functions and keywords:

    • DATE_TRUNC() Function: This function allows you to truncate a timestamp down to a specified level of precision, such as year, quarter, month, week, day, hour, etc. For example, DATE_TRUNC('month', order_date) truncates order_date to the first day of its month (preserving the month and year). The output of DATE_TRUNC() is a timestamp, which can be cast to a DATE data type if needed using the ::date operator.
    • TO_CHAR() Function: This function provides a flexible way to format date and time values into strings based on various format patterns. You provide a timestamp or date and a format string to specify the desired output. For instance, TO_CHAR(order_date, 'YYYY') extracts the year, and TO_CHAR(order_date, 'MM-YYYY') extracts the month and year in the specified format. TO_CHAR() offers more customization compared to DATE_TRUNC() as you can combine different components in your desired order.
    • DATE_PART() Function: This function extracts a specific component from a date or timestamp, such as year, month, day, hour, minute, second, etc. The syntax involves specifying the part you want (e.g., 'year', 'month') as a string and then the source date or timestamp. For example, DATE_PART('year', order_date) extracts the year. The source mentions that the output might include decimals, which might not always be desirable.
    • EXTRACT() Function: Similar to DATE_PART(), EXTRACT() also retrieves a specific component from a date or timestamp. However, the syntax is slightly different: you specify the part (e.g., YEAR, MONTH, DAY) as an uppercase identifier followed by the keyword FROM and the date or timestamp. For example, EXTRACT(YEAR FROM order_date) extracts the year. The course prefers EXTRACT() over DATE_PART() for components like year, month, and day as it typically returns integer values without unnecessary precision.
    • CURRENT_DATE: This keyword returns the current date at the time the query is executed, based on the server’s time zone or a specified time zone (though specifying a time zone is optional if the default is sufficient).
    • NOW(): This function returns the current date and time (timestamp) at the moment the query is executed.
    • INTERVAL Keyword: The INTERVAL keyword represents a span of time, defined in units like days, months, years, hours, etc. You create an interval with the keyword INTERVAL followed by a value and a unit (e.g., INTERVAL '5 year', INTERVAL '6 month'). Intervals can be added to or subtracted from dates and timestamps for date arithmetic.
    • AGE() Function: This function calculates the difference between two timestamps or dates, returning the result as an interval. The order of the dates matters; AGE(end_date, start_date) will yield a positive interval. You can then extract specific components from the resulting interval, such as the number of days, using the EXTRACT() function.
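    The functions above can be seen side by side in one query. This is a sketch assuming a `sales` table with `orderdate` and `deliverydate` columns (names may differ in the actual schema):

    ```sql
    -- Illustrative date/time expressions (column names are assumptions)
    SELECT
        DATE_TRUNC('month', orderdate)::date AS order_month,     -- timestamp truncated, cast to DATE
        TO_CHAR(orderdate, 'MM-YYYY')        AS month_year,      -- formatted string
        EXTRACT(YEAR FROM orderdate)         AS order_year,      -- year component
        orderdate + INTERVAL '6 month'       AS six_months_later, -- date arithmetic
        AGE(deliverydate, orderdate)         AS processing_time   -- interval between two dates
    FROM sales
    LIMIT 5;
    ```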

    The course also demonstrates how these date functions are used in conjunction with other SQL clauses for analysis:

    • Filtering Dates with WHERE Clause: Date functions are commonly used in the WHERE clause to filter data based on specific date ranges or conditions. Examples include filtering orders within a specific year using EXTRACT(YEAR FROM order_date) = 2023 or finding orders within the last 5 years using order_date >= CURRENT_DATE - INTERVAL '5 year'.
    • Grouping by Date Components with GROUP BY Clause: Functions like DATE_TRUNC() or TO_CHAR() are useful for grouping data by specific time periods, such as monthly sales by grouping on DATE_TRUNC('month', order_date) or TO_CHAR(order_date, 'MM-YYYY').
    • Ordering by Date with ORDER BY Clause: Dates can be used in the ORDER BY clause to sort results chronologically.
    • Calculating Time Differences: The AGE() function is used to calculate the duration between events, like the processing time of an order by finding the age between the order date and the delivery date.

    The importance of dynamic filtering using functions like CURRENT_DATE and INTERVAL is highlighted, as it allows for creating queries that automatically adjust based on the current time, such as always retrieving data for the last 5 years.
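    A sketch of this dynamic-filtering pattern, combining DATE_TRUNC() for grouping with CURRENT_DATE and INTERVAL for a self-adjusting window (column names `orderdate`, `quantity`, `netprice`, and `exchangerate` are assumptions):

    ```sql
    -- Monthly net revenue for the last 5 years, relative to whenever the query runs
    SELECT
        DATE_TRUNC('month', orderdate)::date        AS month,
        SUM(quantity * netprice * exchangerate)     AS net_revenue
    FROM sales
    WHERE orderdate >= CURRENT_DATE - INTERVAL '5 year'  -- window moves with the current date
    GROUP BY month
    ORDER BY month;
    ```

    Because the filter is computed at execution time, the same query always covers the most recent five years without anyone editing hard-coded dates.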

    SQL Course: Calculating Net Revenue

    Based on the sources, net revenue in this SQL course is consistently calculated by taking into account the quantity of items sold, the net price of each item, and the exchange rate if currency conversion is needed.

    Here’s a breakdown of the net revenue calculation process as described in the sources:

    • Net Price Defined: The net price is the price that the customer actually pays for a product after all applicable discounts, promotions, or adjustments have been applied. It is explicitly stated that the net price is less than the unit price due to these reductions.
    • Basic Calculation: The fundamental way to calculate the net revenue for a particular transaction is by multiplying the quantity of the product purchased by its net price. This can be represented as:
    • Net Revenue = Quantity * Net Price
    • Incorporating Exchange Rates: In the context of the Contoso database used in the course, transactions may involve different currencies (e.g., pounds and US dollars). To standardize the revenue in a common currency (like US dollars, as the instructor prefers), an exchange rate is applied. The complete formula for net revenue used throughout the course is:
    • Net Revenue = Quantity * Net Price * Exchange Rate
    • This formula is used in various lessons when calculating total revenue, revenue by category, customer lifetime value, and for cohort analysis.
    • Example of Currency Conversion: The source provides an example where revenue figures are initially in pounds and then converted to US dollars by multiplying by an appropriate exchange rate.
    • Application in SQL Queries: The course demonstrates the use of this net revenue calculation within SQL SELECT statements, often with the result being aliased as net_revenue or total_net_revenue. This calculation is then used in aggregations with the SUM() function to find total revenues for different groupings of data (e.g., by order date, category, customer cohort).
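    Putting the formula into a query might look like the following sketch, assuming `quantity`, `netprice`, and `exchangerate` columns on the `sales` fact table and a `categoryname` column on `product` (names may differ in the actual schema):

    ```sql
    -- Total net revenue by product category
    SELECT
        p.categoryname,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue
    FROM sales s
    LEFT JOIN product p ON s.productkey = p.productkey
    GROUP BY p.categoryname
    ORDER BY total_net_revenue DESC;
    ```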

    Therefore, to calculate net revenue in the context of this course and the Contoso database, you generally need to multiply the quantity of the products sold by their respective net prices, and then adjust for currency differences by multiplying by the relevant exchange rate. The course emphasizes that the net price already reflects any discounts or adjustments, representing the actual amount charged to the customer.

    Customer Segmentation: A Data-Driven Approach

    Based on the sources, customer segmentation is a key data analytics concept that involves dividing customers into distinct groups based on shared characteristics or behaviors. The goal of customer segmentation is to enable more targeted analysis and tailored strategies for different customer groups.

    Here are the main aspects of customer segmentation discussed in the sources:

    • Definition: Customer segmentation involves taking large datasets and breaking them down into smaller, more manageable pieces to analyze different behaviors within those groups. This allows for a deeper understanding of customer behavior and facilitates more effective decision-making.
    • Methods using CASE WHEN Statements: The course emphasizes using the CASE WHEN statement as a fundamental tool for customer segmentation. This allows for the creation of new columns that categorize customers based on specified conditions.
    • Simple Binary Segmentation: Customers can be segmented into two groups, such as “high value” and “low value,” based on a single criterion like net revenue threshold (e.g., above or below $1,000).
    • Segmentation by Multiple Conditions: More advanced segmentation can involve multiple conditions using the AND operator within a CASE WHEN statement. For example, segmenting customers based on a combination of the year of purchase and whether their net revenue is above or below the median.
    • Segmentation into Multiple Tiers: Customers can be divided into more than two segments (e.g., low, medium, high value) using multiple WHEN clauses within a single CASE block. This allows for a more granular understanding of customer value.
    • Segmentation based on Net Revenue and Spending: A primary method for customer segmentation in the course involves analyzing customer spending, often using net revenue as the key metric.
    • Segmentation using Percentiles: The course demonstrates segmenting customers into tiers (low value, mid value, high value) based on their total lifetime value (LTV) using percentiles (25th and 75th). Customers falling below the 25th percentile are categorized as “low value,” those between the 25th and 75th percentiles as “mid value,” and those above the 75th percentile as “high value”.
    • Purpose of Segmentation: The primary goals of customer segmentation highlighted in the sources include:
    • Identifying Valuable Customers: Understanding who the most valuable customers are based on their spending or LTV.
    • Targeted Marketing: Enabling businesses to target specific customer groups with marketing campaigns tailored to their needs and value.
    • Analyzing Group Behavior: Facilitating the analysis of different customer groups to understand their spending habits, retention rates, and other key behaviors.
    • Developing Business Strategies: Providing insights that can inform business decisions and strategies for customer engagement, retention, and growth.
    • Implementation in SQL: The process of customer segmentation typically involves:
    1. Calculating relevant metrics like total net revenue or lifetime value.
    2. Using CASE WHEN statements to create a new column that assigns customers to different segments based on defined criteria.
    3. Aggregating data by these segments to analyze their characteristics, such as total revenue contribution, customer count, and average value.
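    The three steps above can be sketched in a single query. This is a sketch under assumed column names (`customerkey`, `quantity`, `netprice`, `exchangerate`), not the course's exact solution:

    ```sql
    -- Step 1: compute each customer's lifetime value
    WITH ltv AS (
        SELECT customerkey,
               SUM(quantity * netprice * exchangerate) AS total_ltv
        FROM sales
        GROUP BY customerkey
    ),
    -- Step 2: find the 25th and 75th percentile boundaries
    bounds AS (
        SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS p25,
               PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS p75
        FROM ltv
    )
    -- Step 3: assign tiers with CASE WHEN and aggregate by segment
    SELECT
        CASE
            WHEN l.total_ltv <  b.p25 THEN 'low value'
            WHEN l.total_ltv <= b.p75 THEN 'mid value'
            ELSE 'high value'
        END                AS segment,
        COUNT(*)           AS customer_count,
        SUM(l.total_ltv)   AS segment_revenue,
        AVG(l.total_ltv)   AS avg_ltv
    FROM ltv l
    CROSS JOIN bounds b
    GROUP BY segment;
    ```

    The CROSS JOIN attaches the single row of percentile boundaries to every customer so the CASE expression can compare against them.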

    In summary, customer segmentation as taught in this course is a crucial analytical technique leveraging SQL’s conditional logic and aggregate functions to categorize customers based on their value and behavior. This process allows for a more nuanced understanding of the customer base and enables businesses to implement more effective and targeted strategies.

    SQL Query Optimization: Techniques and Analysis

    Based on the sources, query optimization is an important topic covered in the second half of this SQL course. The goal of query optimization is to improve the performance and efficiency of SQL queries, making them run faster and consume fewer resources.

    Here’s a breakdown of the key aspects of query optimization discussed in the sources:

    • Understanding the Execution Plan with EXPLAIN: The course emphasizes the use of the EXPLAIN keyword to understand how the database plans to execute a query.
    • EXPLAIN shows the execution plan without actually running the query. This plan details the steps the database will take, such as table scans and joins.
    • EXPLAIN ANALYZE goes a step further by executing the query and providing actual execution times, planning time, estimated costs, number of rows processed, and other statistics. This allows for a more precise understanding of query performance. The output of EXPLAIN ANALYZE can help identify bottlenecks in a query.
    • The output of EXPLAIN and EXPLAIN ANALYZE includes details about the cost (an arbitrary unit assigned by PostgreSQL), the estimated number of rows, and the width (row size in bytes) for each step in the execution plan.
    • DBeaver provides a feature to visualize the execution plan, offering another way to understand the query execution flow.
    • Query Optimization Techniques: The course covers various techniques to optimize SQL queries, categorized as beginner, intermediate, and advanced.
    • Beginner Techniques:
    • Using LIMIT: Employing the LIMIT clause to restrict the number of rows returned can significantly reduce query execution time, especially when dealing with large tables and only a subset of data is needed. The sources demonstrate a substantial decrease in execution time when LIMIT is used.
    • Being Selective with Columns (SELECT Specific Columns vs. SELECT *): While PostgreSQL might sometimes efficiently retrieve data using SELECT *, it’s generally recommended to select only the specific columns required for the analysis. This practice can be more efficient, particularly in large databases, by reducing the amount of data that needs to be processed and transferred.
    • Using WHERE Instead of HAVING: Filtering data using the WHERE clause before aggregation (with GROUP BY) is generally more efficient than filtering the aggregated results with the HAVING clause. The WHERE clause reduces the number of rows that need to be processed by the aggregation step.
    • Intermediate Techniques:
    • Minimizing GROUP BY Operations: Reducing the number of columns in the GROUP BY clause, especially if grouping by columns that have repeating values within the context of the aggregation, can lead to performance improvements. The course demonstrates that removing an unnecessary column from the GROUP BY clause can decrease execution time. In cases where seemingly redundant GROUP BY columns exist, using aggregation functions like MAX() on those columns can help minimize the GROUP BY while preserving the desired information.
    • Reducing JOIN Operations: Minimizing the number of JOIN operations and considering the type of JOIN used can impact query performance. Using functions to extract necessary information instead of joining tables can sometimes be more efficient. Additionally, using more specific join types like INNER JOIN when all matching records are expected in both tables can be slightly more performant than less restrictive joins like LEFT JOIN.
    • Optimizing ORDER BY Clauses: Optimizing the ORDER BY clause involves limiting the number of columns being sorted, avoiding sorting on computed columns or function calls if possible, and sorting by the most selective columns first. Utilizing database indexes for sorting can also be beneficial, although index management is typically handled by database administrators.
    • Advanced Techniques:
    • These include using proper data types, implementing indexing on frequently queried columns to speed up data retrieval, and employing table partitioning for very large tables to improve query performance. These techniques are often managed at the database administration level.
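    A minimal sketch of the EXPLAIN workflow described above, combined with the WHERE-before-aggregation guideline (table and column names are assumptions):

    ```sql
    -- Show the planned steps, costs, row estimates, and widths without running the query
    EXPLAIN
    SELECT orderdate, SUM(quantity * netprice) AS revenue
    FROM sales
    GROUP BY orderdate;

    -- Actually execute and report real timings; note the WHERE clause filters rows
    -- BEFORE the aggregation, which is generally cheaper than filtering with HAVING
    EXPLAIN ANALYZE
    SELECT orderdate, SUM(quantity * netprice) AS revenue
    FROM sales
    WHERE orderdate >= DATE '2023-01-01'
    GROUP BY orderdate;
    ```

    Comparing the two plans before and after a change (adding LIMIT, trimming columns, moving a HAVING condition into WHERE) is how the course suggests verifying that an optimization actually helped.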

    By understanding and applying these query optimization techniques, along with analyzing the execution plan provided by EXPLAIN and EXPLAIN ANALYZE, users can write more efficient SQL queries that perform better, especially when working with large datasets.

    SQL for Data Analytics – Full Intermediate Course

    The Original Text

    data nerds welcome to this full course tutorial on intermediate sql for data analytics this is the course for those that understand the basics of sql but want to take it to the next level perfect for those that took my first course on this now to master this tool we’ll break down more advanced sql concepts in short 10-minute lessons during this you’re going to be working right alongside me completing realworld exercises following each lesson you’ll have the option to do interview level practice problems to not only prep you for the job but also reinforce your learnings and by the end of the course we’ll have used sql to build a fully customizable portfolio project that you can share to demonstrate your experience now sql is by far the most popular tool of data analysts for those in the united states it’s the top skill that’s in almost half of all job postings for data analysts and this only increases in demand for senior data analyst roles coming in at two out of every three job postings now in related data jobs like data engineers it’s almost the same appearing in two of three job postings and for data scientists it’s in almost half now sql or sql is the language used to communicate between you and a database it’s my mostused tool as data analyst starting with my first job working for a global fortune 500 company and even to my most recent role working with mr beast yeah even jimmy uses it this tool is so imperative i use it all the time with python excel powerbi and tableau to connect to my databases so over the years i’ve been cataloging everything i found helpful with using this tool and i put it all into this course now you’re probably wondering who is this course for well if you’re unfortunate to take my first course here’s some items you should definitely know keywords used for data retrieval functions and keywords used for aggregations and grouping the different types of joins and also unions keywords used for logic and conditions along with date 
manipulation data schema control and finally subqueries and cte now as far as the math required to take the course if you have a secondary education such as high school in the united states you have the requisite knowledge to take this we’re going to be doing at most just some basic algebra and statistics now let’s get into the course structure we’re going to be breaking this down into two halves in the first half we’ll have an intro that will get you set up and comfortable with the database we’ll be using throughout the entire course next we’ll jump right in pivoting data using case statements we’ll be transforming and analyzing data using aggregation and also statistical methods then we’re going to get into intermediate date and time functions because frankly you can’t get away from date and time data in databases we’ll then wrap up the first half covering window functions the most requested topic i’ve gotten by far on this covering basic and also complex aggregations now for the second half of the course we’re going to shift gears we’re going to not only install postgress on your machine but also we’re now going to be working in these lessons to building our portfolio project we’ll start by setting up the database locally and installing a top editor for running sql queries with this environment set up we’ll build our first view and this will actually help us solve our first portfolio problem after this we’ll transition into learning the most popular functions to transform messy data to solve our second portfolio project problem and then we’ll wrap all this up with query optimization understanding how to use keywords like explain to optimize queries so by the end of this you’ll have a real world project to showcase your newfound skills and demonstrate your experience now i’m a firm believer in open sourcing education and making it accessible to everyone so this course is completely free i’ve linked all the resources you need below including the sql environment 
and all the different files you need to run the queries remotely and locally oh and also include my final project that you can also model after now unfortunately youtube isn’t paying the bills like it used to so i have an option for those that want to contribute and thus help support fund more future tutorials like this for those that use the link below to contribute you’re going to get some more additional resources specifically after each lesson you’re going to have access to interview level sql problems it will not only reinforce your learnings but also prep you for job interviews in here you’re going to get community access to be able to ask any questions to fellow students along with access to the queries and notes behind each lesson so you can follow right along as i go through it finally at the end you’ll receive a certificate of completion that you can share to linkedin now for those that have bought those supporter resources you’re going to continue watching the course here on youtube but then you can go to my site to actually work through all these problems access to the notes and access to community all right we’re about to get into the first lesson before we do that i want to cover some common questions and answers specifically we’re going to start with this one first what database are we even using well every year stack overflow interviews a bunch of nerds to find out what are their top technologies that they’re using and 50,000 chose that postgress was their top option to use and to use over the coming year and according to this visual it’s not only the most admired it’s the most desired to learn database so for this course we’re going to be using and learning with postgress now that we know the database how the heck are we going to be running these sql commands well as i mentioned previously this course is broken into two halves and we’re going to be using an option for the first half that gets you up and running quick specifically we’re going to be 
    using google collab which is a free option and it allows us to have an environment that we can not only load the database in but also query it i’ve linked this notebook below and includes all the code necessary to install this database and get into querying it now don’t worry if you haven’t used collab before i’m going to break it down all in the next lesson which for those that bought the course perks you’re going to get access to these lessons which are in a jupyter notebook for the second half of the course we’re going to shift gears and we’re going to install postgres locally on your computer and run all the queries from there we’re going to get you set up with pg admin which is postgres’s custom gui in order to interact with databases but from there we’re going to get you set up with the most popular database editor dbeaver which is used by over 8 million users and this is where we’re going to be running our queries and i like this editor because it’s not only free but it also connects to a host of different databases so whatever you use and learn in this course with this editor you can apply to other databases now that we know the database and the editor what data set are we going to be using for this well i present to you contoso and this is a data set created by microsoft used to imitate real business data jumping back into dbeaver we can see the erd or entity relationship diagram and this shows how the data set revolves around sales data that’s the fact table and then we also have four dimensional tables that relate to it this is going to be great for analyzing business transactions in a real world scenario we’re going to go over everything you need to know for this after our google collab lesson now that we got that out of the way let’s get into some resources you have available starting with those that have decided to support the course first i’m going to walk you through how to get access to the course notes which detail all the different topics and
code that i use within each lesson and next you’re going to have access inside of the course platform to interview level sql practice problems after each lesson i’m going to provide you with a bunch of different practice problems that range in difficulty for you to go through and test your skills if you get stuck feel free to jump in the comment section below and talk with other students in the course speaking of help how the heck do you get help in this course well you could jump into the youtube comment section and hope somebody comes and actually answers your question or you can get a really quick answer going to a reputable chatbot like chatgbt i use this bad boy all the time with my coding issues and it gets you an answer quick all right next question well isn’t really a question it’s more of a statement people tell me all the time luke this video is too long i can’t navigate it well unfortunately i think you don’t know how first of all i include chapter markers for all the different lessons throughout this the next is keyboard shortcuts i like to use j and k in order to jump forward or backwards 10 seconds and then finally if you need more precise navigation you can just click and drag up on the navigation bar of the video itself and then you can do precise seeking pretty cool all right last question who helped build this course and i’d be remiss if i didn’t give a shout out to kelly adams she was the brains behind putting together the lesson and also a lot of the practice problems for this this course wouldn’t have been possible without the help of her all right let’s get into the first lesson all right in this lesson we’re going to be going over how we’re going to be running sql queries in this first half of the course using google collab which is a type of jupyter notebook so link below is a blank notebook and opening up it’s not fully blank but it’s blank enough to actually get started with writing sql queries let’s do a quick demo of how we’re going to 
use this for sql queries first i need to run this cell up top and it’s going to give me this warning that hey this notebook was not authored by google it’s fine it’s run anyway it’s from me you can trust me it should take about 40 to 50 seconds to run this cell which we’ll go through more later in this video basically it’s loading the database and getting it set up for us to actually use and now run sql commands so inside this code cell let’s provide a command so we’re going to begin by writing our command underneath this percent sql syntax right here at the top and i’ll provide this query looking into the sales table looking at those top 10 options i can run it by pressing this play button or pressing shift enter in less than a second i have all the different results pictured below if i want to run another cell i can just come underneath it click code make sure that i add that percent sequel to the top of the cell it’s not going to work otherwise and then run my next command that i want to right underneath this all right with that out of the way we’re going to now dive deeper into understanding what is google collab what are jupyter notebooks how to actually use these to run sql queries and what the heck is going on with all that code that i had in those cells now if you have familiarity with using google collab already or already confident in using jupyter notebooks and you feel like any of this that i’m going to cover is not relevant to you it’s fine go ahead skip to the next lesson this is more focused on those that don’t have any background with using jupyter notebooks so let’s start with jupyter notebooks here i have a jupyter notebook of this actual lesson inside of vs code don’t worry you don’t need to actually open inside vs code just showing this for demonstration purposes now personally i love jupyter notebooks for performing analysis because not only can i have these text cells like are pictured right here and then scrolling down even further i can see 
    that at the bottom i have a sql cell along with the sql output so i love these because i can use sql to extract out and analyze the data i need and then if needed use something like python to visualize it now moving into google collab which you can see right here i’m inside my web browser and this is that same exact file that i had inside of vs code but now it’s here inside of google collab and similarly it has that same functionality where i can write python code in cells along with using that sql and the outputs of that below it i really like google collab because it makes it super easy to share and collaborate with others this isn’t just a static document i can come in here and actually run all the different cells inside of this notebook and if somebody wanted to they could come in and modify this query further so right now this one’s only looking at years let’s say we wanted to look at the actual total revenue i could just add this line in the command run it and get the results right below so super easy to collaborate with others and now you may be wondering why are we actually using collab for running these sql commands well basically this code right here that i have in this cell that we’re going to cover i promise allows us to load in our database and for you to have access to the database immediately without having to actually install it locally on your own computer so basically we can get up and running with running all these different sql commands really quickly let’s start with a blank notebook to walk through this process of understanding how to use notebooks if you navigate to colab.research.google.com this is where we’re going to start a new notebook and it will have prompted you to log into google at this point anyway go and click this so it starts this new one which gives the title untitled zero you can go up here and actually change it and i’ll change it something like collab 101 quick overview before diving into the center portion right here we
We have a typical menu up at the top with a bunch of options, and we also have a sidebar on the left-hand side with a lot of options as well. The center is where the actual notebook itself is, and I can add either a code cell or a text cell. If I type into a text cell, "this is a text cell," I can change its formatting by highlighting it and toggling it to a heading; there are multiple other options available as well. Whenever I'm done, I press Shift+Enter, and it starts another cell, a code cell, below it. Now, in Colab these are exclusively Python cells; we have to do some magic, if you will, to get them to run SQL. You may not realize it, but you actually already know some Python: I can type 2 + 2, press Shift+Enter, it runs the cell, and we get the result of 4. If I don't need certain cells, like this one at the top, I just click into it and click the trash can, and similarly for the one down below.

Now let's go over these menus; for this I'll demo using the actual lesson-plan notebook, since that makes it more interactive to show off the capabilities. Over on the left-hand side, if I click this icon, I get a table of contents based on how I formatted all the lesson notes, and I can scroll through and see all the relevant topics. If I want to find something, I can type, say, "markdown," and as expected it takes me to all the markdown sections. There are also panels for variables, secrets, and files; those are more for when you're using Python, and you won't really need them, or the three icons at the bottom, as much in this SQL course.

Up at the top, the File menu area, File, Edit, View, Insert, and so on, is all standard. Runtime is the one menu I use the most and find the most important: anytime I open a notebook, I do "Run all," which also has the shortcut Cmd+F9, and it runs all the cells; down at the bottom it gives a status update of what's going on along with the time taken so far. Scrolling through, I can see all the cells executed properly, but sometimes we run into bugs and they don't run properly. In that case, go into Runtime and I recommend "Restart session and run all"; it prompts you to confirm you really want to do this, then basically clears everything out and runs it again. That's only if you're having problems; you shouldn't, but if you do, now you know.

In this last section of the lesson, let's understand what's going on with how we run SQL queries inside this notebook. For this, open up that blank SQL notebook, load it into your window, and follow along with me; if you haven't already, go to Runtime and click "Run all." As I mentioned earlier, all this code at the top, which is Python, installs and sets up your database. We'll walk through it quickly, but the important thing here isn't the code itself or that you need to write it yourself; it's mainly understanding what's going on behind the scenes. First, it imports some important libraries we need. Next, if we're in Colab, which we are, it installs Postgres, so Postgres is actually running inside this environment. It sets up a user and a password, then installs the database itself, which you can get at the link shown. From there we import a SQL library so we can run SQL commands, specifically JupySQL; with JupySQL we load the extension, actually connect to the database we loaded above, and do some other fancy things that get formatting and everything else set up properly.

So, as before, below the magic command %sql I can write a SQL query. As I write these SQL commands, autocomplete should come up; in this case I have SELECT suggested, and to use it all I have to do is press Tab. Once I have everything I need, I press Shift+Enter again. That magic command is really important: if I copy this and paste it below without it, one, I get all this highlighting saying it's misspelled and has syntax errors, and two, when I actually try to run it, I get actual syntax errors. So it's very important that you put these magic commands at the top. And so people don't think I'm crazy, "magic commands" is the actual official name for this, and we're not limited to the SQL magic; there's a host of other ones. Say I want to use %%timeit, which measures the execution time of the code that follows: I type the magic command, some help pops up explaining the module being used, which is actually pretty useful, and underneath it I can put some Python, something simple like 2 + 2. Running this with Shift+Enter, we can see this special command reports that it took 9.93 nanoseconds. With magic commands you can also use just one percent sign, which means it applies only to the line it's currently on: %timeit 2 + 2 on the same line still times it, a little faster in this case, but with only one percent sign and the code on another line, pressing Shift+Enter just outputs 4 and doesn't time it. It's not until I use the two-percent form and run it that it times the cell, and now we're back up to 9.85 nanoseconds. I'm reinforcing this because it's very important that you remember the %sql before any SQL command. If you're a nerd like me and want to dive deeper into the documentation of JupySQL, the brains behind this SQL magic, there's a link below.

All right, for those who purchased the supporting resources, you now have some practice problems to go through to get more familiar with using Jupyter notebooks and SQL queries together. In the next lesson we'll dive deeper into the database to understand all the different tables and what comes along with them; with that, I'll see you in the next one.

In this lesson we're getting an intro to the database we'll use for the entirety of this course, specifically the Contoso database. We're not only going to explore why we're using this dataset but also its components, exploring all the different tables using things like the ERD, or entity relationship diagram. We're going to use this lesson as a warm-up for intermediate SQL, so during the course of it I'll cover past topics you should know, to get you up to speed as fast as possible if you haven't used SQL in a while. By the end, we'll build a query from scratch to dive into the most popular tables, using Google Colab and some additional AI features to speed up your workflow. Now, the Contoso database we're using is based on a dataset from Microsoft, which they've used for years whenever they launch products, specifically SQL products, so you can explore their functionality. This database is really robust.
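As a quick aside on what that timing magic does under the hood: %%timeit is essentially a convenience wrapper around Python's standard-library timeit module. A minimal sketch (the loop count here is illustrative, not the magic's actual default):

```python
import timeit

# %%timeit roughly does this: run the statement many times and
# report an average time per run.
n = 100_000
total = timeit.timeit("2 + 2", number=n)   # total seconds for n runs
per_run_ns = total / n * 1e9               # average per run, in nanoseconds

print(f"{per_run_ns:.2f} ns per loop")
```

A constant expression like 2 + 2 comes out in single-digit nanoseconds, matching the numbers the magic printed above.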
It contains a lot of different information, such as sales transactions, product information, store details, and even date and time data. It's great because it not only lets us explore all the intermediate SQL topics we'll be using in this course, but it's also based on a real-world business dataset, so what you learn here you can apply to the real world. And you may be like, "Luke, how the heck do I get this database installed?" Well, if you remember from the last video, we have that Python code at the top that goes through and installs the database; the SQL file for loading it is located at the link shown, and we go through that script to load it into the Colab notebook, which I've conveniently linked as a blank notebook below so you can follow along with any of the lessons.

This diagram shows, via the lines between all the tables, how they're actually related. There are a lot of columns inside these tables themselves, so we put ellipses at the bottom to signify all the columns that aren't shown. So let's get into breaking this bad boy down. We have a total of six tables in the Contoso database. Our main fact table is the sales table, and it contains all the quantitative business metrics we'll be analyzing and inspecting throughout the course, so it's probably the most important table you need to know. Then we have four related tables, commonly known as dimension tables; these hold descriptive attributes we can use in our analysis. The store table is related to sales via the store key and has information on, well, the stores; the same can be said for the product and customer tables. The date table is slightly different in that it relates on dates, specifically our order and delivery dates. The last table is the currency exchange table, and it's not related to our fact table at all; we'll show why in a little bit.

Now, you may be wondering how you can actually go through and see what this database looks like and understand what tables are in it. We'll be exploring tools for that later in the course; specifically, this is pgAdmin, where I can visualize that ERD, and it shows how our sales fact table is related to all those dimension tables. It's also pretty nice because I have this contoso_100k database, and I can go into schemas, then down to tables, and further explore all the other tables, even looking at things like the columns of the sales table. But we're getting ahead of ourselves; we'll learn how to do that in a bit, and I'll teach you some shortcuts for doing something similar in Colab.

So let's get into running some queries. First, go through and run all the cells in your notebook, basically getting the database loaded into our environment. We're looking to find out what tables are in the database we just loaded. I'm going to use Gemini for this; if it's your first time using this AI model from Google, it's going to prompt you with a privacy notice, so make sure you click continue. I prompt it with: what SQL query shows the tables in a database? We can access all the table names by looking in information_schema, which is a metadata schema, specifically at the tables view within it. You can either click "copy cell" or "add code cell." Now remember, we'll get all the syntax-highlighting issues because we don't have that magic command we need at the top, specifically %sql, so I'll copy that from before, paste it up top, and then run this bad boy. From this we confirm we do have six tables in our database. If I wanted to, I can convert this DataFrame into an interactive table like this, and we also have an option to visualize it, which we'll be doing later down the road.

Let's explore the sales table first, as that's the most important piece of this whole puzzle. I want to see all its columns, so I'll use SELECT and then star, FROM that sales table. Now, if you're noticing, I have some autocompletion happening right now: I typed "sales" and it suggested an "_fact" underneath. This is the AI autocompletion, and especially when I'm learning SQL, I don't find it very helpful, actually quite distracting, so let's turn it off real quick: go into Open Settings, and under AI assistance uncheck the option "show AI-powered inline completions." Once we close this, it no longer pops up. Now, this query is good enough as is, but anytime I do a SELECT star type of thing it's resource-intensive, especially with lots of columns and rows, so I'm going to LIMIT this to just the first 10 rows, then press Ctrl+Enter. I also don't need this query off to the side, so I'll close that. With this sales table we can see all the relations to the other tables, the dates, customer key, store key, and product key, and from there we have information on what was actually sold in each sale: specifically the quantity, the price, the cost, and the currency used with its exchange rate. In our example at the end we'll calculate the net revenue and see how we need to multiply or use all of this together. Let's keep exploring these tables, starting with the easiest one next: the currency exchange table.
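Circling back to that table-listing query, here's a minimal runnable sketch against a toy stand-in database. It uses Python's built-in sqlite3, so the metadata catalog is sqlite_master rather than Postgres's information_schema.tables that the lesson queries, and the table schemas here are invented for illustration:

```python
import sqlite3

# Toy stand-in for the Contoso database: table names follow the lesson,
# but these schemas are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales    (orderdate TEXT, customerkey INTEGER, productkey INTEGER,
                       quantity INTEGER, netprice REAL, exchangerate REAL);
CREATE TABLE customer (customerkey INTEGER, givenname TEXT, continent TEXT);
CREATE TABLE product  (productkey INTEGER, productname TEXT, categoryname TEXT);
""")

# Postgres: SELECT table_name FROM information_schema.tables
#          WHERE table_schema = 'public';
# SQLite keeps the same metadata in its sqlite_master catalog.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['customer', 'product', 'sales']
```

The point is the same either way: the database describes its own tables in a queryable catalog, which is exactly what the Gemini-suggested query reads.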
As mentioned, the currency exchange table is in no way related to that sales fact table, but what the heck is in it? Well, exploring it, we can see it has a date column, a from-currency, a to-currency, and an exchange rate: basically, at a specific point in history, it tells you the rate you'd use to convert one currency to another. Now, conveniently, our sales table already includes the exchange rate calculated from this table, so technically this table is only needed if you want to go back and dive into how the exchange rate is trending over time.

All right, we have four tables left, and they're all dimension tables related to our sales table. Let's start with store first: it's related to the sales table on the store key, and it has information on where each store is located, such as country, the name of the store, even its size. Next up is our product information, related to the sales table on the product key; it has information on each product, specifically its name, who the manufacturer is, how much it even weighs, and what categories and subcategories it falls into. Next up is our customer table, related to the sales table on our customer key, with a bunch of information about the customer: where they're located, what their name is, what their birthday is, and so on. What you'll notice right here in the middle is the ellipses; there were so many columns in this table that they didn't all show.

Now, previously, when we were looking for the tables in the database, we ran a query on the metadata inside information_schema. So what I'm going to do is take that query, Cmd+C it, and paste it right in here, but instead of tables we want to use columns. Running Shift+Enter on this only gives us table name information, so I'll change it to SELECT star and run Shift+Enter so we can see everything available in this query. Inside this columns view we have a table name and a column name, so I can now filter for the customer table: I'll specify WHERE table_name equals 'customers'... running this with Ctrl+Enter, I got a typo; it's 'customer'. Running Ctrl+Enter again, I now have a way to view all the column names without them being cut off, so we can see everything inside it. Nothing that great in here for now, but we'll use other parts of it in the future.

The last table to explore is the date table, which relates via its date column to the sales table's order date and delivery date. This table gives you a lot of pre-built ways to aggregate date data, by day of week, month, or year, which is great if you're using a tool like Power BI and just want to quickly filter for, say, January 2015 data. But in this course we're going to dive deeper into date functions, and you won't always have a date table available to investigate things, so we're not going to rely on this table at all; basically, just ignore this bad boy.

Now let's wrap up this lesson with an investigation of how we can use all these tables together in a common example. Say my boss, who's not so good at SQL, comes to me and wants some revenue data that has information about customers and the products they're purchasing, and whether those are high-value or low-value items. We'll walk through this example, calculating the net revenue and putting it all together using the different tables. The first thing we need to do is calculate net revenue, so let's look back at the sales table; using that same query as before, we get the table we saw previously.
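The column-listing trick from a moment ago can be sketched the same way. In Postgres you'd query information_schema.columns with WHERE table_name = 'customer'; SQLite's closest equivalent is the pragma_table_info() table-valued function. The columns here are a small invented subset of the real customer table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative columns only; the real Contoso customer table has many more.
conn.execute("CREATE TABLE customer (customerkey INTEGER, givenname TEXT, "
             "surname TEXT, countryfull TEXT, continent TEXT)")

# Postgres: SELECT column_name FROM information_schema.columns
#          WHERE table_name = 'customer';
# SQLite: the pragma_table_info() function returns one row per column.
cols = [row[0] for row in conn.execute(
    "SELECT name FROM pragma_table_info('customer')")]
print(cols)  # ['customerkey', 'givenname', 'surname', 'countryfull', 'continent']
```

Either way, you get an uncut list of column names even when the interactive preview truncates wide tables.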
Now, how do we actually want to calculate that net revenue? We need to use the net price. You'll notice the net price is less than the unit price; that's because the net price is the price after all the discounts, promotions, or other adjustments, basically what we actually charge the customer when they pay for the product. We need to multiply that net price by the quantity: I'll put a comma here, go to the next line, multiply the quantity times the net price, and label it net_revenue. When I name new columns, I put an underscore between words; I just find it easier to read, and it matches the naming convention the Contoso database uses.

And it looks like I have a typo, which is actually pretty good timing, because this is how I'm going to show you I troubleshoot. First, it reports a runtime error; anytime you're running a query you can ignore that part, you'll be seeing it all the time. But it also points a caret at where the issue is, specifically this line, and it has to do with "quantity" not being spelled correctly at all. Running this again with Ctrl+Enter, okay, we have it now, and over to the side we have that net revenue. Double-checking, it looks like all the numbers are getting calculated correctly.

There's one last step to do: we need to convert to a common currency. Right now you can see some rows use pounds and others US dollars; basically, we want it all the same, and since I'm in America, we'll use US dollars. All we have to do is also multiply by the exchange rate. I've gone ahead and added it in, and now we can see it is in fact adjusted to what it needs to be. Next we'll be adding customer and product information.
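The net-revenue arithmetic can be sketched end to end with a couple of invented rows (the column names follow the lesson's sales table; the values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quantity INTEGER, netprice REAL, exchangerate REAL)")
# Two toy rows: one already in USD (rate 1.0), one converted into USD at 1.25.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(2, 100.0, 1.0), (1, 50.0, 1.25)])

# net revenue = quantity * net price, then normalized to one currency
# by the exchange rate.
rows = conn.execute("""
    SELECT quantity,
           netprice,
           quantity * netprice * exchangerate AS net_revenue
    FROM sales
""").fetchall()
print(rows)  # [(2, 100.0, 200.0), (1, 50.0, 62.5)]
```

Multiplying all three factors in one expression is exactly what the lesson builds up, first quantity times net price, then the currency adjustment.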
We'll pull those in using the customer key and product key, but this table's already getting sort of large, so I want to condense it down to the columns I'm sure I'll use; really, the only other thing I care about is the order date, so we'll go ahead and simplify the table down to that.

Next we move into the second of our five steps: filtering for recent sales, specifically 2020 and later. For this I'll use a WHERE clause requiring that the order date from the sales table be greater than or equal to January 1st, 2020. Let's go ahead and try running it, and it looks like it works. Now, to be safe, if you're ever working with date data and you're not sure it was converted to the date type, specifically in Postgres, you can use the double-colon operator and specify the data type you want, in this case date, so the order date gets cast as a date. It'll work just fine either way, but that's a tip for you.

All right, the next thing my boss wants added in is the customer info about who placed each order. To do this we need a join, and there are four major types: LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL OUTER JOIN. In our case we want a LEFT JOIN, because table A is our sales table and we want any data related to those sales rows returned from table B, the customer table. Let's add this LEFT JOIN between FROM and WHERE: we join the customer table, giving it the alias of just c to make it easy, and similarly I want to alias sales as well, so I'll bring it down, indent it over, and give it the alias s. For this LEFT JOIN, the condition is that the sales table's customer key equals the customer table's customer key. Using good naming conventions, I'll add the s-dot prefix to order date, along with quantity, net price, and exchange rate. I'll run this to see if it works, and it does, but we don't have anything from the customer table yet, so I'll bring in all its columns with c-dot-star notation. From that list, there are a few columns my boss told me about that we want: specifically the given name, or first name, the surname, the full country, and the continent they're from.

Second-to-last step: we need to add that product information in, and similarly we'll perform a LEFT JOIN. We'll give the product table an alias of p and connect it on the product key of the sales table and the product table. Once again I want to see everything from the product table, so I'll do p-dot-star; running this with Ctrl+Enter, I can see we connected it properly with all the product information. Once again I don't want all the columns, only a select few my boss asked for, specifically these four: product key, product name, category name, and subcategory name.

All right, looking pretty good; only one last step and we'll have all the information we need. We want to look at whether a purchase is high value or low value based on the net revenue, basically binning on whether it's less than $1,000 or greater than $1,000. To accomplish this we need a CASE WHEN statement, which we'll add as the last column. We want to test the net revenue, but we can't use an alias inside the same SELECT statement where it's defined, because it's not defined yet; so we take that full expression from below, paste it into the CASE WHEN, and check whether it's greater than 1,000.
When it is, we say it's 'high'; otherwise, 'low'. We END the statement and give it the alias high_low, real original, I know. Let's run this with Ctrl+Enter: inspecting it, we can see our formula works, and the values greater than a thousand are marked high. This now has everything my boss needs, but remember, right now we're doing LIMIT 10 and we actually want all the data, so I'll remove it and press play; looks like we have about 124,000 rows. If I want to export this for my boss, I can click here to convert it into the interactive kind of table, and what's really convenient is that I can then copy the entire table, which lets us export it as CSV, JSON, or even Markdown. CSV is the most common, so I'll use that.

All right, that's our initial dive into the Contoso dataset. You now have some practice problems to go through to get even more familiar with this data by working through some problems. In the next chapter we'll be diving into using the CASE statement to pivot data, super exciting; with that, see you in the next chapter.

Welcome to this chapter on pivoting with CASE statements. Specifically, we'll be using CASE WHEN together with aggregation to pivot data. But what the heck is pivoting data? Let's take a look at this simple example, focusing on the first table. Typically our data comes in a long format; in this case we have columns of date, category, and sales, with two categories, A and B. It's very common to pivot on something like the category of A and B, so that we get the wider format shown below. This is not only easier to read, understand, and analyze, but also easier to visualize, which we'll be doing in this chapter. So what will we cover in this chapter's lessons?
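Before we pivot, the boss report we just assembled, two LEFT JOINs, a 2020-and-later filter, and the high/low CASE, can be sketched end to end against a toy schema (tables and values invented for illustration; the real Contoso tables carry many more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales    (orderdate TEXT, customerkey INTEGER, productkey INTEGER,
                       quantity INTEGER, netprice REAL, exchangerate REAL);
CREATE TABLE customer (customerkey INTEGER, givenname TEXT, continent TEXT);
CREATE TABLE product  (productkey INTEGER, categoryname TEXT);
INSERT INTO sales VALUES ('2021-03-05', 1, 10, 20, 80.0, 1.0),
                         ('2019-07-01', 2, 11,  1, 30.0, 1.0),
                         ('2022-01-15', 2, 11,  3, 40.0, 1.0);
INSERT INTO customer VALUES (1, 'Ada', 'Europe'), (2, 'Grace', 'North America');
INSERT INTO product VALUES (10, 'Computers'), (11, 'Audio');
""")

rows = conn.execute("""
    SELECT s.orderdate,
           c.givenname,
           p.categoryname,
           s.quantity * s.netprice * s.exchangerate AS net_revenue,
           -- the alias net_revenue can't be reused here, so the
           -- full expression is repeated inside the CASE
           CASE WHEN s.quantity * s.netprice * s.exchangerate > 1000
                THEN 'high' ELSE 'low'
           END AS high_low
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    LEFT JOIN product  p ON s.productkey  = p.productkey
    WHERE s.orderdate >= '2020-01-01'
    ORDER BY s.orderdate
""").fetchall()
print(rows)
```

The 2019 row is filtered out, and the two remaining rows come back tagged 'high' (1600.0) and 'low' (120.0), the same shape as the lesson's 124,000-row result.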
In the first lesson, we'll focus on understanding the basics of using aggregation methods such as COUNT and SUM to pivot data: we'll use COUNT to analyze the number of customers per region, and SUM to calculate net revenue across different categories and years. In lesson two we'll build on this further and look at statistical functions such as MIN, MAX, MEDIAN, and AVERAGE, working an example of calculating the median sales across categories. Finally, in lesson three we'll jump into advanced use cases of CASE statements, specifically segmentation: we'll learn to analyze with multiple AND conditions to bucket certain years based on revenue, and similarly use multiple WHEN conditions to analyze different revenue tiers and see how they apply across categories. Now, I just showed a bunch of visuals, and the goal of this course is not learning how to build visuals, though I will show some; really, I want to show that with the insights we're gaining, you can take it a step further and visualize them. All right, with that, let's get into it.

In this first example we'll do a review of COUNT, and also DISTINCT COUNT, to calculate the total number of customers per day in 2023; this is the final table we'll end up with. As always, if you want to follow along, open up that blank SQL notebook and run all the cells so we can get started. Remember, we want the total number of customers per order date, and we can uniquely identify customers by the customer key. So: add a SELECT statement with order date followed by customer key, FROM that sales table. Let's start with this first. As we can see, for January 1st, 2015 we have duplicate customer keys, but we also have a bunch of different ones. We'll start simple and just COUNT all the customer keys: I'll wrap customer key in COUNT and give it the alias total_customers. Let's run this, and it isn't going to work, right, because, well, we need a GROUP BY. So we add the GROUP BY statement by that order date and run it again. All right, now we have the order date by total customers, but I'm noticing the dates are not in order, so I'll add an ORDER BY order date. Not too bad, but remember, previously we saw that the customer key is duplicated, and we want to find unique customers, so we want something like DISTINCT. Going back up into our original query, all I do is add DISTINCT inside the COUNT, press Ctrl+Enter, and now those numbers drop, because we've removed all those duplicates. The last thing to do is add a WHERE condition filtering for dates in 2023. I recommend using the keyword BETWEEN so we don't have to deal with that greater-than, less-than mess: BETWEEN January 1st, 2023 AND December 31st, 2023. Running this, we can check the contents: yep, January 1st through December 31st.

One quick note on visualizing this: you can use this button and select it to draft different visualizations to try to understand what's going on with the data. It gives you different previews; in our case this is time-series data, so I know that's the best choice for the visualization. When I select it, it autogenerates all the Python code you need to visualize the data; all you have to do is click "add cell," run it, and you can see the chart in more detail right below. That's one reason I really like Colab.
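The daily distinct-customer count can be sketched with sqlite3 and a few invented rows, one duplicate visit and one out-of-range row, to show what DISTINCT and BETWEEN each contribute:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2023-01-01", 1), ("2023-01-01", 1),   # same customer twice in one day
    ("2023-01-01", 2), ("2023-01-02", 3),
    ("2022-12-31", 4),                       # outside the 2023 filter window
])

rows = conn.execute("""
    SELECT orderdate,
           COUNT(DISTINCT customerkey) AS total_customers
    FROM sales
    WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY orderdate
    ORDER BY orderdate
""").fetchall()
print(rows)  # [('2023-01-01', 2), ('2023-01-02', 1)]
```

Without DISTINCT, January 1st would count 3; without the BETWEEN filter, the 2022 row would sneak in.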
With Gemini implemented right into it, Colab makes it super simple to just go forward and visualize this.

All right, let's now get into actually pivoting using COUNT as the aggregation. We're looking at something similar to the last example, understanding how many daily customers we have, but broken down by region, specifically three continents: Europe, North America, and Australia. We'll end with a final table where we have order date in the leftmost column and the customers for each region in their own individual columns. First things first, though: what continents do we actually have available inside our database? That's under the customer table, and when we run the query we can see, as previously reported, we've got Europe, North America, and Australia. So let's go forward with adding this table into the query we made in that last example. To do that we're going to perform, very commonly, a LEFT JOIN with the customer table, aliased c, on our customer key. Because we now have two tables in here, we'll also need to assign an alias to our sales table and prefix all the columns that come from it. Running this to make sure there are no errors... I accidentally messed up order date; run it again; okay, everything's working fine now.

Next we need to create individual columns for total customers based on continent. How are we going to do this? Focus on this syntax: we use the COUNT DISTINCT we used previously, and inside it we throw in a CASE WHEN statement, CASE WHEN a condition, THEN the output we want, the column in this case, then END, and finally an assigned alias. I'll copy that template and insert it on the next line underneath, then fill it out: the condition is checking whether the continent from the customer table equals a certain continent, in this case 'Europe', and the column we want from this is the customer key, so I'll put that in, and this one gets the alias eu_customers. Let's try this bad boy out, and bam, now we have our European customers in here. Let's add the other two as well, for North America and Australia. All right, I've got those in, with the North American and Australian customers; running this and scrolling down, we can see that Europe, North America, and Australia do add up to the total customers on each line. That makes the total_customers field somewhat redundant, so I'll go ahead and remove it, and this is our final query.

Now, on visualizing this one: compared to the last example, this one has multiple columns, and the visualizations it suggests aren't that good; specifically, in the time-series previews, one chart is Europe, one is North America, and one is Australia, and they're not all on the same graph. So unfortunately Gemini is not that strong at producing graphs in this case. If you really want to visualize it, my method is: click into that interactive table, then remember you can copy it, which copies the table to your clipboard as CSV. It's pretty long, so you'll want to put it into a document: on Mac I'll use something like TextEdit, on Windows you'd use something like Notepad. I'll paste the contents in with Cmd+V and save it.
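The region pivot built above can be sketched the same way. The key move is COUNT(DISTINCT CASE WHEN … THEN customerkey END): rows from other continents fall through to NULL, which COUNT ignores, so each column counts only its own region. Toy data again:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales    (orderdate TEXT, customerkey INTEGER);
CREATE TABLE customer (customerkey INTEGER, continent TEXT);
INSERT INTO sales VALUES ('2023-01-01', 1), ('2023-01-01', 1),
                         ('2023-01-01', 2), ('2023-01-02', 3);
INSERT INTO customer VALUES (1, 'Europe'), (2, 'North America'),
                            (3, 'Australia');
""")

rows = conn.execute("""
    SELECT s.orderdate,
           COUNT(DISTINCT CASE WHEN c.continent = 'Europe'
                               THEN s.customerkey END) AS eu_customers,
           COUNT(DISTINCT CASE WHEN c.continent = 'North America'
                               THEN s.customerkey END) AS na_customers,
           COUNT(DISTINCT CASE WHEN c.continent = 'Australia'
                               THEN s.customerkey END) AS au_customers
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY s.orderdate
    ORDER BY s.orderdate
""").fetchall()
print(rows)  # [('2023-01-01', 1, 1, 0), ('2023-01-02', 0, 0, 1)]
```

Note there's deliberately no ELSE branch: an unmatched CASE yields NULL, and COUNT skips NULLs, which is what turns one long column into several wide ones.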
just save it inside your favorite chatbot in my case i really like chat gbt you could use gemini claude whatever i’ll give it the simple prompt with the actual document of visualize this as a line chart and then with it visualized we can actually i like going in this interact mode on chatgpt we can actually go through and you can see both the all three of these regions along with visualize if you want to download the graph you can just click it there last example for this lesson for this one we’re going to be looking at using the sum function with case when in order to look at what is the total revenue by category and we’re going to be using that case when in order to look at 2022 verse 2023 this is the final table that we’ll be creating where we have category in the leftmost column and then we’ll have the total net revenue for 2022 and then for 2023 right next to it now for this i don’t want to start from scratch so i’m going to do take this last query that we took right here and then paste into cell make sure it runs properly all right now for this i want to just start simple i want to first look at what is the total revenue by order date so i’ll just start by first removing we’re going to be done in 2022 and 2023 removing this wear clause also we don’t need this customer table so i’ll remove this as well along with these this count distinct that we did for all the different customers with it just simply like this i’ll just start run it and make sure yep everything’s appearing it’s got all the different order dates in it okay now let’s get the total revenue and that’s going to be done by using sum now if you remember from a couple of lessons ago you need three things for this quantity net price and then also exchange rate so i’ll add all three of them here using multiplication and i’ll assign this as the alias of net revenue um remember anytime we’re doing an aggregation need to have that group by let’s go ahead and run this not bad but right now we’re 
aggregating it by order date, and we actually want to break this down by category. Just as a refresher (you don't need to run this query), what we need from the product table is right here: this category name. So, one, we're going to need to merge this with our sales table, and two, extract out that category name. Inside our original query we'll do a LEFT JOIN, connecting in the product table with the alias p, on specifically that product key. We'll run this to make sure it's still executing properly. Okay, good. We didn't bring anything in from the product table yet, so let's do that now: we want to replace this order date with category name, and it appears in three places, so I'm going to show you a shortcut real quick. I'll highlight this (right now you can see it's only selected on the top one) and press, on Mac, Cmd+Shift+L; on Windows you'd press Ctrl+Shift+L. Now all of them are selected: when I press backspace all of them are removed, and as I type, all of them get typed in. Super convenient, saves a lot of time. Then I'll run this again by pressing Ctrl+Enter and bam, now we have category name and net revenue for each. This is net revenue across the entire data set, so we still need to filter down, but it's pretty good so far.

Now, to break this up by 2022 and 2023, we need a syntax similar to what we used before. Specifically, wrapped in our SUM function we'll use CASE WHEN: when a row meets a certain condition (being in 2022 or 2023), we'll provide it the net revenue; ELSE, if it's not that year, it's going to be zero. So when we sum it all up, we only sum rows from that year. Finally we'll END it with an alias. I'm going to copy this and paste it down below. Our first condition checks whether the date is in 2022, so for this we're going to use that order date column and check that it's BETWEEN certain dates, specifically between January 1st, 2022 and December 31st, 2022 (had a brain fart there). For the value we'll use what's above here inside our net revenue: all three of those columns multiplied together. And finally, for the alias, we'll call it total net revenue 2022. Let's run this and bam, we have it for 2022; this is looking good. Let's put a comma on the end, copy this, and paste it right below; then I just need to update it so that instead of 2022 we're using 2023, making sure to also change the alias. Okay, running this with Ctrl+Enter, we get almost our final results. Once again, we don't need that plain net revenue column in there (it's not telling us what we need), so we'll remove it. Now we have our final table, and from it we can see that, for some strange reason, every category's revenue went down from 2022 to 2023. That's not really good; I'll leave that to my boss to figure out.

All right, it's your turn now. We have some practice problems lined up to get you more familiar with using CASE WHEN statements to pivot data with these different aggregation methods. In the next lesson we'll build on this, focusing on statistical functions such as min, max, average, and median, and diving further into revenue. With that, see you in the next one.

Now that we're warmed up using basic functions to analyze pivoted data, we're going to shift our focus to statistical functions. We'll warm up with the easy ones first, average, min, and max, in order to pivot the database and pull out some data insights, and then from there use the percentile_cont (percentile continuous) function.
That function will let us analyze the median revenue of sales, and we'll continue the same trend of analyzing across all these different categories to see which is the highest-performing one.

So where can we find out what statistical functions are available to us? We go to the source documentation at Postgres: they have a section covering all the aggregate functions, which includes the statistical functions. Scrolling down we can see MAX and MIN, which we'll be using shortly; these find the maximum or minimum value of an expression across all non-null input values. Similarly, there's a whole host of in-depth statistical functions: correlation, r-squared, standard deviation, even variance.

Let's get into analyzing with MIN, MAX, and AVG. I need you to start up a blank notebook to work with. What are we analyzing? If you remember back to last lesson, we calculated the total net revenue by category, broken down for 2022 and 2023. We're going to take a very similar approach to keep it simple, so if you still have that query you can just copy it, as we'll be reusing and modifying it to apply these new functions. Inside my blank notebook I'll paste it in and run it with Ctrl+Enter.

The first one we'll try is AVG, which finds the arithmetic mean of all non-null input values. Pretty simple. We'll keep this query mostly the same, but instead of SUMs we'll perform averages, and because of that I need to rename the aliases appropriately, to things like avg net revenue 2023. I'll press Ctrl+Enter and bam, now we have our average values. This is pretty neat, because if you remember, computers had the highest total revenue, yet home appliances have the highest average net revenue. If you want to visualize this, you could click that graph button and try to chart it below, but since this is categorical data I don't find the built-in graphs that good. Because this table is so small, I sometimes just copy the entire contents, paste it right into the AI chat (it doesn't take up much space), and prompt it to visualize the data. And bam, we can see the average values for these categories across the years and compare them, seeing things like computers actually being lower in 2022 than in 2023.

Now that we understand AVG, let's explore MIN and MAX. It's a very similar syntax, so I'll use some AI to automate it: I copy the query, open up Gemini, give it the prompt "add in MIN and MAX statements similar to these AVG statements," and paste the query below. Let's see if it can actually do this. Expanding the response to inspect it, it looks like it got it done; no need to be repetitive ourselves when a repetitive task can go to AI. I'll insert this into the notebook and close out of Gemini, and then I have all of these syntax errors, because remember, we're missing the %sql magic command at the front. Adding that and pressing Ctrl+Enter, bam: one table with not only the average but also the min and the max, all formatted and typed out correctly. Pretty neat.

So let's crank this up a notch and do a similar analysis using the median. For those not familiar: if you take a list of numbers and sort them in order, the median is the middle number.
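Before we dive into the median, here's a recap sketch of that AVG/MIN/MAX version of the pivot, with the same Contoso-style schema assumptions as before. One deliberate choice of mine here: I drop the ELSE 0, so off-year rows become NULL and are ignored by AVG, MIN, and MAX rather than dragging the average toward zero.

```sql
SELECT
    p.categoryname,
    AVG(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
             THEN s.quantity * s.netprice * s.exchangerate END) AS avg_net_revenue_2023,
    MIN(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
             THEN s.quantity * s.netprice * s.exchangerate END) AS min_net_revenue_2023,
    MAX(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
             THEN s.quantity * s.netprice * s.exchangerate END) AS max_net_revenue_2023
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY p.categoryname;
```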
For example, when we have seven numbers, the middle (fourth) one, here six, is the median; whereas with eight numbers we take the average of the fourth and fifth numbers, say four and five, and the median becomes 4.5. The median is extremely important when you're working with data. In this case we're looking at a salary distribution (this is from my Python course): the salary counts go up and then come down, but the tail stretches out, with high outlier values way past 350,000. If we just use the average, it gets pulled toward a higher number and isn't representative of the actual data. The median fixes this by sorting all the values and taking the middle one, giving a more representative number of what you'd expect to see; in the salary case, you wouldn't want to expect a higher salary when you know you're going to get a lower one.

Now you may be thinking, "this is easy, all I have to do is change these AVG calls to MEDIAN and run it." Unfortunately there's no such function as MEDIAN, hence the "no function matches the given name and argument types" error, and that's common across databases, whether you're using Postgres, SQL Server, or MySQL. Instead we use percentile_cont, or percentile continuous, and that "continuous" portion is a key part. This function is not only an aggregate function but an ordered-set aggregate function. What does that mean? Not only do we call percentile_cont with the fraction we want (the median, or half, is 0.5), we also have to use the extra syntax of WITHIN GROUP and then ORDER BY.

Let's break that syntax down. Percentile_cont needs to be given a list of ordered values; it won't sort the values itself and pick one out like other aggregate functions do. Because of this, we first have, in parentheses, an ORDER BY column: it specifies how to order the values we'll be picking the median from. But that only sorts the values; we still need to bind them to the percentile_cont function, and that's what the WITHIN GROUP portion does, binding that ordered set, if you will, to the function.

Let's take a simple example using net price. I have a simple SELECT of net price from the sales table, and it looks like we have almost 200,000 rows. Let's get the median. I'll start with percentile_cont (notice there's also percentile_disc, the discrete version: if you don't want the two middle numbers averaged and instead want an actual value from the data picked, use percentile_disc; I mostly stick to continuous). We're finding the value at the 50th percentile, hence 0.5. Then we bind what we're ordering, using WITHIN GROUP and, in parentheses, ORDER BY net price, and assign an alias of median price. Let's run this bad boy, and bam, we get a median price of $191. Out of curiosity I'll compare it to the average net price, and, similar to what we saw with that salary data, the average is much higher. That's because high-value items that aren't purchased as often drive the average up; the median gives a much more representative picture of the common net price people are seeing.

So what are we calculating in this last example? We want the median sales by category, comparing 2022 to 2023. Notice I'm saying "sales," and I'll use that term more frequently, but technically this is net revenue.
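The median-versus-average comparison from a moment ago can be written as one query (a sketch, assuming a netprice column on the sales table):

```sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY netprice) AS median_price,
    AVG(netprice) AS avg_price  -- typically higher: pulled up by expensive, rarely bought items
FROM sales;
```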
On the business side we typically just say "sales." Now, I don't like to start from scratch, so I'm going to work with that very last query where we found the average, min, and max for the different categories. I'll copy it and paste it into a new code cell. Remember, we're not using the AVG, MIN, or MAX anymore, so I'll remove those. Let's start by getting the median net revenue, or sales, across all the years, and we'll filter down to 2022 and 2023 after. I'll define the median by specifying percentile_cont, use the binding clause WITHIN GROUP, and then ORDER BY on the net revenue, which is quantity times net price times exchange rate, giving it the alias median sales. All right, let's see if this bad boy works. Ctrl+Enter. Not bad: we have median sales for our categories.

But fast-forwarding, remember we want columns for 2022 and 2023. Starting with 2022, I'll note "year 2022" here so we know what we're working with. Basically we want to provide, right here inside the parentheses, values filtered down to 2022, and for that we need a CASE WHEN statement. I'll add CASE, press Enter, indent, and fill in: WHEN the column meets a condition, THEN we want the net revenue value, and after that we END it. So we need to fill in that condition: we want to verify the order date is in 2022. We'll reference the order date, use the BETWEEN keyword, and specify January 1st, 2022 through December 31st, 2022; when that holds, the expression equals the net revenue values. For good measure, to keep it readable, I'll wrap this in parentheses. Just to be clear: you can see this ORDER BY (the pink parenthesis right here) is running a CASE statement to determine whether rows fall within a certain date range; if so, the expression equals the net revenue, otherwise it has no value. Let's run this. All right, not too bad.

I want to add in 2023 now, and I don't feel like retyping everything, so I'll use Gemini: I paste in the code with the prompt "add in 2023 also, under this," and it looks like it did it correctly. I'll insert it, double-check it (yeah, looking good), and running this final query we have the median sales for 2022 and 2023.

Taking it a step further and actually analyzing this, comparing the median sales to the total net revenue (which is also total sales), we see some interesting insights. Specifically, for computers the median sale went down from 2022 to 2023, and correspondingly, the total sales of that same category went down. Maybe that's something you could bring up to the SVP of computers.

All right, you now have some practice problems to get more familiar with these statistical functions applied with pivoting and CASE WHEN statements. In the next lesson we're getting into advanced segmentation, learning how to use keywords like AND and WHEN to break the analysis down even further. With that, I'll see you in the next one.

Welcome to this last lesson on using CASE statements to pivot data; here we're going into advanced segmentation. So what is segmentation? It's a really important data analytics concept for taking large data sets and breaking them into smaller pieces in order to analyze different behaviors. As a data analyst I apply this concept all the time when working with large data sets, so I can dive deeper into the details and understand different behaviors.
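To recap before moving on, the median-by-category-and-year query we just assembled might look like the sketch below (same Contoso-style schema assumptions as earlier). The trick is that a CASE with no ELSE yields NULL, and percentile_cont ignores NULLs, which is what makes the per-year filtering inside the ORDER BY work.

```sql
SELECT
    p.categoryname,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (
        CASE
            WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31'
            THEN s.quantity * s.netprice * s.exchangerate
        END
    )) AS median_sales_2022,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (
        CASE
            WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
            THEN s.quantity * s.netprice * s.exchangerate
        END
    )) AS median_sales_2023
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY p.categoryname;
```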
So how are we going to do this? We'll start off easy. In our first example we'll learn how to use the AND keyword within a CASE WHEN statement to evaluate multiple conditions. For the net revenue of all those eight categories, we'll segment by year, 2022 or 2023, along with whether an order is high or low value relative to the median price: mainly, looking at the net revenue for orders below the median value, and the net revenue for those above it. For the second and final example, we'll look at using multiple WHEN clauses within a single CASE block. This is particularly important whenever you need different outcomes for different conditions; in our example we'll break the revenue into multiple tiers. So instead of just orders below and above the median, we'll take it a step further and bucket orders by which percentile range they fall into.

Before we get into both examples, I want to quickly demonstrate the two concepts. The first: we can use AND to combine multiple conditions within a CASE WHEN statement, written simply as condition one AND condition two, using that AND keyword. For this we'll look at two things, quantity and net price. I'll run this query, and alongside the order date we can see the quantity and the net price. What we want is a new column classifying whether an order is a "high value order," meaning the quantity is greater than or equal to two AND the net price is greater than or equal to, say, $50; if not, we'll just call it a "standard order." So, inserting a new line underneath the SELECT statement, starting our CASE statement and indenting in for the WHEN: we check two things, that the quantity is greater than or equal to two, and, adding the AND keyword, that the net price is greater than or equal to 50. When both conditions are met we categorize it as a high value order, ELSE a standard order; then we END the CASE and give it the alias order type. Running this with Ctrl+Enter, we can now see the multiple conditions at work: rows with a quantity of at least two and a net price of at least $50 are categorized as high-value orders.

The second concept: we can use multiple WHEN clauses within a single CASE block. There's no limit to the number of WHENs; each WHEN needs a THEN specifying its result, and you usually finish with some sort of ELSE. I use this multiple-WHEN approach whenever I have to break things into several categories. We'll follow the same quantity-and-net-price approach, but with finer categories. Previously we just had high value orders (quantity of at least two and a price of at least $50) with everything else standard; now we classify more precisely: a "multiple high-value item" is a quantity of at least two, with the price threshold raised to at least $100; we also check for "single high-value items," those at $100 or more but with a quantity below two, so one; then "multiple standard items," a quantity of at least two but a price under $100; and everything else will be a single standard item.
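Assembled, that tiered classification might look like the sketch below. Order matters: WHEN clauses are evaluated top to bottom and the first match wins, which is why the combined quantity-and-price condition has to come first.

```sql
SELECT
    orderdate,
    quantity,
    netprice,
    CASE
        WHEN quantity >= 2 AND netprice >= 100 THEN 'Multiple High-Value Items'
        WHEN netprice >= 100 THEN 'Single High-Value Item'  -- quantity must be 1 to reach here
        WHEN quantity >= 2 THEN 'Multiple Standard Items'   -- price must be under 100 to reach here
        ELSE 'Single Standard Item'
    END AS order_type
FROM sales;
```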
We can build on the query we already used. One thing to note: like I said, we changed the threshold, so this is no longer 50, I'll change it to 100, and call this one "multiple high value item." Next we enter a new line, add our WHEN statement, and check whether the net price is greater than or equal to 100; THEN, in that case, it's a single high-value item. Next, WHEN quantity is greater than or equal to two, THEN it's multiple standard items. And finally the ELSE is a single standard item. Okay, let's run this bad boy, and scrolling down we can see it appropriately classified rows based on these multiple conditions. Pretty neat.

All right, if there's anything you take away from this video, it's those two concepts, because we're now getting into more technical examples demonstrating how you'd use this in the real world. As always, I like to start with what we're aiming to achieve: we'll continue our category analysis, but not only break it down by year, revenue in 2022 versus 2023; the reason we need this AND condition is to segment it further into low revenue and high revenue. What do I mean by that? In the last lesson we looked at the median value of a single order; basically, we want the total net revenue for orders below the median (thus "low"), and then for those above the median (thus "high"). Technically, as we showed last lesson, each category has its own median, but to understand how this AND condition works we'll keep it really simple at first, calculating a single median value across all categories and applying it to all of them.

So let's calculate that median. We'll start with SELECT, then percentile_cont, then WITHIN GROUP as our bridge to the ORDER BY on the net revenue calculation of quantity times net price times exchange rate, naming it "median." We need this FROM the sales table, and since we're working between 2022 and 2023 we'll add a WHERE clause as well, specifying the order date falls between the start of 2022 and the end of 2023. I get a syntax error because I referenced the alias s for sales without defining it; we'll be using that alias later, so we'll keep it. I'll add the s, run it again, and we get a median value of 398.

So remember the goal: calculate the revenue below the median and the revenue above it. As a reminder, with that median of 398: for orders under 398, what is the total revenue (low revenue), and what is it for the rest (high), doing this for both 2022 and 2023. I'll go back to the problem we used previously, where we calculated those median sales at the end of the last lesson, copy it, and, back in our blank notebook, paste it underneath our median calculation. We won't be using those median sales columns as calculated there, so let's just run it to see what we have: it should show all eight categories. Now we can add in the new calculations. For now I don't care about the 2022/2023 split; I just want the low net revenue and the high net revenue. Starting with low: we know we're adding everything up, so we use the SUM function, and inside it a CASE WHEN with the syntax CASE WHEN condition THEN value END, ending with the alias low net revenue.
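For reference, that single overall median step might look like this sketch (the alias s isn't strictly needed yet, but, as in the walkthrough, we keep it for the later steps):

```sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY s.quantity * s.netprice * s.exchangerate
    ) AS median
FROM sales s
WHERE s.orderdate BETWEEN '2022-01-01' AND '2023-12-31';
```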
The condition is that the order value is less than that 398, so we'll add in quantity times net price times exchange rate and check that it's less than 398. This line's getting a little long, so I'll bump it down and indent it. Now we need the value, which is that same revenue, so I'll copy the formula we used above (Cmd+C) and place it in as the value. Let's see if this works as written, and it looks like it does. Do these numbers make sense? Yeah, they do. Now let's add a statement for the high net revenue: copy it all, add a comma, paste it in, change this to greater than or equal to 398, and change the alias to high. Running it to check for errors: looks like we have everything. So now we've at least got the low net revenue and the high net revenue.

Next we need to segment by 2022 and then 2023, and this is where we finally get to what we're trying to teach: using AND inside a CASE WHEN condition. Inside our CASE WHEN we can combine two conditions with the AND keyword, so we add our second condition right here, before the THEN. I'll enter down, indent in, and start that second condition: we've already checked the first one, whether the value is below the median, and the next check is that the order date is between January 1st, 2022 and December 31st, 2022. I can copy this and place it into the high condition as well, then update the aliases so both say 2022. Let's run this and make sure it works. And I've got a bit of a typo: I added two BETWEENs in there, I don't know what I was thinking. Trying again: okay, now it's working. Looking pretty good. Now all we've got to do is add in 2023, and I'll use Gemini because I don't want to retype all that code: paste in the query with the prompt "add in two columns for 2023." It generated it, so I'll add that code cell, add the %sql magic command, and run it with Ctrl+Enter. Bam: net revenue for 2022 and 2023, low and high.

So that's how you use the AND condition, but I'll be honest, I don't like hardcoding values like that 398 into a query, so let me show you a little advanced technique to avoid it. If we scroll back up to the query that calculates the median, we can use a CTE to feed that value into this query. As a refresher, a CTE starts with the WITH keyword, followed by the name you assign it (I'll call this one median value), then AS, then opening and closing parentheses with the median query pasted inside; I like to indent it for readability. We're not using it yet, but I'll press Ctrl+Enter to make sure the query still runs, and it does. Now I just add it to the FROM clause, bringing it in as median value with the alias mv. The column name, remember, is median, so I'll replace that 398 (Cmd+Shift+L to select all of them) with mv.median. Okay, let's run this bad boy and hope it works. It does, and we get our final values.

This is pretty good, because now that we can see these breakdowns between the years by high and low net revenue, and visualize them, we can better understand what happened with the computer sector. Remember, computers dropped in revenue; well, it wasn't in the low-revenue orders, those below the median.
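The finished low/high split with the CTE might look like the sketch below (2022 columns shown; the 2023 pair mirrors them with the dates changed). Same schema assumptions as before; I'm writing the single-row CTE as a CROSS JOIN, which is one way of making mv.median available on every row.

```sql
WITH median_value AS (
    SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY quantity * netprice * exchangerate
    ) AS median
    FROM sales
    WHERE orderdate BETWEEN '2022-01-01' AND '2023-12-31'
)
SELECT
    p.categoryname,
    SUM(CASE
        WHEN s.quantity * s.netprice * s.exchangerate < mv.median
            AND s.orderdate BETWEEN '2022-01-01' AND '2022-12-31'
        THEN s.quantity * s.netprice * s.exchangerate
    END) AS low_net_revenue_2022,   -- CASE with no ELSE yields NULL, which SUM ignores
    SUM(CASE
        WHEN s.quantity * s.netprice * s.exchangerate >= mv.median
            AND s.orderdate BETWEEN '2022-01-01' AND '2022-12-31'
        THEN s.quantity * s.netprice * s.exchangerate
    END) AS high_net_revenue_2022
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
CROSS JOIN median_value mv
GROUP BY p.categoryname;
```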
Really, the drop was in those high-value orders, the ones above the median. So, a pretty interesting insight, and we'll break it down further in the next example.

In this example we're building on the last one by using multiple WHEN clauses within a CASE block. Previously we used only one WHEN clause; now we'll step it up a notch and use two. Why does this matter? It lets us segment within a column, in our case into different revenue tiers: every order will be categorized as high, low, or medium, along with the associated total revenue calculation. So what do high, low, and medium revenue even mean here? We're segmenting based on where an order falls within the percentiles: if an order's revenue is below the 25th percentile we categorize it as low, between the 25th and 75th is medium, and above the 75th is high. Why the 25th and 75th percentiles? It's pretty common in statistics to use these values to bucket things into quartiles; technically, statisticians call the range between the 25th and 75th percentiles the interquartile range. We're just calling it medium, with everything below it low and everything above it high. That's where these numbers come from.

Now you may be thinking, "Luke, how the heck do I calculate the 25th and 75th percentiles?" Remember, the median is the 50th percentile, which is why we used 0.5. So let's copy what we used before and put in 0.25 for the 25th percentile. Running this, we get an error, because silly me, you can't start an alias with a number in SQL; it has to start with a letter. So I'll name it revenue 25th percentile instead and run it. Okay, not bad. Let's do the same for the 75th percentile; adding it and pressing Ctrl+Enter, we have both, with the 75th much greater than the 25th, as expected. Looking good.

We'll use these values in our final query, and we're not going to hardcode them, so I'll create another CTE: I'll tab this over and create a CTE named percentiles, assigning the query inside parentheses. I can't run it on its own just yet because a query needs to follow it. I don't want to start that bottom query from scratch, so I'll scroll up to the previous one we just created, copy it, and modify it: pasting it in, the first thing is to get rid of the conditions we created, along with the median value CTE, since we're no longer using it, but I will bring in that percentiles CTE, giving it an alias like pctl. Let's try to run it, and I can see the error: a comma after the END, before the FROM. Trying again: okay, we get all the categories as expected, with our CTE in place. The first step, I think the easiest since we already have this GROUP BY, is to add an aggregation for the total revenue: SUM of quantity times net price times exchange rate, with the alias total revenue. Running this, we have our categories and total revenues.

We now need one more step, which is exactly what we're trying to learn to implement: breaking these categories down into the revenue tiers, using multiple WHEN statements within our CASE. I'll copy this right here and paste it in between.
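Here's roughly where this is heading: the percentiles CTE plus the tier breakdown, sketched with the same schema assumptions. The tier labels are prefixed with numbers purely so sorting puts them in high, medium, low order (Postgres lets us reference the output alias revenue_tier in GROUP BY and ORDER BY).

```sql
WITH percentiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (
            ORDER BY quantity * netprice * exchangerate) AS revenue_25th_percentile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (
            ORDER BY quantity * netprice * exchangerate) AS revenue_75th_percentile
    FROM sales
)
SELECT
    p.categoryname,
    CASE
        WHEN s.quantity * s.netprice * s.exchangerate <= pctl.revenue_25th_percentile
            THEN '3 - Low'
        WHEN s.quantity * s.netprice * s.exchangerate >= pctl.revenue_75th_percentile
            THEN '1 - High'
        ELSE '2 - Medium'
    END AS revenue_tier,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_revenue
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
CROSS JOIN percentiles pctl
GROUP BY p.categoryname, revenue_tier
ORDER BY p.categoryname, revenue_tier;
```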
The first thing to check for is whether we meet that low-tier condition, basically that the order revenue is less than or equal to that 25th percentile. So for the condition, we check whether this revenue value (I'll copy it and paste it in here) is less than or equal to the revenue 25th percentile we're calculating up above; we brought it in with the alias pctl. We assign it the value of "low." For the next condition I'll copy this again, paste it into condition two, and check that it's greater than or equal to the 75th percentile; in that case it's "high." Everything else is classified as "medium," and for the alias we'll name it revenue tier. All right, everything looks in order, so let's run this bad boy with Ctrl+Enter... and we get an error, which I catch because it points at the WHEN clause. We're aggregating, right, summing the total revenue by these tiers, so we also need to GROUP BY this column. Underneath I'll add in revenue tier and run it again, and bam, now we've got it. Not too bad.

Now, I'm a little nitpicky: I'd like the ordering to go in tier order, so I'll put number prefixes on the labels so sorting puts them in the correct order, ordering underneath the category name by the revenue tier. Ctrl+Enter, forgot a comma, Ctrl+Enter again, and okay, now we have a better order: high, medium, low as 1, 2, 3. This is pretty neat, because we're doing multiple segmentation to analyze these revenue tiers, and when we visualize it, putting it into something like a 100% stacked column chart where high is the light blue and medium and low get darker, we see that the computer sector we keep talking about is very reliant on revenue from high-ticket items, those above the 75th percentile, whereas something like games and toys relies heavily on low- and medium-value items. A pretty interesting insight.

One last technical note: for both the first and last problems we used the 25th and 75th percentiles across the entire range of categories, and similarly, in the first problem, a single median value across all the categories. Technically this isn't best practice: going back to that first problem, you'd really want to calculate the median for each category and then segment from there. But that query gets a lot more complex, and it isn't the focus of this lesson, which is adding multiple WHEN conditions or using the AND condition. If you'd like the nitty-gritty technical details, they're in the notes.

All right, you now have some practice problems to get more familiar with using these multiple conditions for segmentation. With that, we'll be moving into the next chapter on dates, where we'll use a lot of different functions. See you in the next one.

Welcome to this chapter on date calculations, in which we'll learn how to use different date and time functions and keywords to analyze data. In the first lesson we'll get an intro to how this helps with time series analysis: specifically, we'll use things like date_trunc and to_char to calculate things like the number of unique customers or the net revenue per month. In the second lesson we'll fine-tune how to extract certain components of a date, and also use things like the current date or now() in
order to investigate certain time periods relative to when we're analyzing it. In the final lesson we'll cap it off with keywords like INTERVAL and functions like age() in order to calculate things like average processing time and compare that to the number of orders we have. This is all very important, and Kelly and I included it in these beginning chapters because, as you go through the rest of the chapters, you're going to see that a lot of the concepts we're learning for manipulating dates get used in those future chapters. Date and time data is everywhere; you can't get away from it. Now, I highly encourage you, for any of these functions or operators, if you're curious to learn more, to go into the source documentation, which is linked over here. So we're going to be using the date_trunc function. What does it do? Well, with date_trunc you provide a field that you want output, whether that's seconds, minutes, hours, days, weeks, and so on, and you provide the source, usually in the form of a date or time. Let's go over a simple example first. I'm opening up a blank notebook here with a simple query where we're looking at the order date, because that's what we'll be manipulating with this date_trunc function, from sales, and we only have 10 outputs, so I'll Ctrl+Enter. Okay, this is good. Now, all these dates are the same, but I want to be able to see a variety of different dates, so a little trick you can do: I'm going to use ORDER BY and then the function random(). Now, whenever I run this, pressing Ctrl+Enter, I'm getting random dates right here, so we can better see whether it's actually applying across a lot of different data. Anyway, let's get into that date_trunc function. Typing the function right here, we can see a hint that we first need the date part and then the date expression, and I just clicked to open up the documentation, which we can also use to further
investigate what's going on here, so pretty convenient. Okay, so we first input the date part; it's a string, so it needs to be in single quotes, and I'm going to put in 'month'. From there we put in the date expression, so I'll put in order date. Now we run this, pressing Ctrl+Enter, and if we look at these order dates we can see that it's truncated them to just the month. Now, notice the data type of this: it's getting converted to a timestamp, and this is a little inconvenient and a little too verbose for me. We can clean this up; specifically, if you remember from a few lessons ago, we can use this double-colon sign, which is the cast operator, and cast this as a date instead of a timestamp. Running Ctrl+Enter, now I have it output as a date. Take it one step further and rename it as order_month, and we're good to go. All right, so what are we actually trying to calculate in this exercise? This is the final table we're aiming to get to: in it we have the order month, which I just showed you how to get, and we're going to use GROUP BY to analyze the net revenue and also the total unique customers. So let's start with the query we just built. We'll add onto it by first calculating the net revenue; remember, we use the SUM function for this, calculated by multiplying quantity times net price times exchange rate, and this is our net revenue. I want to make sure this operates correctly, and, silly me, we're doing an aggregation with SUM, so we need to perform some sort of GROUP BY. Instead of this ORDER BY, I'll put in a GROUP BY, and specifically we'll call out that we want to do it by the order month. I'll also go ahead and remove that plain order date so we don't have to add it to the GROUP BY. Run Ctrl+Enter; okay, not bad, we're getting net revenue per order month. We only have 10 results right here because I have LIMIT 10; I mainly do that when
building queries so they operate more quickly, and I'll take it off at the end. The next thing to get is total unique customers, so I'm going to go ahead and add that in. We need a distinct count of the customer keys: COUNT, then DISTINCT, specifying customer key, and then we'll assign the alias of total_unique_customers. Running this, okay, we've got total unique customers, net revenue, and order month, so exactly what we needed. We can now remove that LIMIT 10 and press Ctrl+Enter. Bam. Now, date_trunc is really great, especially if you just want to specify one attribute to extract, such as month, as we did; you could also do quarter, year, decade, century, or even millennium. If I'd like to customize it more, I like to use the to_char function: you provide it something like a timestamp and then a text format, and it outputs the value in that text format. Scrolling down, we can see it has a host of different options we can use: you can specify things like hour of the day, year, even month, and the good thing is you can actually combine these together in whatever format or order you want. So let's show this simply by pulling in order date and formatting it as month and year. We go ahead and press Ctrl+Enter and we have our random dates right here; let's now add this new function in on its own line. So enter to_char, then the next thing is the field itself, and then next is what we want to output. In the case that we want something like just the year, I'm going to put the year pattern in single quotes and then run it, and we can see that, unlike last time where we had to cast it as a date and strip off the time and everything, it just outputs what we need. And then if I wanted something else, say not only the year but also the month, I could just put it in there, so double M in that case. Running Ctrl+Enter, now we have the month and year. So this
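Putting the date_trunc pieces together, the monthly aggregation built above can be sketched like this. A minimal sketch assuming the course's Contoso-style column names (orderdate, quantity, netprice, exchangerate, customerkey), not necessarily the author's exact query:

```sql
-- Sketch: net revenue and unique customers per month using date_trunc.
SELECT
    DATE_TRUNC('month', orderdate)::DATE AS order_month,   -- ::DATE casts the timestamp down to a date
    SUM(quantity * netprice * exchangerate) AS net_revenue,
    COUNT(DISTINCT customerkey) AS total_unique_customers
FROM sales
GROUP BY order_month                                       -- the aggregation requires grouping by the truncated month
ORDER BY order_month;
```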
table is super helpful for understanding the different formatting options you have available. So, back to the original example we were working with: we can actually replace this entire line and use the to_char function, specifying order date and then how we want it formatted, and of course we'll give it that same alias of order_month so it performs the GROUP BY properly. I'll press Ctrl+Enter, and, silly me, I forgot a comma after this. And so now I feel this output is a lot more readable regarding the order month, because it removes the day and we can actually see, for each of these months, the revenue and total customers. Now, because we aggregated this on a monthly basis versus the daily basis we were previously using, we removed a lot of noise, and from this we can see that in 2020 we had, obviously, some sort of worldwide event that caused an impact on the number of unique customers and also the net revenue, but it looks like as of 2022 these numbers returned back to normal, except for a slight dip in 2023, maybe something we'll have to investigate later. All right, it's your turn now to go and test these out; we have a few short practice problems for you to get more familiar with this. In the next lesson we're going to be jumping into even more functions, such as current_date or even now(). See you there. In this lesson we're going to build further on what we learned in the last lesson, specifically by understanding more about how we can filter dates, and even do it dynamically. First we're going to learn two more functions for pulling components out of dates, and for this we'll be diving into how we can analyze things like the net revenue for each year for every category; from there we'll use things like current_date and now() to filter data by a certain time frame relative to this time period. Pretty neat. All right, the first of the two functions is date_part, and this
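The to_char variant of the same monthly aggregation can be sketched as below. This is a sketch, with the format pattern 'YYYY-MM' assumed (the lesson formats by month and year but doesn't spell out the exact pattern), and the Contoso-style column names taken from the course:

```sql
-- Sketch: the same monthly rollup, but formatting the month with to_char
-- so the output reads as plain text like '2020-03'.
SELECT
    TO_CHAR(orderdate, 'YYYY-MM') AS order_month,          -- assumed pattern; 'MM' is the zero-padded month
    SUM(quantity * netprice * exchangerate) AS net_revenue,
    COUNT(DISTINCT customerkey) AS total_unique_customers
FROM sales
GROUP BY order_month
ORDER BY order_month;
```

One design note: because 'YYYY-MM' puts the year first, sorting the text alias also sorts chronologically, which is why this pattern works well for grouping.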
one extracts a specific component from a date or timestamp. As we can see, we have the date_part function, a unit, and then the source or column name, with a host of different options. With the sample query we can look at things like the year, month, or day: we use that date_part function, specify those different components and the applicable column, and give them appropriate aliases. One thing to note, which is not necessarily my favorite part, is that the values come back with precision, so they have decimals after them, and I don't necessarily want this depending on what unit I'm working with. Because of that, I prefer to use something like extract, and it has a very similar format to date_part; it's actually based on date_part, and going through the docs we can see we can do things like day, decade, dow, and so on, basically all the same units we can use in date_part we can use in extract. The syntax for this is slightly different, though: in this case, instead of writing the unit as a string, we write it as an uppercase keyword, and then we say FROM the source, in our case our column name. So this bottom query uses extract to do exactly the same thing we did just above in that similar example using date_part: we specify year, month, day from order date and provide the appropriate aliases. Let's go ahead and run this, and like I mentioned, I like this one a lot better, especially when dealing with things like years, months, or days where I want integer digits for these values. So let's use this extract function in order to analyze the net revenue per order month. Now, previously, right, we used that to_char function to analyze the net revenue per order month; let's instead create separate columns for month and also for year. So I'm going to go ahead and remove this portion right here and put in extract. It gives us a hint up here: we want to put in the part first, and we want
the year. Next we use the keyword FROM and then the date expression; specifically we want order date, and we'll give it the alias of order_year. Next we'll get into adding the month one: we'll write the extract formula, do it for MONTH FROM order date, and give it the alias of order_month. Now, for the GROUP BY, we're doing two columns now, so I'm going to do order_year and order_month. This looks good; let's go ahead and run it. Not bad, but it's all over the place, so I'm actually going to change this to do an ORDER BY at the end, so we can get some semblance of order out of this data. And bam, this looks a lot better, and now we have this in different columns, so depending on who I give this data to, they can slice and dice it even more easily. All right, let's actually get into some new concepts and implement dynamic filtering by using things like the current date or the time now to filter back. Let's talk about current_date first. Typing a simple SELECT statement along with current_date, it shows me the documentation for this, which basically says, hey, it returns the current date as of the specified or default time zone; parentheses are optional when called with no arguments; basically, you can provide a time zone if you want. Anyway, running this, we can see we get the current date. Apparently I'm filming this on Valentine's Day; that reminds me, I need to call my fiancée and actually wish her a happy Valentine's Day, so I'm glad I saw that. Anyway, that's current_date. Let's go to the next one, the function now(). Similarly, I can run a SELECT statement just calling the function; make sure you do have an open and closing parenthesis. We run this, and we can see that it is Valentine's Day at 2:30 in the morning. Okay, I'm not actually filming this at 2:30 in the morning; this is Greenwich Mean Time, which is over in England, so that's what time it is
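The year/month breakdown built above can be sketched as one query. A minimal sketch under the same Contoso-style column-name assumptions as earlier:

```sql
-- Sketch: net revenue split into separate year and month columns with extract.
SELECT
    EXTRACT(YEAR FROM orderdate) AS order_year,    -- extract returns clean integer-style values
    EXTRACT(MONTH FROM orderdate) AS order_month,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY order_year, order_month                   -- both extracted columns must be grouped
ORDER BY order_year, order_month;
```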
there, and that's why current_date gives you the option to throw a time zone in there to adjust it appropriately. So what are we going to be calculating? Well, the short answer is we're going to be looking at the net revenue per category for orders going back 5 years from today; we're basically building this table. The important thing to understand is that this is a dynamic filter, and these kinds of things are very important to understand how to do, because sometimes you'll have workflows set up that run queries automatically at midnight, and you don't want to be pulling in all the data; maybe you only want the data for the last 5 years, and things like this are great for that. So let's start with a base query we've seen time and time again. First, we're extracting the order date and the category name, and then performing the calculation for net revenue. We're doing this from the sales table and left joining it with the product table on the product key, and then, because we're doing an aggregation above for the net revenue, we need to actually group it by the order date and the category name. Pressing Ctrl+Enter, we've got this, and it looks like these dates are unordered, so I'm going to go ahead and throw in an ORDER BY. Run this, and now we have the dates in order. So we only need to do one more step, but I'm going to break this step down, because we want to include only orders within the last 5 years; basically, we shouldn't be seeing anything from 2015. I'm in 2024 as of filming this. So we're going to create a WHERE filter in here to do this, but I want to break it down slowly to show what's going on step by step, so I'm going to work out what components we're going to use to filter within the WHERE clause. We'll start simple: first we'll look at current_date and see what it outputs here; as expected, we see it's Valentine's Day. Now, in order to extract out these last five years, we need to
get the year in the current date and also the year in the order date. So for this we're going to use that extract command: we want the part, which is the year, and we'll do FROM, keeping it simple first with just the order date. I'll give it the alias of order_year; running this, we can see it's working just fine, we're getting that order year for all those different order dates. Next, let's extract the year from the current date, so we'll just put in here that keyword of current_date and give it the alias of current_year. Run this; okay, not bad. Now we want the year that we're going to be filtering by, right? We want it to be basically 5 years ago, so all we need to do is copy this one; we'll be getting rid of all these helper columns later, but I want this one right here, and instead what we're going to do is subtract 5 and alias this one as minus_5. Clever, I know. Running Ctrl+Enter, we can see that minus_5 is actually five behind the current year. So what we need to do in our WHERE clause is combine these in a way that filters for that. WHERE clauses go underneath FROM or the LEFT JOIN, and for this we want the order year, so I'm going to go ahead and copy this from above, check that it's greater than or equal to that minus_5 value we computed right here, and paste that in. Let's go ahead and run this, Ctrl+Enter, and bam. Now, this is a little bit hard to read, but just looking at the order date column, we can see these are the orders for the last five years. I'm going to go ahead and remove those unnecessary helper columns now; we don't need them anymore, that was just for building this, and we can see this query is now built for the last 5 years. Now, you may be like, "Luke, it's Valentine's Day right now, but this goes back to January 1st, 2020; what if we wanted to be very precise about that?" And I'll say, "Aha, we'll answer that in the next lesson." All right, you now have some practice problems to
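The finished dynamic filter built step by step above can be sketched like this. A sketch with the course's Contoso-style table and column names assumed:

```sql
-- Sketch: net revenue per category, restricted to the last 5 calendar years.
-- Note: this year-based comparison includes all of the earliest year,
-- e.g. back to January 1st, not exactly 5 years to the day.
SELECT
    s.orderdate,
    p.categoryname,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
WHERE EXTRACT(YEAR FROM s.orderdate) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5
GROUP BY s.orderdate, p.categoryname
ORDER BY s.orderdate;
```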
go through to get more familiar with using these and creating your own dynamic filters. In the next lesson we're going to get into date differences, basically using functions like age() to measure the time between different dates. See you in the next one. In this third and final lesson of this chapter on date and time functions, we're going to get into how to calculate intervals. In the first half, we're going to continue on from the problem from the last lesson, and instead of that really verbose way of calculating the last 5 years, we're going to use the keyword INTERVAL to write much more succinct and readable queries for the span we want. In the second half, we're going to get into a pretty interesting business problem of exploring average processing time; in order to calculate this interval between the order date and also the delivery date, we're going to use functions like age() and also show what can be done with that previous function, extract. First, let's explore how to use this keyword INTERVAL. An interval can represent a span of time such as days, months, hours, minutes, decades, or even weeks, and we use it by writing the INTERVAL keyword and then a value and unit. So let's test this bad boy out: I'm going to run a simple SELECT statement, specify the keyword INTERVAL, and then let's do something like 5 centuries. With this we can see that it gets the title of interval and it calculates it to be 182,500 days. The normal output for this is in days, whether you're using centuries or even something like months; running Ctrl+Enter, it comes out in days. So how can we use this in the query from the last lesson to filter for orders within the last 5 years? I've simplified the query: basically, we're pulling the current date and order date from that sales table, and we use this formula of pulling the year out of the order date and the year out of
the current date and subtracting five. Run this, and we can see that the current date is Valentine's Day and the order dates are within that last 5 years. Now notice, right, I called it out last time: this goes all the way back to January 1st, 2020, so technically this is slightly more than 5 years. So let's write this query a lot more succinctly. I'm going to go ahead and remove this portion right here, and for this we want the order date that we're actually trying to filter for to be greater than or equal to the date 5 years ago, so we can once again use current_date and subtract from it the interval of 5 years. Running this bad boy, we can see that this one now covers exactly the last 5 years, as shown by the order date, and it's very specific: it filters all the way down to February 14th, Valentine's Day, in 2020. So, getting into cleaning up that full query from last time, this is actually it right here; if we run it again, we can see that it has the current date, the order date, category name, and net revenue, so it breaks down the different net revenues by category. This one, technically, remember, wasn't exactly 5 years, so what we can do is go back and replace this portion right here with that newly formatted clause that we came up with, and then whenever we go ahead and run it, we can see that now we have it for the last 5 years. All right, before we get into age() and also a review of that extract function, let's look at what we're actually trying to solve in this portion of the lesson. If you recall, we have two columns, an order date and a delivery date column. I forgot to put a comma here, so it wasn't appearing; now it is. So we have things like order date and delivery date, and what we can do with this type of information is calculate an average processing time for a customer to receive an order, a very important metric in business analytics. So what we're going to do by the end of this is show, on a yearly basis, not only the
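The interval-based version of the filter reads much more directly. A sketch under the same Contoso-style naming assumptions as before:

```sql
-- Sketch: the same 5-year filter written with INTERVAL.
-- Unlike the year-extraction version, this is precise to the day
-- (e.g. back to the same calendar date 5 years ago).
SELECT
    s.orderdate,
    p.categoryname,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
WHERE s.orderdate >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY s.orderdate, p.categoryname
ORDER BY s.orderdate;
```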
net revenue, which is those blue bars, but also the average processing time for those years, and we find that it's going up. So what are we using for this? Well, we're using the age function: you provide it two timestamps (in this case we provide dates) and it outputs an interval. Let's do a simple example first, running it right inside a SELECT statement: we're going to use the age function and provide it two dates. Now, I have a couple of errors with this; I'll go ahead and run it first, pressing Ctrl+Enter, and we get an error on age saying, hey, no function matches the given name and argument; specifically, it says age(integer, integer). I don't want it to evaluate integers, I want it to evaluate dates. The problem is that we have to provide each date as a string, so make sure you have single quotes around it. Running Ctrl+Enter, we have it. Now, if you notice, I did the 8th of January to the 14th of January, and it's saying it's -6 days. For the age function to get a positive value, you need to provide the end date first and then the start date, so I'm going to go ahead and place these in the correct order; running this, we get 6 days. Now let's say we want to do some math with this. I currently have 6 days, and let's say we wanted to subtract, I don't know, 5 days from it. If I put a minus 5 after this and try to run it, I'm going to get an error, specifically that the operator does not exist: interval minus integer. Right now this is an interval and we're trying to subtract an integer, so we need to convert this portion to an integer (I said interval at first; I need more coffee). Well, we can use the extract function. I'm going to go ahead and cut this all out; we're going to use extract, and we need the part, then the FROM keyword, and then the date expression. For this we're going to specify DAY; it doesn't need to be in single quotes, it understands that
keyword of DAY. I specify FROM, and then for the date expression I'm going to go ahead and paste in that age call. Now running this, pressing Ctrl+Enter, we can see that we get that 6 from it, and I can do the minus 5; pressing Ctrl+Enter, we get 1. So let's get into calculating the average processing time by year, and we're going to be doing this for the last 5 years, similar to this table. We also need to calculate the net revenue, but we're not going to do that until the very end, because it would make our query a lot longer with actually joining in the table that has the revenue data. So let's start simple first and just look at the order date and delivery date, getting this from the sales table. Now let's put in a new column for the processing time: we'll throw in that age function, and remember, we need to put the end date first, so that would be the delivery date, followed by the start date, the order date, and we'll name this processing_time. This query is getting quite long, so I'm going to go ahead and throw in a LIMIT 10 just to start with. Running this, we can see we're getting basically zero processing time; everything's getting delivered on the same day it's ordered. I want to see some more varied orders, so I'm going to throw in an ORDER BY random() and run this, and now we can get some rows that actually have some days in there, showing that, okay, this is actually working. So let's start getting the average processing time and aggregating it by year. For this I'm going to use the date_part function, and as the first argument we specify what we actually want, which is 'year', and we want that out of the order date, when the order actually started. We'll give this the alias of order_year. The next thing is to actually get into aggregating this age, but remember, this processing time has the data type of interval, so 3
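The scalar age/extract arithmetic walked through above can be sketched in one statement. The January dates match the lesson; the year 2024 is assumed for illustration:

```sql
-- age(end_date, start_date) returns an interval; extract(DAY FROM ...) pulls
-- the day count out as a number you can do arithmetic on.
SELECT
    AGE('2024-01-14', '2024-01-08') AS span,                          -- 6 days
    EXTRACT(DAY FROM AGE('2024-01-14', '2024-01-08')) - 5 AS result;  -- 1
```

Note the argument order: swapping them (start date first) yields a negative interval of -6 days, which is the error shown in the lesson.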
days, 0 days; it's not going to be able to average that directly, so we need to use that extract function first. I specify the part, specifically the days, then that FROM keyword, and then age(delivery date, order date). I'll then put a closing parenthesis on this and rename the alias to average_processing_time. Now, we're doing an aggregation, so, one, we're going to need to specify a GROUP BY, and, because of this, we can't keep the bare order date and delivery date columns, since everything has to be aggregated. So I'll remove this ORDER BY random() and throw in a GROUP BY right here, specifying order_year; also, this won't have a lot of output rows, so I'm going to remove that LIMIT statement. Then, running this: silly me, I'm reading the error here, "delivery date must appear in a GROUP BY clause or be used in an aggregate function"; basically, I forgot to use the actual function right here, oopsies. Throwing in that AVG and now running this, we are getting the average processing times. All right, we still need to clean this up: these years are out of order, and the number of digits here is just unreadable. So I'm going to throw in an ORDER BY clause underneath, ordering by the order_year; running this, we now have it in order. Now let's clean up this average processing time. I only really want two digits, and if you remember from the basics course, we went over the round function: you provide the value or column x, and after that, n is the number of digits or decimal places that you want; if you don't provide anything, it's zero digits, but we want two, so we'll go ahead and put two on there. Run Ctrl+Enter, bam, got the average processing time. Now, we want to do the last 5 years, so from our previous example up above (I don't like typing code if I don't have to) I'm going to go ahead and copy that WHERE clause and put it underneath the FROM statement. Now running this,
we have the average processing time for the last 5 years. All right, now just one last thing to do: we need to add in that revenue for each of those years. For this we're going to be using the SUM function to sum it all up, multiplying quantity times net price times exchange rate, and we'll give it the alias of net_revenue. All right, let's go ahead and run this; not too bad. Typically, with numbers this high, I don't care about the two decimal places, so I'm going to use the round function again, and in this case not specify a second argument at all. Run this, and okay, it looks like it's still giving two decimal places, just with 0 0, and it looks like I can't get there by specifying zero either. What I can do instead is use cast, and for cast I can specify that I want to cast it as an integer. Ctrl+Enter, bam, now we have what I want. And then, if we graph this, we can see that the average processing time over time has gone up, from a little less than one day up to 1.6; even with the dip in revenue (or, I think, the number of orders, probably loosely correlated), we still saw the average processing time go up, so this is a good little data point to keep track of and pass on. All right, it's your turn to go through; you have some practice problems to get more familiar with the extract and age functions and also the INTERVAL keyword, and from there we're going to be going into the next chapter on window functions, pretty complex topics; I'm looking forward to getting into it. See you there. All right, welcome to this chapter on window functions. This is probably the most requested topic for me to cover in this intermediate course, so I was super excited to get into it. Now, we're going to break this chapter up into five lessons, starting with this lesson focusing on the syntax of window functions, doing some simple examples and explaining them to you. Then we'll start picking up the pace, looking at things
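Assembled end to end, the processing-time analysis described above can be sketched as one query. A sketch only, with the Contoso-style column names (orderdate, deliverydate, quantity, netprice, exchangerate) assumed from the course:

```sql
-- Sketch: average processing time (in days) and net revenue per year,
-- limited to the last 5 calendar years.
SELECT
    DATE_PART('year', orderdate) AS order_year,
    ROUND(AVG(EXTRACT(DAY FROM AGE(deliverydate, orderdate))), 2) AS avg_processing_time,  -- extract first, then average
    CAST(SUM(quantity * netprice * exchangerate) AS INTEGER) AS net_revenue                -- cast drops the decimal places
FROM sales
WHERE EXTRACT(YEAR FROM orderdate) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5
GROUP BY order_year
ORDER BY order_year;
```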
like aggregation, ranking, lag, lead, and then finally we'll close it off looking at how we can use things like frame clauses. Well, none of this matters unless you really understand what window functions are, so let's look at a simple example: a query that's breaking down the net revenue by order number. In it, I'm listing things like the customer key, then the order key, then the associated line number for that order, calculating the net revenue, and then getting it from sales; we're then ordering it by the customer key. So what's going on here? Okay, customer key 15 made only one purchase, of over $2,000, but customer 180 made three separate purchases, where two of those purchases were the same order, they just have different line items. So now let's say, based on all these individual orders, we wanted to find out, because we want to do some deeper analysis, what the average order is; specifically, what is the average net revenue for an order? What I could do is run an aggregation function, removing all those other columns, and just run it here, and see that the average order value is 1,032. But I want it in this table, and I can't necessarily get that in this type of format using this aggregation. This is where window functions come in: instead, all I have to do is just insert our window function (don't worry, we'll go over it in a second); it's using the OVER keyword. In this case I go ahead and run it and have an error because I've got a stray comma in here, and then we can see that it's the same value as below, 1,032, but it's now in our original table, just like we want, so we can do even more calculations with it. So why use window functions? They let you perform calculations across a set of rows related to the current row, like we just showed, and, like we showed, they don't group the results into a single output row. This is very beneficial, as we're going to demonstrate in some future exercises, for things
like running totals, ranks, or even averages. Anyway, let's get into the syntax for this. We start by defining the window function, or what we want to do: if it's an aggregation, something like SUM or COUNT; if it's ranking, something like RANK or DENSE_RANK. Next is OVER, and this defines the window for the function; inside of it we have the keyword PARTITION BY, which we'll get to in a second. So let's walk through that window function we just saw, without using PARTITION BY. I'm going to create a new line, and the window function we're going to use for this is AVG, because we're calculating the average net revenue. From there I'll put our variables in there, quantity, net price, and exchange rate, then we'll put OVER, and it's very important that after this we include open and closing parentheses, even if we're not going to put anything in them. Then I'm going to give it a very verbose title to make sure we understand what it means: the average net revenue of all orders. Going ahead and running this, we can see that, similar to before, it's at 1,032. But now let's say we wanted to narrow this window function further, maybe by something like the customer key, and this is where PARTITION BY comes into effect: it divides the results into partitions, or, better said, divides them into separate groups, without actually having to use something like a GROUP BY clause to do it. So, going back to our previous example, I'm going to go ahead and copy this all so we can see it, add a comma, new line, and paste it in, and in this one we're going to define PARTITION BY customer key and say that this is the average net revenue of this customer. Let's go ahead and run it, and scrolling down we can see that customer 15 only had one order, so that's their average net revenue, whereas customer 180 had multiple, and the average of those was 836. So let's just briefly explore the power of window functions by looking at a simple example. I'm not going to ask you to follow or
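The two windowed averages just described, with and without PARTITION BY, can be sketched side by side. A sketch with the course's Contoso-style column names assumed:

```sql
-- Sketch: per-row detail kept, with two window averages alongside it.
SELECT
    customerkey,
    orderkey,
    linenumber,
    quantity * netprice * exchangerate AS net_revenue,
    AVG(quantity * netprice * exchangerate) OVER ()                        -- empty OVER(): one window spanning all rows
        AS avg_net_revenue_all_orders,
    AVG(quantity * netprice * exchangerate) OVER (PARTITION BY customerkey) -- one window per customer
        AS avg_net_revenue_customer
FROM sales
ORDER BY customerkey;
```

Notice that no GROUP BY appears anywhere: the rows are not collapsed, which is the whole point of a window function.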
understand the syntax, because we're going to get to it later on, but this is going to demonstrate the features we'll be covering in this chapter. So here we're looking at things like customer key, order date, and net revenue; I'm going to go ahead and run this. Let's say for each of these customers we wanted to rank their orders from highest to lowest based on net revenue. Well, I could use something like this window function that's using ROW_NUMBER and some other stuff we're going to get to, and in this case we can see that it actually calculates the rank: specifically, it has the highest at rank number one and the lowest at three for customer 180. Now let's take it to another level: we could do something like calculate the customer's running total. In this one we can see, with 180, which has multiple orders, first it's at 525 and then it jumps up to 2,500; for orders on the same day, the running total is the same. We could also do things like get the customer's total net revenue, and this one isn't really that impressive on its own, because, yeah, we can see that it gets the total, but personally I like it because I can then use it to calculate, hey, what percent is this order of the total net revenue of a customer? In that case I can just add it in, doing the net revenue divided by the window function to figure this out, and now, looking at the table, I can see what percent of the customer's revenue each order is. So this shows even more of why window functions are so powerful. So let's actually get into applying those concepts so you can actually write queries like the ones I just ran through. What we're going to be finding is the percent of daily revenue based on an order line item. To do this, we already know how to do net revenue; we'll need a window function to calculate daily net revenue, and then we'll calculate a percentage from that. So let's start building this query out. We'll list the order date, the order key, the line number, since an
order key could have multiple different shipments with it so therefore it has different line numbers and then finally we’ll just start with the net revenue we’ll get this from the sales table and we’re just going to limit it to the first 10 all right so as mentioned the line number just says 0 1 2 3 and the highest number in this is six i don’t really like that these are two separate columns so i’m actually going to combine them i’m going to take the order key and multiply it by 10 which basically adds a zero to the end and then from there add in that line number and then i’ll give it the alias of order line number we can see what it looks like and it accomplishes what we want so i can go ahead and remove line number and order key the first thing we want to calculate in our final table is that daily net revenue in order to do that we need to use the sum function on our net revenue and now we want to use over and then inside of over a partition by remember we want the daily net revenue so we’re going to be putting in partition by order date and from there we get that daily net revenue now we want to finally calculate the percent that an order line item is of the daily net revenue this one’s going to be pretty simple as it’s a lot of copy and paste so first i’ll drop in the net revenue i’m going to multiply it by 100 so it gives us a bigger number for that percentage and then from there we’re going to divide by all of this window function right here which i’m going to copy and paste into here and we’ll give it the alias percent daily revenue not too bad it’s all over the place though and i want to actually see this ordered within a day to see what is the highest so i’m going to put in an order by specifying the order date and then the percent daily revenue but for this one i want it in descending order and bam we get this final table showing the percent daily revenue based on the order i went ahead and graphed it
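the percent-of-daily-revenue query above can be sketched end to end with python’s built-in sqlite3 module, since sqlite (3.25+, bundled with modern python) supports the same window-function syntax — note the table and every value here are hypothetical toy stand-ins for the contoso sales data, not the real numbers from the course

```python
import sqlite3

# hypothetical toy stand-in for the contoso sales table
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (
    order_key INTEGER, line_number INTEGER, order_date TEXT,
    quantity INTEGER, net_price REAL, exchange_rate REAL)""")
con.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)", [
    (101, 0, '2018-08-02', 2, 500.0, 1.0),  # net revenue 1000
    (101, 1, '2018-08-02', 1, 484.0, 1.0),  # net revenue  484
    (102, 0, '2018-08-03', 1, 525.0, 1.0),  # net revenue  525
])

rows = con.execute("""
    SELECT order_date,
           order_key * 10 + line_number AS order_line_number,
           quantity * net_price * exchange_rate AS net_revenue,
           -- each line item divided by its day's total via a window function
           100.0 * (quantity * net_price * exchange_rate)
               / SUM(quantity * net_price * exchange_rate)
                   OVER (PARTITION BY order_date) AS pct_daily_revenue
    FROM sales
    ORDER BY order_date, pct_daily_revenue DESC
""").fetchall()
for r in rows:
    print(r)
```

apart from the toy setup the query text is the same thing you would run in postgres, and the percentages within each order date always sum to 100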
so we could compare the different order line items for that day and we can see that some of the orders are taking up anywhere from 10% up to 20% of the entire daily net revenue and conveniently up at the top of the chart is the total daily net revenue so it’s pretty convenient with window functions that we can get all this type of data into a single table and it makes it a lot easier later on whenever we dive deeper into it and maybe visualize it one minor note on this query it is getting a little bit verbose in that a lot of this is repeating stuff we could reuse and that could be done using something like a cte or subquery what i could do is put the core items that we need so the order date order line item net revenue and daily net revenue into its own subquery so i’m going to go ahead and tab this over so that way we can do a select star from and then put this all within parentheses and we’ll give it the alias of revenue by day it still has the same output below but now instead of having to repeat all that different code like i did to calculate the percentage i just come up here and insert a new row specifying i want to do 100 times the net revenue divided by the daily net revenue and give it the alias of percent daily revenue now running this boom this has everything we want out of it and in my mind it’s slightly easier to read and to get through when sharing with others so let’s now get into performing a cohort analysis in this last example cohort analysis is going to be done a lot throughout this project because it’s pretty popular in business analytics all right what the heck is this well a cohort is a group of people or items sharing a common characteristic and the analysis examines the behaviors of that group over time whether the group is people or items so what does this even look like well for
this example right here this is what we’re going to be doing analyzing or putting people into cohorts based on the year of their first purchase and then from there calculating what portion of the net revenue they are contributing so down at the bottom is the purchase year and then over on the y axis the net revenue the data starts in 2015 so in 2015 everybody contributing is from cohort 2015 that’s all the net revenue then we get into 2016 where we have the 2016 cohort along with a small contribution from the 2015 cohort we work our way all the way to 2019 and we can see once again that the cohort of that year is the largest contributor whereas those from previous years contribute less now it’s important you understand for this example that the cohort is based on the first year that you made a purchase so what are we aiming to get out of this well we want to have this final table where we have the cohort year or the year of their first purchase and then from there the purchase year or the year the revenue occurred with the total revenue for that cohort so we’re going to start simple first with our first query using a window function to extract the cohort year for a customer we’ll start with the select statement specifying the customer key we’ll also do order date so we can see what’s going on and we’ll get this from the sales table we’re going to limit this to only 10 values we also have a comma right here that we don’t want i’m going to press control enter and bam okay so we’re getting customer keys and order dates i’m also going to go ahead and just for good measure order by the customer key so that way we can make sure we’re looking at this all appropriately specifically if there’s grouping such as here with 180 i see them all together so now let’s use a window function to get our cohort year we’re going to start first by just getting what
is the minimum date for a customer so in the case of 180 i would expect it to be this order in july we’ll start this window function using the minimum of order date then over and inside of parentheses we want to put the partition by of the customer key and i know it’s a date right now but we’re going to name this alias cohort year because we know we’re going to change it okay press control enter okay looking at 180 it is in fact that lowest order date as the cohort year now what we want to do is extract the year from there so i’m going to run the extract function on this and we need to specify the part first so that’s going to be year then from and then everything after this is the date expression then i’ll put a closing parenthesis all the way at the end and run it one thing real quick i did show how 180 was 2018 but now look at this if we go to something like customer key 387 we can see that their first purchase was in 2018 and then later on in 2023 they still have that cohort year of 2018 so we know our formula is working so let’s go ahead and clean this query up because i don’t need this order date at all anymore i’m going to run control enter and now i’m noticing we have duplicates because customer key is appearing more than once so what i can do is add a distinct statement right after this and now when i do this boom i’m only getting distinct values for the customer keys and the cohort year so this table’s a lot more concise we also don’t need this limit 10 anymore and technically we don’t need an order by either this is a good start so we figured out the cohort year for the customer this is once again the final table we want to get to now in order to basically add on these two additional columns of the purchase year and the net revenue we need to calculate them separately and join in that cohort year using something like a cte
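the cohort-year query described here can be sketched as follows with sqlite3 and toy data — sqlite has no EXTRACT so `strftime('%Y', ...)` stands in for postgres’s `EXTRACT(YEAR FROM ...)`, and the customer keys and dates are hypothetical, chosen so one customer has orders years apart

```python
import sqlite3

# hypothetical toy rows: customer 387 orders in 2018 and again in 2023
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_key INTEGER, order_date TEXT)")
con.executemany("INSERT INTO sales VALUES (?,?)", [
    (180, '2018-07-15'), (180, '2018-08-02'),
    (387, '2018-03-01'), (387, '2023-05-20'),
])

rows = con.execute("""
    SELECT DISTINCT customer_key,
           -- min order date per customer, then just its year;
           -- postgres would spell this EXTRACT(YEAR FROM MIN(...) OVER ...)
           CAST(strftime('%Y',
                MIN(order_date) OVER (PARTITION BY customer_key)) AS INTEGER)
               AS cohort_year
    FROM sales
    ORDER BY customer_key
""").fetchall()
for r in rows:
    print(r)
```

the DISTINCT collapses the repeated customer rows, and 387’s 2023 order still lands in cohort 2018 because the window minimum looks across all of that customer’s rows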
so let’s put this all into a cte that we can then join to the sales table to basically attach on for all these different customer keys their associated cohort year and then we can aggregate and find out what is the total revenue we’ll use the with keyword giving this the alias of yearly cohort we’ll use as and then an open parenthesis we’ll tab this on over and then put a closing parenthesis and i just want to make sure this works properly so i’m going to do a select star from that yearly cohort above and run this to make sure it’s outputting correctly yep still the same table okay now let’s join this onto our sales table for this it’s very important we get the correct join specifically we’re going to be connecting to the sales table so i’ll put that right here and then for a join we’re going to be doing a left join we’re doing a left join because we want to make sure that we keep all of the different sales values and then we’re joining using yearly cohort which has been distilled down to remove any duplicate data so we’re not creating duplicate rows i’m going to give them aliases this one s and this one y and as far as how we’re going to join this it’ll be on the customer key from both of these tables let’s actually do a limit 10 because there’s a lot of data all right so just inspecting the table we can see we have the customer key of 947009 and it is joined on that customer key along with its associated cohort year so just as a reminder of what we need to get to now we need the cohort year and then the purchase year or the year of the order date and to aggregate all this to calculate the net revenue now when we do this aggregation we’re not going to use a window function this time we’re going to use a group by on these different years so let’s clean up the columns that we’re actually using we’re going to use that cohort year and
then also the purchase year which is based on order date so we need to extract the year from order date and we’ll define this as the purchase year let’s just make sure this is correct before doing our aggregation next okay good we have all the cohort years and purchase years now we want to move into getting a sum of all the revenues so we’ll use the sum function specifying the quantity times the net price times the exchange rate and we’ll give this the alias of net revenue and because of this i’m going to be doing a group by specifically on that cohort year and also the purchase year for which we’ll need to put the function in here going ahead and running this we can see we have our results and anytime you’re getting to your final results you need to make sure that the numbers make sense specifically i know that the net revenue for 2015 was around $7 million and i can see from this that for 2015 it’s around 7 million so this checks out also silly me i could just use purchase year here it doesn’t necessarily have to be the function itself all right so we want all the values for this because we’re almost there i’m going to go ahead and remove that limit 10 run this and bam we have our final values where it shows based on a purchase year how much a certain cohort contributed to the revenue for that year and this is the ultimate visualization that we get to and interestingly enough we can see that the cohorts from previous years don’t really contribute that much to the overall net revenue so we have a little bit of a retention issue all right it’s your turn now to go in and get more familiar with window functions by doing practice problems in the next lesson we’ll be diving deeper into aggregate functions and basically fine-tuning our knowledge of how to use window functions with that i’ll see you there welcome to this lesson on aggregation functions using window functions
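the full cohort revenue query just assembled — cohort cte, left join back to sales, then a group by on cohort year and purchase year — can be sketched like this, again with sqlite3 and toy data where `strftime('%Y', ...)` stands in for postgres’s EXTRACT and all customer keys, dates, and amounts are hypothetical

```python
import sqlite3

# toy data: customer 1 is cohort 2015 and buys again in 2016,
# customer 2 is cohort 2016 (hypothetical values)
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (customer_key INTEGER, order_date TEXT,
    quantity INTEGER, net_price REAL, exchange_rate REAL)""")
con.executemany("INSERT INTO sales VALUES (?,?,?,?,?)", [
    (1, '2015-03-01', 1, 100.0, 1.0),
    (1, '2016-06-01', 1,  50.0, 1.0),
    (2, '2016-02-01', 1, 200.0, 1.0),
])

rows = con.execute("""
    WITH yearly_cohort AS (
        -- one deduplicated (customer, cohort year) pair per customer
        SELECT DISTINCT customer_key,
               CAST(strftime('%Y',
                    MIN(order_date) OVER (PARTITION BY customer_key))
                    AS INTEGER) AS cohort_year
        FROM sales
    )
    SELECT y.cohort_year,
           CAST(strftime('%Y', s.order_date) AS INTEGER) AS purchase_year,
           SUM(s.quantity * s.net_price * s.exchange_rate) AS net_revenue
    FROM sales s
    LEFT JOIN yearly_cohort y ON s.customer_key = y.customer_key
    GROUP BY y.cohort_year, purchase_year
    ORDER BY y.cohort_year, purchase_year
""").fetchall()
for r in rows:
    print(r)
```

the left join keeps every sales row while attaching each customer’s cohort year, and the group by then rolls revenue up into one row per cohort-year and purchase-year pair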
anyway last lesson we went through and started to apply aggregation techniques to window functions but we’re going to build further on this covering three key concepts so in the last lesson we used the min aggregation function to analyze the impact of yearly cohorts on the total revenue as we went through the years well in this one we’re going to do a simple analysis basically reinforcing what we learned with cohorts but this time using the count function in order to analyze the total number of unique customers and how they contribute based on their cohort into future years next we’re going to move into using the average aggregation function and for this we’re going to be focusing on the long-term value of a customer basically using window functions to calculate the total amount of revenue that a customer has contributed and we’ll not only be able to break it down on a customer by customer basis but we’ll also be able to analyze it from a cohort year perspective as well and finally we’re going to wrap it up with some simple examples on understanding how to filter window functions basically where you should be applying your where clause in order to filter a window function properly all right let’s get into it so for this example we’re going to be using count to aggregate our window function the syntax is all the same so the major concepts remain the same let’s get into what we’re going to be solving we’re trying to find the number of unique customers which we’re going to use the customer key for and find out based on their cohort or the first year they bought an order how they contribute to future years this graph right now is showing the total number of unique customers every single year and then from there broken down by cohort so we can see for 2019 we had those from 2015 16 17 and so on for the final output we’re going to be doing it very similar to last time where we want a table of the cohort year
and then the purchase year and then the number of unique customers based on a cohort year and purchase year now you could start with the query from last lesson and modify it to fit your needs but i find this question is actually simpler than the last one so we’re going to start from scratch it’s also just good practice so for this we want to get not only the customer key but also based on a customer its cohort year or the first year it made an order and then also the purchase year or the year a purchase was made we’re going to be ultimately using that customer key to get the unique count anyway let’s start programming start with a select statement adding in our customer key next we want the cohort year so similar to last time we first want to get that minimum order date for our window function and then from there we want to do it over the partition by customer key i’m going to name this alias cohort year because we’ll eventually get it into a year format right now it’s obviously just a date for this we’re going to be getting this from our sales table let’s go ahead and run this just to see scrolling down to that 180 customer key we can see that it is all from june so this is looking good next thing we want to do is extract the year out of this so we’ll do an extract we need the part which is year then from and then the date expression which we’re going to wrap all in parentheses and then we’ll put another closing parenthesis on there for the extract pressing control enter we have the years now next let’s get the purchase year and we don’t need a window function for this right we just need to extract the year from our order date so i’ll do an extract the part is year and then we’ll do a from and then the date expression is order date we’ll give this the name purchase year and go ahead and run it okay overall looking good everything looks like it’s calculating correctly now what we need to do is
move into actually getting the count of the unique customers using that customer key and you may falsely think that you could use a group by to do this and this right here is actually going to be wrong code specifically we’d want to do something like count the distinct customer keys and then from there add a group by on cohort year and purchase year however whenever we go to run this we get the following error message window functions are not allowed in a group by basically we can’t combine these window functions and group bys so trying to use a group by to calculate this and save some code is not going to work here just wanted to point that out but what we can do is make this into a cte and then use that cte to run a window function on to count the customer keys so to create our cte i’ll give it the alias of yearly cohort open parenthesis and put all the syntax in there next just to make sure that everything is working correctly i’m going to call a simple select star from yearly cohort run this and it’s still running properly okay we’re good to go let’s refine this now now remember this is the final output that we want we want cohort year purchase year and then number of customers so let’s work to build this table i’ll remove the star put in cohort year first put in the purchase year and then next let’s build the window function to go through and count the unique customer keys but we don’t want it just on one thing we want it based on both the cohort year and the purchase year so in this case customer key 180 was a unique customer in 2018 which is also its cohort year but then as cohort year 2018 it was also a unique customer in 2023 for this we’ll first start by doing a count and one thing i’m noticing is this has some duplicate data we want to get unique customers so technically i’d only want to see for 180 this 2018 2018 but then also
for this 180 this 2018 for the current year and 2023 i wouldn’t want to see duplicates so what i can do first is add a distinct up here onto this query running control enter and it’s not really showing because we removed the customer key i’ll just put it in for the time being running this again we can now see that for customer key 180 it’s just those two entries so for this we’re going to be doing a count of customer key or you could do count star then over and this is when we get into our window function using partition by both the purchase year and also the cohort year and we’ll name this num customers so let’s go ahead and run this all right not too bad the one thing to note is this has multiple duplicate lines because we have all those customer keys in there so once again what we can do is add a distinct right after the select statement so that way we can get it filtered down to only unique values and from that we can see that this is all out of order now so we need to actually do an order by so we’ll add an order by on cohort year and then purchase year and scrolling down we have what we want now and i know based on previous calculations that we had in 2015 only 2825 unique customers so this helps validate that this data is correct so if i were to visualize it this is what we get and it comes out very similar to our total net revenue so overall not a lot of unique insights that i found from doing this compared to net revenue but at least we explored it before moving on i want to touch on a short example specifically around what we talked about just recently with this error message that we got when we ran this query where it said window functions are not allowed in group bys technically this isn’t entirely correct as i’m going to show in a second we’ll be able to run window functions with group bys but overall i don’t recommend using window functions and group bys
within the same query and we won’t be doing it for the remainder of the course so i want you to learn this major concept now let’s look at this simple example we’re going through and collecting the customer key and then using a window function we want to count how many orders a customer has so we’re going to partition it by the customer key and this will be the total orders i’ll go ahead and run this and we can see that customer key 15 has one order and 180 has three so they have a total of three orders everything’s working fine but now let’s say we want to calculate the net revenue and this time we want to group this all up because i’m tired of all these separate rows so we’re going to use a group by with this to find the net revenue i’ll put in the average and this will be of the quantity times net price times exchange rate and i’ll give it that alias of net revenue we’re doing an aggregation so like usual we’ve got to use the group by and we’ll do this on the customer key now when we go to run this query it is going to work remember 180 has a total of three orders first i’m getting an error message because i’m silly and didn’t put a comma after this column go ahead and run this now and magically this query does work right even though we have a group by with a window function in it but now going to that customer 180 they only have a total of one order in fact if you look at all of them and i can move into this bigger table all of the total orders for every single row in this column are one so what’s going on here well what’s happening is window functions run after a group by so everything is getting grouped together first and then the window function runs this causes a major issue of conflicting aggregations so if you’re not getting an error message stopping you from this you’re probably not going to get the right results there are better alternatives which we’re going
to go over and mainly that’s using ctes or subqueries and breaking up your queries to separate them with our previous query what happened was we ran our group by aggregation to find our net revenue it condensed all those customer keys down to one value and then secondly it finally ran that window function so that’s why there’s a value of one for all these total orders because it only sees one row after the group by so let’s fix this up so it actually works in this case i would want to run the cte first so i’m going to get rid of this average aggregation and we’ll give this the alias of order value i’m also going to be removing this group by because we’re not doing the aggregation anymore we’ll then put this all into a cte so there’s no aggregation inside of this one we’ll then just query this table getting the customer key and total orders to make sure that it’s aggregating properly i’m going to go ahead and press ctrl enter and with this one we can see that 180 has in fact three and now we can go through and actually do our aggregation with that cte specifically we would get the average order value and give it the alias of net revenue and then perform a group by on customer key and total orders going ahead and running this bad boy we can see we have 180 with three orders and the correct value there so it’s working so for the remainder of the course you’re going to see me anytime i need to do a new group by or a new window function i’m going to just create a new cte and then do it there all right getting into that second exercise we’re going to be focusing on the average function and using this in our window functions overall the syntax is the same so nothing changes there but we’re going to introduce a new business concept specifically customer lifetime value and as the name implies it’s the total revenue generated by a customer over their lifetime with that company we’re also going to explore some other
concepts such as the average order value or the typical amount spent per transaction but that’s less of a focus for this than ltv so for this the main concept of concern is lifetime value which we abbreviate ltv and it is the total revenue generated by a customer for a business over their entire relationship with that company so what are we ultimately calculating well we want to find based on a customer so in this case the x-axis is that customer id what their total lifetime value is and even compare it with this dashed line to the average lifetime value for that particular cohort that calculation is a little bit more tricky to get to but we’re going to get there so the final table we’re aiming for is this based on a particular customer key we want to be able to extract not only the cohort year but also the customer lifetime value and the average lifetime value of that cohort so we can compare the two now this type of analysis is really great because we could do something like target these high ltv customers because they’re more likely to make purchases and that’s typically what businesses do they’re not going after the lower numbers they’re going after the higher numbers anyway let’s actually get into it for this first query we’re just going to focus on three things the first is getting that customer key the next is the cohort year which as we’ve seen before we get by extracting the year now there is an aggregation function in this of the minimum order date so we are going to need to do a group by after this but anyway the next thing we want to get into is actually a sum of their total purchases so i’ll run a sum function put in the quantity times net price times exchange rate and then give this the alias of customer ltv specifically we’re summing up all of the revenue for that customer we’re going to take this all from the sales table and like i said we
need to do a group by specifically by the customer key let’s go ahead and run this all right not too bad we have basically the first three columns of the four that we need and speaking of that fourth column that’s what we want to get the average customer lifetime value for a particular cohort but remember we learned previously we can’t be inserting window functions in a group by so i need to put this all into a cte and then do the window function we’ll start with that with keyword giving it the alias of yearly cohort and obviously putting it all within parentheses like usual i want to double check my work so i’m just going to do a select star from that yearly cohort and everything’s outputting like we saw previously good to go so far inside of our main query we want to have all the different columns so i’m going to do a select star it’s only three so i feel fine doing this but next is where we actually want to do that window function using average and for this we’re going to use that value of customer ltv then over and in parentheses we’re going to do a partition by and we want to do this by cohort year we’ll assign this the alias of average cohort ltv okay let’s go ahead and run this not too bad one thing is i probably want this ordered in a particular order specifically on the customer key and the cohort year so i’m going to stick an order by in there and specify cohort year and customer key and bam this is the final table that we need and this can tell us some different statistics that we can now use like i said we could send this off to maybe the sales department or marketing department and they could do targeted ads at these high-value customers in order to have more of an impact on revenue now i also graphed the average lifetime value for each of the cohorts and as expected it does decrease over time that’s expected because they have less time in the cycle than say somebody from 2016 or 2017
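the ltv query just built — a group by aggregation in a cte, then the window average layered on afterwards, which sidesteps the window-function-in-a-group-by problem from earlier — can be sketched with sqlite3 and hypothetical toy customers and amounts

```python
import sqlite3

# toy data: customers 1 and 2 are cohort 2015, customer 3 is cohort 2016
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (customer_key INTEGER, order_date TEXT,
    quantity INTEGER, net_price REAL, exchange_rate REAL)""")
con.executemany("INSERT INTO sales VALUES (?,?,?,?,?)", [
    (1, '2015-03-01', 1, 100.0, 1.0), (1, '2016-06-01', 1, 200.0, 1.0),
    (2, '2015-07-01', 1, 100.0, 1.0),
    (3, '2016-02-01', 1,  50.0, 1.0),
])

rows = con.execute("""
    WITH yearly_cohort AS (
        -- plain group by aggregation: one row per customer with their
        -- cohort year (year of MIN order date) and lifetime revenue
        SELECT customer_key,
               CAST(strftime('%Y', MIN(order_date)) AS INTEGER) AS cohort_year,
               SUM(quantity * net_price * exchange_rate) AS customer_ltv
        FROM sales
        GROUP BY customer_key
    )
    -- window function runs in a separate query, never mixed with the group by
    SELECT *,
           AVG(customer_ltv) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM yearly_cohort
    ORDER BY cohort_year, customer_key
""").fetchall()
for r in rows:
    print(r)
```

each output row carries both the individual customer’s ltv and their cohort’s average, so the two can be compared side by side exactly as in the final table described above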
i guess the one thing to note is 2015 and 16 are less than 17 so in the first few years we started off we didn’t do as well with customer retention and extracting value so a pretty neat insight out of this all right last part in this lesson and we’re going to focus on two very quick examples of how to use where within a query to filter properly and this is really important especially for the practice problems we have for this in order to understand that you’re filtering properly so let’s start easy first how can we filter before a window function well we could use this where statement here and it is going to apply before we actually invoke the window function itself let’s test this with a simple example for this i’m querying to get the customer key and we’ve seen this window function before of extracting the minimum year to get that cohort year based on that customer key i’m going to go ahead and run this and there’s trusty number 180 we see that it is in cohort year 2018 so we know the query worked properly so let’s say we have a scenario where we only want to look at cohorts from 2020 onward basically we don’t want to put people in cohorts before 2020 we want to just analyze from 2020 onward this is a great example of filtering before a window function in this case we can use the where and we’re going to specify that the order date is greater than or equal to january 1st 2020 now let’s see what happens with customer key 180 okay previously it was bucketed in with its first order in 2018 but now since we’re saying hey don’t pay attention to that anymore i just want to focus from this point onward it’s getting reclassified as 2023 so that’s how you filter before window functions the where clause can be right there in the statement underneath the select and it’s going to get applied before the window function runs now contrast that with filtering after a window function in this case we need to do something like a cte or subquery
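the before-versus-after contrast can be sketched in one runnable example — toy data hypothetical, with a customer whose orders span 2018 and 2023 so the two filters visibly disagree about them

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_key INTEGER, order_date TEXT)")
con.executemany("INSERT INTO sales VALUES (?,?)", [
    (180, '2018-07-15'), (180, '2023-04-01'),
    (500, '2021-05-05'),
])

# filtering BEFORE the window: where discards 180's 2018 order first,
# so the customer gets reclassified into cohort 2023
before = con.execute("""
    SELECT DISTINCT customer_key,
           CAST(strftime('%Y',
                MIN(order_date) OVER (PARTITION BY customer_key)) AS INTEGER)
               AS cohort_year
    FROM sales
    WHERE order_date >= '2020-01-01'
""").fetchall()

# filtering AFTER the window: cohorts are computed on all rows in a cte,
# then we filter on the finished cohort_year, so 180 drops out entirely
after = con.execute("""
    WITH cohort AS (
        SELECT DISTINCT customer_key,
               CAST(strftime('%Y',
                    MIN(order_date) OVER (PARTITION BY customer_key))
                    AS INTEGER) AS cohort_year
        FROM sales
    )
    SELECT * FROM cohort WHERE cohort_year >= 2020
""").fetchall()
print(before, after)
```

same table same threshold but two different answers — which is exactly why it matters where the filter sits relative to the window function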
we’re going to do a cte here we would need to build that cte and then from there filter so let’s continue on with this last example to show when we might want to use this the first thing i’m going to do is remove this where because we’re not going to do anything with it we’re going to put this portion into a cte we’ll give it the alias of cohort and then i’ll do an opening and closing parenthesis to put it all within then we’ll do a select star from cohort so now let’s insert our where and in this case we want to filter after the window function specifically remember previously customer 180 had 2018 as its cohort year let’s say we don’t even want to look at customers at all if they were broken into a cohort before 2020 so for this we want to specify where the cohort year is greater than or equal to 2020 now what i would expect from this is that that customer key once again is going to disappear also this from is messed up right here it should be from cohort okay let’s go ahead and run this and bam we notice now that 180 is removed because we applied this filter after the window function to remove those cohort years from the original purchases anyway wrapping your mind around how where clauses are applied before or after window functions does require some practice so we’ve got some practice problems for you to go through and test out and get more familiar with it all right in the next lesson we’re going to be jumping into functions around ranking and that one is pretty exciting so we’ll see you in there all right we’re continuing on in this chapter of window functions now focusing on how we can rank different values and use a certain order to rank them for this we’re going to be covering three main types of ranking functions row number rank and also dense rank i’ll be explaining the difference between all these and
we’ll be doing this by ranking customers based on how many orders they’ve completed so the more orders they’ve completed the higher they rank but before we even get into any of those functions we need to first understand how to use order by within a window function in order to get the correct ranking that we want now previously with sql you’ve seen order by used typically after the from statement to actually order your values but we can use it inside of the window function and what this does is order our values within a specific partition that we’re running the window function on and order by as a reminder always defaults to ascending but you can specify descending so what are we going to do well with the sales table you’ve already seen the customer key order dates net revenue what we can do is get a running order count and what this will do is based on an order so in this case for 180 it has its first order in july so we have our first order and then in august on the same date we have two more orders so then it bumps up to three similarly for customer 387 they have four orders on their first day and then the next time they complete an order it bumps up to five so let’s build something similar i’m going to throw in a select statement along with customer key order date and then net revenue we’ve seen this all before this is going to be from our sales table i’m going to go ahead and run this just to see what it’s outputting okay sweet not bad let’s now get a count of the orders using a window function based on that customer key so i’m going to insert a count function and we’ll just do it on count star we’re going to do this over and then i’m going to do a set of open parentheses i’m actually going to indent this down to make it easier to read and for this we’re going to partition by customer key we’ll do the order by in a second i want to go ahead and see this first now we’ve seen this before but i
want to call out some things specifically notice that the customer keys are different from before so it actually went through and ordered it with that partition by except it didn’t order it by or didn’t order the order date because this one for 180 in june or july is after the august anyway that’s something to note for later but we do find we get the correct calculations for this because 180 has three orders and we can see all of them here so now let’s insert in our order by i’ll put it right after the partition by and specifically we’re going to be doing it by that order date and this can control our row processing order let’s go ahead and run this bad boy and so inspecting this table we can see now that not only the customer keys in order the order dates in order and whenever we go through the count itself for 180 it has one and then the next time in august there’s two more so now it increases by three so our function’s working so for these aggregate functions like count average or whatnot it’s going to determine how values are accumulated row by row also i realize i never gave this an alias so i’m going call it running order count anyway let’s now use one more aggregate function to demonstrate this further specifically i want to do a running average of the net revenue so basically i after the first order i expect to be around 525 and then after these next two orders i’d expect it to be an average of all three of these and it’s going to do this line by line for this i’m going to insert a new line i’m actually going to just go ahead and copy a lot of this because the boiler plate code itself is going to be approximately the same instead of doing a count we want to do an average and specifically we want to do it on that net revenue so i’m going to copy that and then paste it into here we’re still going to be partitioning by the customer key and ordering it by the order date in order to carry carry out that rowbyrow execution and we’re going to name this running 
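The running count and running average just built can be sketched with toy data. Column names and figures are assumptions, and SQLite stands in for Postgres; note how rows sharing an order date land in the same frame, which is exactly why customer 180's two August orders both show a count of three.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, netrevenue REAL);
INSERT INTO sales VALUES
  (180, '2023-07-01', 525.0),
  (180, '2023-08-15', 900.0),
  (180, '2023-08-15', 1083.0),
  (387, '2023-01-10', 200.0);
""")

rows = conn.execute("""
SELECT customerkey,
       orderdate,
       COUNT(*) OVER (PARTITION BY customerkey
                      ORDER BY orderdate) AS running_order_count,
       AVG(netrevenue) OVER (PARTITION BY customerkey
                             ORDER BY orderdate) AS running_avg_revenue
FROM sales
ORDER BY customerkey, orderdate
""").fetchall()
for r in rows:
    print(r)
```

For customer 180 the running average after all three orders is (525 + 900 + 1083) / 3 = 836, matching the ballpark figure mentioned in the lesson.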
Okay, let's run this. Inspecting it, for 180's first order the running average equals that order's revenue, but when we move into the next round of orders it's averaged across all three, so at that point in their order history the average order revenue is around $836. With that knowledge, let's get into our ranking functions, starting with ROW_NUMBER and how it needs to interact with ORDER BY. Postgres has quite a few ranking window functions, but in this lesson we're focusing on the top three: ROW_NUMBER, RANK, and DENSE_RANK. ROW_NUMBER returns the number of the current row within its partition, counting from one. So let's just label the rows of our sales table with a row number. We'll select all the columns from the sales table, and since that's every row, I'll limit it to the top 10. Ctrl+Enter; okay, there are our values. One quick note: the index appearing on the left of this data frame (0, 1, 2, 3...) isn't something we can reference in SQL, which is why we have to generate our own numbering. After the star I'll start a new line, so all the columns still appear, and specify ROW_NUMBER() with OVER and an empty set of parentheses; we're not using a PARTITION BY just yet. Running this and scrolling all the way over, we now have the row number, 1 through 10 as expected, and the data set looks like it's in order. I don't really like where this row number column is appearing, though.

I want it at the front of the data set, and I'll also give it the alias row_num. An important thing to note: yes, this does provide row numbering, but in a, well, chaotic order; it's not guaranteed to be assigned based on how the data sits in the system. So I don't recommend doing this; anytime you use one of these numbering functions, use an ORDER BY. What should the order be based on? Every order has a unique order key and line number, and additionally I'll use the order date to make sure we maintain date order, since I'm not 100% sure whether order keys might shuffle around at certain points in time. So I'll press Enter and add the ORDER BY (indented to make it easier to read), specifying the order date first. Running it as is, it's still providing row numbers in sequential order by order date, but like I said, we want to be very specific, so I'll also add the order key and then the line number. Ctrl+Enter, and bam: this has what we need for ROW_NUMBER. Let's take it a step further by combining it with PARTITION BY. Say we want a daily order number that restarts every new day. First, let's look at what happens on the 2nd of January 2015 by inserting a WHERE that filters for order dates greater than the first; this is only to peek at that next day. Ctrl+Enter, and silly me, I already forgot what we learned in the filter lesson: a WHERE underneath the window function applies the filter before the window function runs, so it automatically restarted the numbering at one, which it generally wouldn't do without that WHERE. It's not necessary for you to follow along with this part; it's just for demonstration. I put our original query inside a CTE called row_numbering and then queried its first 10 values. With that same query below, I filtered for the second, basically where the order date is greater than the first, and we can see the row numbering starts at 26, 27, and so on. We want a new window function that assigns fresh numbering each day. So back in the original query, with the CTE removed to make it easier to run through all this, we add a PARTITION BY right above the ORDER BY, using the order date, so the numbering restarts every single day. Running this, we won't see any difference at the top of what's displayed, but further down, around row 420, the numbers only go up to about 97, so it looks like it's working. To demonstrate again, I'll copy this, paste it into that CTE, and run it: remember previously the second started its numbering around 24, but now it starts at one. So this is working. Let's now compare the three major ranking functions, ROW_NUMBER, RANK, and DENSE_RANK. Note that all of them return a bigint, but each has a different way of ranking depending on your use case. For a simple example, say we calculated how many orders each customer completed, and this customer up here has a total of 31 orders.
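Stepping back, the per-day numbering pattern built above, ROW_NUMBER with PARTITION BY the order date plus a deterministic tie-breaking ORDER BY, can be sketched with toy data. Column names are assumptions and SQLite stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (orderkey INTEGER, linenumber INTEGER, orderdate TEXT);
INSERT INTO sales VALUES
  (1, 0, '2015-01-01'), (1, 1, '2015-01-01'), (2, 0, '2015-01-01'),
  (3, 0, '2015-01-02'), (3, 1, '2015-01-02');
""")

rows = conn.execute("""
SELECT orderdate,
       ROW_NUMBER() OVER (
           PARTITION BY orderdate                      -- restart each day
           ORDER BY orderdate, orderkey, linenumber    -- deterministic tie-break
       ) AS daily_order_number
FROM sales
ORDER BY orderdate, daily_order_number
""").fetchall()
print(rows)
```

The unique (orderkey, linenumber) pair in the ORDER BY is what makes the numbering reproducible run to run; without it the assignment within a day would be arbitrary.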
We'll rank them using ROW_NUMBER, then RANK and DENSE_RANK, and see how they differ from each other. First let's pull in the information we need with a simple SELECT: the customer key, and to count all the orders a COUNT(*), basically counting each line as an order, aliased total_orders. We get this from the sales table, and since we used an aggregation we need a GROUP BY on the customer key. Checking this out, we can see the different customers, but the total orders are all over the place; they're not in the order we need. We'll be ordering inside our window function, though, so I don't want to do that until after. Also, this is just too many values, so I'll add a LIMIT 10 for the time being. All right, let's build our window function, first with ROW_NUMBER, to assign a row number based on each customer's total number of orders. I'll start a new line and insert ROW_NUMBER() with OVER; inside, we don't need a PARTITION BY, because we've already grouped by the customer key to get the totals. All we need is an ORDER BY, specifically on that COUNT(*), and we'll give it the alias total_orders_row_num. Real original, I know. Running this... what we have is not really what we want: these all have a total of one order, and it's assigning row numbers one through ten, but I want to rank from highest to lowest. As you remember from the beginning, we can add the DESC keyword in that ORDER BY. Running it now, the total orders row number starts at the highest, 31, and works its way down. Notice that when we get to the repeating values of 26, 26, 26, it assigns 4, 5, and 6, but technically these are all tied ranks, and that's why we need to learn other functions like RANK. I'll insert another line and copy the ROW_NUMBER expression, since a lot of this is boilerplate we can repeat; change ROW_NUMBER to RANK and the alias to total_orders_rank. Running this, notice that when we hit the repeating values of 26, RANK assigns 4, 4, 4, and when we get to 25 it assigns 7, skipping 5 and 6. Now you're probably thinking: well, Luke, what if I don't want to skip 5 and 6 in my ranking and just want to continue numbering from there? That's where DENSE_RANK comes in. Once again, start a new line, paste in the old code, change the function to DENSE_RANK, and update the alias to dense_rank. (I had left a stray comma at the end; removing it and running again.) We now have the new column: we get the repeating fours, the next value stays consecutive and jumps to 5, and then the similar values of 24 are all sixes. So when I need to order or number something, whether I use ROW_NUMBER, RANK, or DENSE_RANK really depends on my criteria. You've got some practice problems now to go through and get more familiar with these functions. In the next lesson we're getting into LAG and LEAD: so far our functions have only looked at the current row each one is on.
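The three ranking functions compared above can be sketched side by side on tied counts. The order totals are made up to mirror the lesson's walk-through, and SQLite stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (customerkey INTEGER, total_orders INTEGER)")
conn.executemany("INSERT INTO totals VALUES (?, ?)",
                 [(1, 31), (2, 28), (3, 27), (4, 26), (5, 26), (6, 26), (7, 25)])

rows = conn.execute("""
SELECT total_orders,
       ROW_NUMBER() OVER (ORDER BY total_orders DESC) AS row_num,
       RANK()       OVER (ORDER BY total_orders DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY total_orders DESC) AS dense_rnk
FROM totals
ORDER BY row_num
""").fetchall()
for r in rows:
    print(r)
```

On the three-way tie at 26: ROW_NUMBER hands out 4, 5, 6 arbitrarily; RANK gives all three a 4 and then skips to 7 for the next value; DENSE_RANK gives all three a 4 and continues at 5 with no gap.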
But we can actually use functions that look at rows before or after the current one, which is pretty powerful. All right, with that, I'll see you in there.

Welcome to this fourth of five lessons on window functions. In this one we're getting into functions like LAG and LEAD, which let a window function look at things like the row above or the row below instead of the current row. We'll explore these main functions first with a very simple example, looking at our 2023 monthly revenue and evaluating month-over-month growth, since we can now look at the row before or after and calculate it from there. Then we'll shift into our final scenario, which is slightly more complex, analyzing the growth of cohorts over the years and seeing how they change from year to year. So which functions are we exploring? There are generally about five you can use. Let's start with the easy ones: FIRST_VALUE, LAST_VALUE, and NTH_VALUE. FIRST_VALUE returns the value evaluated at the first row of the window frame; LAST_VALUE, obviously, the last; and for NTH_VALUE you specify an integer and it returns the value at that row. LAG and LEAD are similar, in that they return a row that is lagging or leading the current one, but before we get too deep into it we need to explore this with an example. First we need to calculate the monthly net revenue for 2023, and then we'll apply these functions in a window function to try out first, last, lag, and lead. Let's start simple by building the query. I'll start with the SELECT statement, and the first column we want is the month.

For that I'll use the TO_CHAR function, since I really like its output; it's a lot easier. We run it on the order date in this format, with the alias month. Next we want the net revenue, so we use the SUM function; as always, it's quantity times net price times exchange rate, aliased net_revenue. This all comes from the sales table, and since we're aggregating we need a GROUP BY, specifically by month. Running this, the first thing I notice is that it comes back ordered all over the place, so I'll insert an ORDER BY, also by month. We also only want to analyze 2023, so I'll insert a WHERE using EXTRACT to pull out the year and compare: EXTRACT(YEAR FROM order_date) = 2023. It's a number, so we don't have to put it in a string. And bam, we have what we need. Since we're going to run window functions on this, I'll put it all into a CTE: a WITH statement aliased monthly_revenue, opening and closing parentheses, then a SELECT * FROM the CTE. Still outputs fine. So let's explore, starting with the easy ones: FIRST_VALUE, LAST_VALUE, and NTH_VALUE. I'll specify FIRST_VALUE and pass it the value expression we want out of this, in our case net_revenue from inside our CTE. Next is OVER, and inside we don't need a PARTITION BY, since everything's already grouped, but we do need an ORDER BY.

Specifically, what should "first" be based on? We want the first value by month, not by the value of net revenue, so I'll put month in there, with the alias first_month_revenue. Running this, we do in fact get that first month's value repeated down the column. Now let's look at LAST_VALUE: change the function name and the alias, and run it. Unfortunately, when you read the result you find out, oh heck, this is not the last month's revenue for each row. That's because the frame inside LAST_VALUE needs to change: we need the special condition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING after the ORDER BY. With that, it really is the last month's net revenue. I just wanted to demonstrate that this one takes a little more fine-tuning; we haven't covered UNBOUNDED PRECEDING or UNBOUNDED FOLLOWING, but we will in the next lesson, so stay tuned. Next up is NTH_VALUE. I'll copy the FIRST_VALUE expression, paste it underneath, and change the function name to NTH_VALUE. For this one we also need to specify an integer for which row to return; for now we'll do three rows down and call it third_month_revenue. Running this, we get the third month's revenue, though with null values at the top. Once again, if you want to fix this (which we'll cover in the next lesson), we can insert that same ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause right after the ORDER BY, and now the third month is filled in all the way down. Don't worry, that will all make sense when we cover frame clauses, but at least you've got the basics. Now let's wrap up the simple example with LAG and LEAD.
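The FIRST_VALUE / LAST_VALUE / NTH_VALUE behavior above, including the frame fix that LAST_VALUE and NTH_VALUE need, can be sketched like this. The monthly figures are made up, and SQLite stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, net_revenue REAL)")
conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)",
                 [('2023-01', 100.0), ('2023-02', 200.0),
                  ('2023-03', 300.0), ('2023-04', 400.0)])

rows = conn.execute("""
SELECT month,
       FIRST_VALUE(net_revenue) OVER (ORDER BY month) AS first_month_revenue,
       LAST_VALUE(net_revenue) OVER (
           ORDER BY month
           -- without this frame, "last" would just be the current row
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_month_revenue,
       NTH_VALUE(net_revenue, 3) OVER (
           ORDER BY month
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS third_month_revenue
FROM monthly_revenue
""").fetchall()
print(rows)
```

FIRST_VALUE works without the frame clause because the default frame always starts at the first row anyway; LAST_VALUE and NTH_VALUE need the frame widened to UNBOUNDED FOLLOWING so every row can see the end of the window.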
For LAG, since it's a lag, it will be the previous month: it returns the value evaluated at the row that is offset rows before the current row within the partition. Optionally (it's shown in square brackets in the docs) we could pass an offset integer, similar to what we did for NTH_VALUE, but we're not going to use an integer for these examples; we just want the previous row, and then we'll use LEAD to find the next month's. Back in our query (I removed some of the other columns because it was getting to be too much in there), we specify LAG with the parameter net_revenue, still ORDER BY month, and alias it previous_month_revenue. Running this, yep, this is in fact the previous month's revenue. Like I said, it does take an optional offset parameter: if I pass 2 and hit Ctrl+Enter, it's now offset by two rows back. We're not using that for our case, since we're eventually calculating month over month, so we'll leave it at one and run it. Okay, now LEAD: I'll specify the function, and instead of previous month the alias is next_month_revenue. Running it, we can see it really is the next month's revenue for each row. Now you're probably thinking: Luke, what the heck does this even matter? How does this help me as an analyst? Well, say I wanted to find my month-over-month growth, which is pretty common in the finance industry for evaluating performance. Here I have a chart: the bars show each month's revenue, and the line shows the rate of revenue growth month over month. We can see that May had a substantial increase in growth, because April was pretty low. Anyway, it's a big indicator in the business world, so let's get into calculating it. For this we care about the previous month's revenue, so we don't need the LEAD function; I'll remove it. First, we take net_revenue and subtract the LAG expression to get the change each month, aliased monthly_revenue_growth. Okay, now we can see it calculating that growth; an easy row to check is row two, where we went from 4 million the previous month to 2 million, so we lost 2 million. The calculation's doing well. But we want a rate of change, so we need to take net_revenue minus the previous month's revenue and divide by the original value, which is the previous month's revenue. I'll copy that expression, paste it in as the divisor, run it, and bam, there we have it: we're now calculating the rate of change, and we can see that in March we had about a 50% reduction in revenue, whereas in May we had almost a 150% rate of growth. Just to be clear, these are decimals; if I wanted I could multiply by 100 to read the percentages a little more easily. With this, let's get into an example that shows the real benefit of these functions. Previously, in the second lesson of this chapter, on aggregation, we covered the average lifetime value by cohort. If you recall, customers are assigned a cohort based on the year of their first purchase, and we noticed this trend that following around 2016 the lifetime value drops, which makes sense because those cohorts' total lifetimes are shorter.
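The month-over-month growth calculation above can be sketched as follows. The revenue figures are made up to mirror the walk-through (4 million dropping to 2 million), and SQLite stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, net_revenue REAL)")
conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)",
                 [('2023-01', 4000000.0), ('2023-02', 2000000.0),
                  ('2023-03', 3000000.0)])

rows = conn.execute("""
SELECT month,
       net_revenue,
       -- absolute change vs. the previous month
       net_revenue - LAG(net_revenue) OVER (ORDER BY month)
           AS monthly_revenue_growth,
       -- rate of change: (final - original) / original, as a percentage
       100.0 * (net_revenue - LAG(net_revenue) OVER (ORDER BY month))
             / LAG(net_revenue) OVER (ORDER BY month)
           AS growth_pct
FROM monthly_revenue
""").fetchall()
print(rows)
```

The first month has no previous row, so LAG returns NULL and both derived columns are NULL there; February shows a loss of 2 million, a -50% rate of change.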
But we did see an unexpected rise from 2015 to 2016. So what if we want to go through and analyze the drops between each of these lifetime values? We can use our LAG function for this. The table we need to start from is the cohort year versus the cohort's average lifetime value, which we basically calculated back in lesson two of this chapter, on aggregation. I have the formula here, and what we got to in that lesson was this final table, built from one CTE plus another query underneath it. I took it a step further inside the notes, so for those with access to the notes, you can go right to it and copy it; the other option is to just pause your screen and copy it in. This next example also isn't that long, so you can simply watch along if you prefer. We need to run window functions on this, and after inspecting it (it's a CTE inside another CTE, plus a SELECT DISTINCT because there are multiple rows per cohort), I'm going to put this one into a CTE as well, named cohort_final, add a closing parenthesis, and then do a SELECT * FROM cohort_final to make sure everything's appearing fine underneath. And it is. Now we can work from here to create our first column, the previous year's lifetime value, using LAG. I'll do a SELECT * to keep both original columns, then LAG with the value we want to appear, the average cohort lifetime value, then OVER, and inside the parentheses an ORDER BY on the cohort year, with the alias previous_cohort_lifetime_value. And okay, it is in fact showing the previous cohort's lifetime value.

All right, the final thing I want to do here is calculate the percent change, that year-over-year change, or better, the cohort-year-over-cohort-year change. Remember, our ratio is final minus original over original, so I take the final value, average cohort lifetime value, subtract the LAG expression (Cmd+C, Cmd+V), and then divide by the original, pasting in that previous cohort LTV again, with the alias lifetime_value_change. Running this... it's not the correct value; I'd expect more of a decimal number. I think this has to do with my order of operations: the parentheses are in the wrong place, since I want the subtraction first and then the division. Fixing that gives us what we want. I'd also like to see these as percentages, so I'll multiply by 100 at the front, and bam, now we can see our lifetime value change year to year: as expected, a slight increase in 2016, and from there it went down. If you visualize this, with the average cohort lifetime value as bars and the rate of change as a line, you can see the rate of decline is actually picking up. Although I'd expect it to go down, I wouldn't expect it to go down at this high a rate, so in a real-life scenario that might be something to dig into. You've got some practice problems now to get more familiar with handling these kinds of window functions. In the next and final lesson of this chapter we'll go into further detail and syntax, so that, like we demonstrated with LAST_VALUE not working properly without extra syntax, you'll understand what's going on. All right, with that, I'll see you in the next one.
Welcome to this fifth and final lesson on window functions, where we're focusing on frame clauses. Up to this point we've really only covered two parts of the window definition, the portion after OVER: PARTITION BY and ORDER BY. But there's one more thing, and that's the frame clause, which controls how much data actually goes into the window function. What do I mean by all this? Well, as we'll be solving here, say we have the monthly net revenue for 2023: as you can see it's highly volatile, going up in February, back down, then back up. This is where frame clauses come to the rescue: we can look before and after certain rows and, in this case, average them, performing a three-month running average to smooth out the line. This is very common in business analytics, especially with seasonal data that has these ups and downs; you want to remove the noise so you can read the trend more clearly. Postgres has documentation that goes into all of this, but it gets quite complex, so I've simplified it in our notes. This lesson focuses on the ROWS frame clause, basically choosing which rows go into the window function. As hinted, it comes right after the ORDER BY, and we can either include just a start frame, or use the BETWEEN keyword to signify a start frame and an end frame. But what the heck are these start and end frames? There are five main things we can put there, and that's the majority of what this lesson covers: we'll see how to use the current row, preceding rows, and following rows. Don't worry, we're going to break each of these down so you're more than familiar with them by the end.

Now, besides ROWS, Postgres also lets you specify things like RANGE or GROUPS. I'd classify those as more advanced SQL, so we're not really covering them in this course. Additionally, I had ChatGPT make this fancy table down here, and I want to point out that RANGE and GROUPS also aren't supported in a lot of other popular databases, specifically MySQL and SQL Server, so I don't want to waste your time on keywords you may not have available. That's why we're focusing on ROWS, which you need to learn anyway to be able to apply RANGE and GROUPS later on. For this entire lesson we'll be analyzing our monthly net revenue, similar to the last lesson, because, as you remember, we had some unanswered questions about how to use some of those functions without what we're learning here; we'll get to that by the end. You should remember this query, or have it in your system already: it gets not only the month but also the net revenue for that month, pulls from the sales table, and extracts only 2023, since I just want to look at one year so there's not a lot of data to mess with. Since we're doing an aggregation we need a GROUP BY, and finally an ORDER BY, because otherwise it comes back all out of whack. You should see something like this, and if we graph it, it looks like what we saw at the beginning: up in February, then a strong dip down in April, then back to normal. We're working toward a running average, but we first need to understand CURRENT ROW, the keyword, before we move any further. I want to run a window function on this query, so I'll put it into a CTE with the alias monthly_sales.
I'll wrap it in parentheses, select both the month and the net revenue, and pull this all from monthly_sales. Running it, it's exactly the same thing we saw before. Now we can use a window function here instead of in the original query; like we said before, you can technically combine window functions with a GROUP BY, but it's really complicated and we're not going to do it. We'll do the window function below instead. Specifically, all I want is the average net revenue, so we start by calling AVG on net_revenue, then OVER on a new line, indented down. Before anything else, let's look at what it generates: the average across all 12 months. We want to order it by the month itself, so I'll add an ORDER BY and specify month. Running this query, we're still getting an average, but it's slightly different now: for January it's still just the January value, but for February it's the average of January and February, and for March the average of January through March. Anyway, we want to control this average, so let's move forward. I'll rename this column with the alias net_revenue_current, because we're about to use CURRENT ROW. The syntax is ROWS followed by the start frame (we'll eventually get to ROWS BETWEEN, but we'll start simple with this first), and remember the start or end frame can be any one of the five options; we're starting with just the current row before moving into calculating the running average. So I'll insert ROWS CURRENT ROW. This runs the window function on only the current row, so there should be no difference between any of these values; I promise you'll see more of an impact in a little bit.

But I do want to demonstrate how you can also write this with ROWS BETWEEN, giving a start frame, CURRENT ROW, and an end frame, once again CURRENT ROW. As you'd expect, it's still only looking at the current row, so it's the same across all of these. Let's get to our next keyword, N PRECEDING, which looks at preceding rows or preceding values. Here's the final table we're building toward: we still have our month and net revenue, and in this case we look one row back plus the current row to get the average. At the first row we expect it to equal the current value, but at the second row it looks at the current row and the preceding one and averages them, so the average of 3.6 million and 4.4 million is around 4 million. For this we use N PRECEDING as the start frame, specifying N as a number. Getting back to our query, right now we have ROWS BETWEEN CURRENT ROW AND CURRENT ROW; we want to go one row back and also look at the current row, so I'll change the start frame to 1 PRECEDING (not "proceeding", mind the spelling). Running this, as we saw in the demo table, we're now averaging the current row and the previous row, getting 4 million in this case. We can use any number of preceding rows; to make sure it's working properly, swapping in 0 PRECEDING means just the current row, and we get the same values as before. Changing it back to 1. Now I'll take it a step further, just for demonstration purposes; you don't have to do this portion.
PRECEDING and 3 PRECEDING, and I got this fancy-dancy table, and from there plugged it into ChatGPT to visualize it. What we can see from this is that with each additional preceding row we include in the average, the line becomes smoother and smoother. It starts with a darker line for just the core net revenue, and it gets lighter and lighter depending on the preceding amount used. I know there's a lot of overlap here, so I also graphed each one individually, showing how over time it gets smoother and smoother. Now you're probably thinking, "Luke, is this what you do in a real-world scenario?" I'd say, well, not typically just PRECEDING; I combine it with something like FOLLOWING, so we use values before and after the current month and get something like a three-month average. Let's do that now. As a reminder, the syntax is n FOLLOWING, where you specify a number and it includes that many rows after the current row. Back in the query we're working with, instead of 1 PRECEDING and the current row, I'm going to change this to 1 PRECEDING and 1 FOLLOWING. Running this query, we can see that all our values are now smoothed out, even that first month, because it's taking not only the current month but also the following month; something like June uses not only May but also July, averaged together to get its value. This, I feel, is more representative of what I'd see in the business world. I went ahead and visualized it, and it shows how this smooths out our net revenue line by performing a three-month average. You could take it up a notch and do something like a five-month or even seven-month running average, but at that point I think you'd be removing a lot of key insights, so I'm going to stop right here. All right,
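Both frame variants from this walkthrough, the trailing average and the centered three-month average, can be sketched on a tiny stand-in table (toy numbers, with SQLite standing in for the course's Postgres database):

```python
import sqlite3

# Made-up monthly figures standing in for the Contoso monthly_sales data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_sales (month INTEGER, net_revenue REAL)")
con.executemany("INSERT INTO monthly_sales VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

# Trailing average: each row averaged with the one row before it.
# The first row has no predecessor, so it is averaged with itself alone.
trailing = con.execute("""
    SELECT month,
           AVG(net_revenue) OVER (
               ORDER BY month
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS trailing_avg
    FROM monthly_sales
""").fetchall()
print(trailing)   # [(1, 10.0), (2, 15.0), (3, 25.0)]

# Centered three-month average: one row back, the current row, one row ahead.
centered = con.execute("""
    SELECT month,
           AVG(net_revenue) OVER (
               ORDER BY month
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS centered_avg
    FROM monthly_sales
""").fetchall()
print(centered)   # [(1, 15.0), (2, 20.0), (3, 25.0)]
```

Note how the centered version smooths even the first row, since its frame can reach forward to the next month, exactly the effect described for the three-month average above.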
our last two start and end frames are UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING. If you look carefully, all they're really doing is replacing the n with UNBOUNDED, which says: use all rows from the start, or all rows to the end. So let's do just that: in here, we're going to replace both bounds with UNBOUNDED. What do you think will happen? Well, if we use UNBOUNDED on both, it takes the entire window frame into account for computing this average, so we're basically getting the average of all 12 months. Now, typically when I see UNBOUNDED used, I usually see it paired with something like CURRENT ROW. Running that, we can see that the first row equals itself, and then as it goes along it takes into account all the values behind it along with the current row, and the line gets smoother and smoother as it goes. So where do I typically see these UNBOUNDED bounds used? Well, if you remember from the last lesson, when we were looking at that same chart of monthly revenue, we used the positional value functions: FIRST_VALUE, LAST_VALUE, and NTH_VALUE. Let's go ahead and run those again. We saw that FIRST_VALUE actually did give us the first value, but LAST_VALUE didn't work out properly; it just gave us the current row. And when we used NTH_VALUE and specified 3, it gave null values for the first two rows and only then gave us the third value for everything else. Well, this is where UNBOUNDED comes in. Let's fix these functions: I'm going to indent down to a new line to make this a little more readable, and then insert our frame clause, specifying ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Let's run this one.
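The broken-versus-fixed behavior of LAST_VALUE and NTH_VALUE can be sketched on a toy table (made-up values; SQLite here stands in for the Postgres database used in the course):

```python
import sqlite3

# Four made-up months standing in for the monthly revenue chart.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_sales (month INTEGER, net_revenue REAL)")
con.executemany("INSERT INTO monthly_sales VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

# With the default frame, which stops at the current row, LAST_VALUE just
# returns each row's own value, and NTH_VALUE(x, 3) is NULL until the frame
# has grown to include the third row.
broken = con.execute("""
    SELECT month,
           LAST_VALUE(net_revenue) OVER (ORDER BY month) AS last_rev,
           NTH_VALUE(net_revenue, 3) OVER (ORDER BY month) AS third_rev
    FROM monthly_sales
""").fetchall()
print(broken)  # [(1, 10.0, None), (2, 20.0, None), (3, 30.0, 30.0), (4, 40.0, 30.0)]

# Opening the frame up to the whole partition fixes both functions.
fixed = con.execute("""
    SELECT month,
           LAST_VALUE(net_revenue) OVER (
               ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_rev,
           NTH_VALUE(net_revenue, 3) OVER (
               ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS third_rev
    FROM monthly_sales
""").fetchall()
print(fixed)   # every row now shows last_rev = 40.0 and third_rev = 30.0
```

This is the same fix applied in the lesson: the frame clause opens up what each row's window function is allowed to see.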
Now, when we look at that last month's revenue, we can see it actually does equal the last month; basically, we had to open up what the window function was allowed to look at by using this frame clause. Similarly, we can do the same with NTH_VALUE; I'll copy and paste that right in here. Running this, we can see that the third month's revenue now appears in every single row, regardless of whether the row is before or after it. So bam, we've now covered all the major aspects of using window functions. You now have some practice problems to go through to get more familiar with these different frame clauses, and in the next chapter we'll be getting into how you can install this database locally and run it locally, so you can have a workflow that's actually workable. With that, I'll see you in the next one. All right, welcome to the second half of this course. Don't know why I had to jump in like that; wanted a dramatic entrance for some reason. Anyway, in this chapter I'm going to take you through all the steps necessary to install Postgres locally onto your computer, get you set up with its editor, pgAdmin, and then also get you set up with an even better editor, DBeaver. So let's break this down. In this lesson, we're going to install Postgres, the database itself, locally onto your computer. We'll download it from the internet, and it will install Postgres but also pgAdmin. Now, pgAdmin ("pg" is short for Postgres) is the editor used to interact with Postgres databases; anytime you need to start or stop the database, or even run a query, we can do that with pgAdmin, which we're going to demonstrate during this lesson. If you already have Postgres and pgAdmin installed, you don't need to do it again, but in this lesson, after the install, we're going to go directly into loading
the database, specifically the Contoso database we've been using in those Jupyter notebooks. Once we get the database set up, we'll do a quick walkthrough of the entire pgAdmin UI to become more familiar with it. Now, there's one major flaw with pgAdmin: it only connects to Postgres databases, and after this course you may be moving on to learn other databases. Because of that, in the second lesson we're going to install DBeaver. This is a database management tool that can connect to many different databases, and you can also run queries in it and see the output. We're going to use the Community Edition, which is free and open source and has everything we need to get started. It's the most popular database tool I know of, so I'm super excited to use it, and everything you learn for using DBeaver can also be applied to the other databases DBeaver can connect to. One last note before we begin: some of you may run into installation errors or other errors along the way. I highly encourage you to use something like ChatGPT to help you out; it's a lot quicker than posting a comment and hoping somebody else comes in to help. Now, let's say you can't figure it out, or you're on something like a Chromebook and can't install Postgres; in that case, you can continue to run all the different queries in our SQL notebooks. They're going to work exactly the same and have the same output, but as far as interacting with the GUI and things like that, you'll have to figure that out yourself, because obviously it's going to be different. So first we're going to navigate to the Postgres download page, and from there you'll select the operating system you're currently on. I'm on a Mac, so I'll select that. We're going to use the interactive installer by EDB, so we'll select "Download the installer", which, regardless of operating system, will get everybody
navigated to this page, where you can select your operating system once again and download it for Mac or Windows. You'll want to launch the installer; if you get a warning message, that's okay, click Open. We're now going to walk through the setup wizard it includes. For all of this, we'll leave the defaults the same; the core things we want to make sure are installed, which they are by default, are the Postgres server and also pgAdmin, the GUI interface. Next is the password, and I'm actually going to set this to a really easy one: "password". My database isn't going to have any confidential material on it, and the database we're installing isn't secret at all, so I don't care if somebody else accesses it; besides, it's local, so other people can't get to it unless I expose it to the internet. Anyway, long story short: if you're only using this for the course, feel free to set it to "password" and you should be okay, but if not, set it to something else and remember it. Keep the port number the same, 5432; it's the standard for Postgres databases. We'll keep the default locale and proceed, and with the setup complete, I don't need to launch Stack Builder, so I'll exit and click Finish. I'm going to verify it's installed by going to my Applications folder: under Postgres 17 I have these different options for what's installed. We're going to open pgAdmin, the GUI interface for interacting with our Postgres database. It'll start loading up, and with it open we have two main panes: the left-hand pane is the Object Explorer, which shows all the different databases we're connected to. Right now it's asking me to connect to the server, specifically Postgres 17, which is the one we installed, so I'm going to put in that password, "password", and click Save
Password. As we can see, Postgres 17 is a server; we have one server, and then we have databases inside that server. Right now there's only the one standard database that comes with all Postgres servers, called postgres. We're not going to touch this bad boy; I don't really care about it, but we can see that, hey, there's only one database in here. We also have options to adjust login/group roles and also tablespaces; we're not going to mess with any of that either. This dashboard over on the right-hand side I find pretty useless; it just tells me when my server has activity and when it's getting used. So now let's install the Contoso dataset locally. For this we need the database file, which is this contoso_100k.sql file; it'll go ahead and start downloading. You don't need to do this, but I opened the file just to show you the contents: it walks through creating all the different tables we need inside our database, along with loading all the data into them. It's a pretty long file. So let's get this file into a database, and for that we need a database. We'll right-click Databases, select Create, then Database, and name it contoso_100k, keeping all letters lowercase. The owner will remain the superuser, postgres, so we can use that same password to access it. We don't need to change any of the other settings; they're all good enough, so we'll click Save. Now we can see two databases underneath here, and it automatically expanded everything below. If I go under Schemas, it has a public schema, and under Tables, if I expand it to see any tables, there are no tables inside it. So this is where we need to load that SQL file into the database. First, you need to know the location of the file; I recommend just putting it on your desktop. We won't need it after it's
done, so you can just delete it afterwards. Back inside pgAdmin, I'm going to right-click that contoso_100k database and go to the PSQL Tool; this is effectively like using a terminal to interact with our database. I'm going to start the command with a backslash followed by i; this tells psql to execute the script we're about to point it at. Inside single quotes, I'll then put the file location: Users, my username, Desktop, and then the SQL file itself. Make sure it's exactly right. On a Mac, if you go to the file, hold Option, and right-click it, you'll get the option to copy the SQL file's pathname; on Windows, all you need to do is Shift and right-click the file icon and select "Copy as path". Okay, we've got it in; I'll press Enter. It says pgAdmin would like to access my desktop; yes, I want to allow this. It goes through and creates all the different tables and alters them, and I can see from this that it created what looks like six different tables, and it tells us the counts of the rows we inserted into those tables. So I'll come in here, go to Tables, and try to see them; there are no tables showing, but all I need to do is right-click it and select Refresh. Now it should have them, and scrolling down we can see six tables inside, which from this menu I can actually dive into; in the case of sales, I can dive into the different columns and everything else associated with it. I can also do a quick check by right-clicking something like the sales table, going to Count Rows, and it tells me at the bottom there are over 199,000 rows in this sales table. But how can we actually query this contoso_100k database? Well, first we need to make sure it's actually selected, then come up here and select Query Tool; we can also see it has a shortcut of Option+Shift+Q. It opens up
in a new tab right here. I see these other tabs; if I don't want any of them, I can select the X and close them out. Up at the top it tells me which database I'm connected to; if I had any others, I could switch right up here. We have our query window, where I'm going to put in a simple command to look at the sales table and the top 10 results. To run this, I'll come up and select the play icon for Execute Script, or press F5, and all the results are displayed below. I also have this scratch pad on the right-hand side, so if I have queries I don't want to keep track of, I can just put them over on the right; overall, I don't find myself using it that much. Other key features of this area: you can open a SQL file right inside this window, or if I want to save this, I can save the file. We also have options for Explain, which we'll go into in more detail in some upcoming lessons. Down at the bottom we have not only the Data Output but also any Messages and Notifications. Inside that output there are a few handy capabilities: you can copy any of your data exports out of it, say if you want to paste them into ChatGPT. Let's say we have a more complex query that actually does some analysis, such as this one right here that looks at total yearly net revenue. We can not only save results to a file, but also graph and visualize them right here: I select Line Chart, put the year on the x-axis and the total yearly net revenue on the y-axis, then select Generate; not too bad for getting into visualizing queries pretty easily. The last thing to note with pgAdmin is that I can also view the ERD, the entity-relationship diagram, for the database by right-clicking it and selecting "ERD for Database". This shows our sales table; I can zoom in and actually see the sales table along with all its different keys and
columns, and how this table is connected to all the other tables. So it's a great way to visualize the database and tables you're working with. All right, we now have some practice problems for you to go through to get more familiar with the pgAdmin GUI. Like I mentioned at the beginning, we'll be transitioning next to DBeaver, but I do find myself having to jump in and use pgAdmin from time to time, so it pays off to understand the basics of this tool; that's why you have those practice problems. With that, I'll see you in the next one. Welcome to this lesson on DBeaver. In this one, we're going to walk through setting up DBeaver and getting it connected to our Contoso database. First, we'll download the Community Edition of DBeaver, which is free; then we'll walk through the steps necessary in DBeaver to connect to our Postgres database; and finally, once that's set up, we'll do a walkthrough of the DBeaver UI, understanding how to run different scripts and how to set up our project inside it. All right, with that, let's get into it. If you navigate over to dbeaver.io, this is the homepage of DBeaver Community. It covers a few details about the tool you can read further; specifically, DBeaver Community, the edition we're downloading, can connect to a variety of databases and has all these different editing and viewing options. Talking to all my data analyst friends and also looking at the research, it's by far the most popular database editor, and that's why we're using it. Now, DBeaver needs to make money like any company, so they also have Pro editions. I'm going to click this (you don't need to): they have a few different editions you can get and use, and some get pretty pricey if you're a business, but for the basic SQL and coding I run, I never need the features inside the Lite, Enterprise, or Ultimate editions. I can get it
all done with the Community Edition. But if you become a power user, I highly encourage you to buy a subscription, because you'd support building out DBeaver further. All right, cool story, Luke; let's actually get into downloading DBeaver. Close the Pro page, and then select the operating system you're on and install it. I'll be going through this on a Mac; Windows is going to be very similar, so I'm not going to cover it separately. After your installer file downloads, you should click it and open it up; on a Mac it's pretty easy, all you have to do is drag the beaver into your Applications folder, and now it's there. I'll go ahead and open it; if it asks whether you're comfortable opening this app, yes, we know where we got it from, so I'll open it. With DBeaver opened and launched, you may first notice that mine may be dark and yours may be white; I have dark mode enabled on my Mac, so I guess it automatically picked that up and changed to dark mode. Anyway, it asks, "Do you want to create a sample database that can be used as an example to explore basic DBeaver features?" We're not going to do that; we're just going to install the Contoso dataset, and then I'll take you through it. It should have immediately popped open with this "Connect to a database" dialog, and now we're going to get into installing the database. If this dialog didn't pop up, that's okay; there are a few different ways to bring it up, and we need to go through them anyway. One other thing before that: it has a popup that says, "Do you want to share your data in order to improve performance?" I'll leave it up to you whether you want to do that or not. So, to create a new database connection, you can either go up to the Database menu and select New Database Connection, or just come right here to this fancy-dancy icon and select New Database Connection. Now, this is one of the reasons I recommend DBeaver so much: it connects to a host of different
databases, so you can connect to all the different ones you're working on as a data analyst. In our case, Contoso is a Postgres database, so we'll select that. Now we need to go through and fill out the connection details. We're connecting by host, specifically localhost, since it's locally on your computer. The database name is not postgres; it's contoso_100k. Make sure it's spelled exactly the same as what appears in pgAdmin. Next is the username, which we kept as postgres, and then the password; if you set it like me, the password is just "password", all lowercase. I'll leave Save Password enabled because I don't want to log in every single time. From here, it's already picking up that we're using Postgres 17, and everything else looks good, so let's go ahead and test the connection. In my case, it says the Postgres driver files are not installed and we need to install them. It's basically like attaching a printer to your computer: you have to install driver files to talk to it. Nothing's wrong here; we'll go ahead and download them. With that, we get our test results back, and it says we are connected. Now, if you are not: one, check all those credentials and make sure they're correct; two, what may have happened is that your database isn't started, so you may need to open pgAdmin and actually open it all the way up to the Contoso dataset and make sure it's running on your machine. Typically, on both Mac and Windows, your Postgres database should start when you restart your computer, so you shouldn't have to do this, but you may have unintentionally disabled that feature, in which case you may have to restart it anytime you restart your computer. With all the credentials in and the connection test passing, we'll go ahead and select Finish. Let's now walk through DBeaver and get into
understanding the UI and running a few different files. Okay, we have this pane right here on the left-hand side, which is our Database Navigator; it also holds information on our projects, which we'll get to in a minute. Anyway, this has all our database information in it. If for some reason it disappears, like I accidentally close out of it, you can go into the Window menu and show the Database Navigator view again; it pops right back up. So what's inside? Very much like what we saw in pgAdmin, we can see all our different databases: we have our contoso_100k database, and we also have these folders for administration and system info, which I use less. I'm typically staying inside contoso_100k, going into Schemas, then public (because that's the schema we care about), and then viewing all our different tables. If I expand something like the sales table, I can go into all its columns along with their data types, and I can also see a host of other information like foreign keys and whatnot. One thing you may have noticed is that there are numbers over here on the left-hand side. These aren't row counts; if you hover over one, you'll see it tells you how much disk space that specific table takes up, so you can get a general idea of how big these tables are on disk space alone. I can see that the sales and customer tables specifically are pretty big, relatively speaking; of course, this is actually a pretty small database. Now, what I like about tools like DBeaver is how easy it is to dive into these tables without having to write a SQL query. If I want to see what's in the sales table, I can right-click it and go to View Table. Now, this side is the database editor, and it
actually has a tabbed view; I could open the currency exchange table as well and cycle through multiple tabs. With this, I can view a bunch of different things: under Properties, I can look at all the columns, foreign keys, constraints, and whatnot. Next up is Data, where I can look at the different columns and scroll through much more easily, similar to an Excel spreadsheet. The other one is the ER Diagram, the ERD, which shows how your tables are all connected together. I actually feel that, compared to pgAdmin, this one is more realistic and shows how they're all connected; I don't know if you remember from pgAdmin, but there everything connected into a single line going all over, and it was a hot mess. DBeaver does a little better at this. Anyway, the view I'm typically looking at most is the Data one. I can look at it as a grid or as text; Text I don't find very useful at all, except when I need to copy and paste, so Grid is mostly where I stay. This GUI has a lot of different options for interacting with these tables and viewing them: you can enter a SQL expression to filter the data down, and you can also put in custom filters. Down at the bottom we can do things like add rows and remove rows; I typically do this with SQL, so I'm not going to mess with it here in DBeaver. You can also cycle through the different pages and whatnot. One feature I do find useful, however, is Export Data: anytime you have data you want to get out of here, you can send it to a variety of different targets. Typically I'm either exporting it to CSV or exporting it to SQL, which turns it into SQL INSERT statements. All right, enough of that; let's actually get into setting up our project folder. I don't need these two tables open; it's also asking me if I want to save these
changes to the dataset database. I didn't really change anything, and I don't want to change anything, so I'll click No. So we're going to create a project folder. I'll click Projects right here so we can save our SQL files as we go along. Right now we just have this General project, which has Bookmarks, Dashboards, and Diagrams; we can also see it right below. I don't really care about the General one; I want to create a project specific to the course we're working on, so I'll come up to the top and select Create Project, and I'll call it "intermediate sql project"; real original, I know. For this project, I'll leave the default location, which is inside the DBeaver folder; I could uncheck this and change it to wherever I want, just so you're aware, but I'm leaving it at the default. I don't want to add the project to a working set, so I'll go ahead and select Finish. So now I have this intermediate SQL project, but General is still my active project, so I'm going to right-click the new one and say Set Active Project, and it shifts to bold. Additionally, if I go back up to the Window menu and over to Project Explorer, I can have that appearing below; if you didn't close it out, it was probably showing General, and it should have switched. Anyway, I like this type of view because I can switch between them, but if you notice the Database Navigator, we no longer have our database in there. What we can do is go back to Projects, and it actually makes it pretty easy: right here under General I have the Contoso connection, but if I click the Connections of the intermediate SQL project, there's no database there. We loaded the database into General; we want to move it down here. And bam, now when I open this project folder, which is the active one in
this case, it's inside the Database Navigator as well. So this is pretty neat: I can keep this all grouped together in a single project. Now, by default there are four different folders in here, all of which should be empty: Bookmarks, Dashboards, Diagrams, and Scripts. Bookmarks are just what they imply: if there's something you frequently go to, you can put it there. In this case, let's say I frequently go to the sales table; I can stick it in Bookmarks, and now anytime I need to get to it, I just click on it and bam, it appears right here. Next we have Dashboards, with nothing in it, but you could create a new project dashboard; if you remember, pgAdmin had a default dashboard that showed all the different sessions, transactions, stuff like that. I'm not a database engineer, I don't really care about all that, so we're not using it. Next are Diagrams: if I wanted to, I could create a new ERD and call it "contoso erd". This does have our core tables in it, but it also has a host of other tables that just come natively inside a Postgres database whenever you install it, so they're all going to be there if you make it this way; you can filter them down, but we're not going to go into that. All right, the last thing is Scripts. How the heck do we create a new SQL script? Well, you can do this from SQL Editor in the menu, or just come up to the top here and select Open SQL Script. There are a few different things to notice now that we've done this: first, it tells us what the active database is and what the active database schema is. This is especially important when you're working with multiple databases, to make sure you're running queries on the correct one. Notice that when we opened this up, along with the script itself, we also have a new script underneath here; I'll minimize these, we don't need them. Anyway, we have the script in here; if I
want to, I can right-click it and go to Rename, and then we can name it appropriately; this is just a test script, so I'll make sure it's a .sql file and press OK, and it's been renamed. So let's run our first SQL query. We'll just make a simple statement: we want to select all columns from the sales table, and we want to limit this to 10 results. If you noticed, I was typing in all caps as I went along, and then it made everything lowercase afterwards; we're going to fix that in a little bit. Anyway, if I want to run this single query, on a Mac I press Cmd+Enter; on Windows, Ctrl+Enter. If you forget, you can just hover over these icons and click, and it also shows you the shortcut. Now, similar to what we saw before with viewing tables and outputs, I have a tab here, and underneath it I can explore the results in different ways, with Text or Grid; I can also page through it, and I can export the data, so there are a lot of options for manipulating this and diving into all of it. Let's run a slightly more complex query just to demonstrate the power of this. I'm going to turn this into a CTE and then run a query on that CTE. I'll move it down, put in the WITH keyword, call it sales_copy (because that's what it's going to be), give it the AS keyword, and then open parentheses. Now I need to clean this up; I like indentation and things like that, and I can actually highlight it all, right-click, go to Format, and there's this option to format the SQL; on a Mac, the shortcut is Ctrl+Shift+F, so I can just do that instead, and it makes it slightly more readable, although it didn't indent everything. As always with any CTE, I'll go down here, do SELECT star, and pull from the sales_copy table above. If you notice, it automatically gave me this error message saying, "hey,
sales_copy is not located above." And actually, if I even try to run it by pressing Cmd+Enter, it tells me sales_copy doesn't exist, even though I can clearly see it's up there. That's because DBeaver automatically treats any blank lines as a statement delimiter; basically, it treats a blank line like a semicolon at the end. So what we need to do is remove that blank line. I'm still getting an error message, and I don't know why... oh, it's running now, and it cleared; okay, I just had to run it once. Anyway, I don't like those two things right now: I don't like how it automatically makes everything lowercase, and I don't like that it automatically gives me an error message when there's just a blank line in there. So we need to change some settings before proceeding. If you're on a Mac, you select DBeaver up in the menu and then Preferences; on Windows, I think you select File and then Settings. This Preferences window opens up and lets us control things inside the editor itself. I want to control the SQL Editor, specifically the Formatting: right now the keyword case is set to default, which is lowercase; I want it upper, so I'll change it to that. You can also control your indent size; it's down at something like two, and I like bigger indents, so I'll do four. Also, if you noticed, previously it wasn't indenting things inside parentheses; I like that, so whenever I check this, it does indent them, and I'll have that selected as well. The last thing to update is under SQL Processing: in here we have "Blank line is statement delimiter", set to Always. Remember, sometimes we may have blank lines in there; I don't really like this setting, so I'll change it to Never. You can also change it to Smart, but I can't guarantee that one. Okay, we'll Apply and Close. Now I'll select all of this, press Ctrl+Shift+F, and it formats it
The formatting is now exactly how I like it: it changed all those keywords to uppercase and indented things the way I want. Now, that was just one query. I can put a semicolon at the end, and if I want another query on top of it, I can keep it in the same script. A few things I didn't call out before: any time you're typing, you get autocomplete, and it tells you what each suggestion is. Similarly, if I'm typing a function, count in this case, it tells me it's a built-in function in the database. And after I type FROM, it knows I probably want a table, so I can put in something like customer. Now I have multiple statements in here. If I just want to run one of them, I'll close the results below, place my cursor in that statement, and press Cmd+Enter; it runs only that one and tells me there are 104,000 rows. The other thing I can do, say I want to run all of these statements, is use the Execute SQL Script icon; for me the shortcut is Option+X, and on Windows I believe it's Alt+X. We'll close out of this first, select here, press Option+X, and it automatically opens each result in a different tab.

One thing to note: how do I keep track of which results belong to which query? You can put a brief comment at the top of each statement using two dashes; we'll call this one sales copy, and the one at the bottom customer count. Running this again with Option+X, it prompts me: there are three unpinned results tabs, do I want to close them before executing the new query? I want this behavior every time; when I run a new query I just want to see the new results, so I'll check "Don't ask me again" and answer yes, I do want those closed. And that didn't work, because I actually told you wrong: you need to explicitly mark the comment as a title. You write "title:" followed by whatever the title should be. I'll do that for both statements, press Option+X, and now both results tabs below are named. This is very convenient when you have multiple queries and, obviously, multiple results tabs. There's also the Statistics tab, which simply shows the statistics for the run: that it executed two queries, how long they took, and so on.

Now, with this test script we've created, I can see it's not saved because it has an asterisk. I can go to File and select Save, or press Cmd+S (Ctrl+S on Windows); the asterisk goes away, and now I can close it. If I ever want to go back to that script, I can just open it up here and run it as necessary. So hopefully you're following along and you've installed DBeaver, because you'll need it unless you plan on using Jupyter notebooks or Colab to run the future queries. We now have some practice problems for you to go through to get even more familiar with DBeaver, all of these settings, and actually running SQL queries. With that, we'll be jumping into the next chapter on building views, so I'll see you there.

Welcome to this chapter on views. Views actually take up only a small portion of it; the majority of this chapter is an intro to the project, using views. We'll go through three lessons. This lesson specifically is an intro to views: how to create them, how to delete them, how to manage them, and why they're so important. In the second lesson, we'll use the view we create here to analyze further and answer the second question in our project. Now, our project in total
has three questions, which I'm going to showcase in a bit. You may be wondering, "Luke, what happened to question one?" Well, we actually started answering question one earlier, and we'll just be building on it further in future lessons; don't worry, I'll get you up to speed. The third lesson in this chapter covers installing VS Code, a code editor that makes it super easy for us to build up our portfolio project and share it on the internet.

Before we get into views, I want to showcase what we're going to build in this project. Specifically, we'll be sharing it to your GitHub profile, and it will detail everything you've done. If you're not familiar with GitHub, it's a place where you can store, share, and collaborate on files. This menu area shows all the different files in the repo: we have some SQL files, a README (which I'll discuss more in a bit), and an images folder; if I click on it, I can see it contains an image. Getting to that README: the README on a project's front page is displayed right below the file list, so I can document all the different analysis I've done there. If an employer is interested in the different analyses I've done, they can come to my GitHub and view it all.

Now you may be asking, "Luke, why the heck do I need to install VS Code? You already had me install DBeaver. What am I doing with this?" Well, if you've taken my basic course, you know VS Code is really powerful, not only for writing SQL queries but also for other coding projects, like Python. The special use case in this project is building our README. Here I've typed out all the different portions of the README, and if we preview it, I can see it all dressed up on the right-hand side, exactly how it will appear on something like GitHub. Unfortunately, DBeaver doesn't have these capabilities. On top of that, I can also push this to GitHub right from this GUI, so there are loads of benefits. For those coming from my basic SQL course, you've used VS Code and you're familiar with it; there won't be a lot of new material here, and you could probably even skip the VS Code lesson.

So let's get into views. First of all, what the heck is a view? It's a virtual table that shows the results of a stored query. For example, in our next exercise we'll create a view; you can find views under the Views folder beneath the public schema, and we're going to create one called cohort_analysis. When I click on it, I see that virtual table: it holds all the results of a certain query. Specifically, I can go under Properties, look under Source, and see the SQL query used to generate this virtual table. With this virtual table, in this case cohort_analysis, I can open up a script, clear it out, and select all the rows from the view by name; cohort_analysis even appears in autocomplete, telling me it's a view, and when I run it I get all its results below. Now, views are super important and are necessary to level up your SQL skills. They prevent you from having to write the same query over and over again, because you know what happens when you write the same query over and over: you're eventually going to make a mistake. A centralized view prevents that, and it also ensures that any other queries depending on the view get updated if, for some reason, you
have to update that view. Anyway, I'm getting ahead of myself. What's the syntax? All it takes is the keywords CREATE VIEW, a name for the view, and then AS followed by the SQL query that should feed the view. So let's create our first view. (I'm actually going to delete the view we're about to create first, because you don't have it yet; just open up a blank script.) We'll write a simple query that gives us the daily revenue: we select the order date, along with the sum of quantity times net price times exchange rate, aliased as net_revenue. This is from the sales table, and we need a GROUP BY since we did an aggregation. Alright, let's run this bad boy, and it looks like it all worked. One thing to note: I didn't filter this or put it in any order, and you can actually do that right in the results grid. I can click one of these filters and say order by, descending, and it shows me we start in April 2024 and go backwards through the total revenue. Pretty neat.

Alright, so this is the query we want to turn into a view, so let's create it using that syntax. Specifically, I write CREATE VIEW, give it the name daily_revenue, and then just use AS; you don't need to wrap the query in parentheses. I'll run this with Cmd+Enter, and you should get a message at the bottom telling you the query finished. Now, if I go to Views, there's nothing there: I need to actually refresh it. You can do that by right-clicking and choosing Refresh, or with the shortcut shown there, F5. Now daily_revenue is there; I can double-click to open it, and it has a few tabs underneath, like Properties, where we can see the Source, which gives us the query
that we needed to create the view, so we don't need to save our query separately; it's right there. It also shows our data, and finally our ERD, which in this case doesn't connect to anything else; it's just its own table. That's it. Now, if I want to access this view, all I have to do is SELECT * and specify FROM daily_revenue; since I put a semicolon after the last statement, only this one runs when I press Cmd+Enter, and all the results appear below.

OK, let's say I'm done with this view, or I don't need it anymore. There are a couple of ways to get rid of it. I can right-click it and choose Delete; it then prompts me to confirm that I want to delete the view daily_revenue, and asks whether I want a cascade delete, meaning that if other views are built on this view, it will delete all of those as well, so you need to decide whether that's applicable and click yes or no. We're not going to delete it via that method; I'll confirm it's still there by refreshing. Instead, what we can do, once again after a semicolon, is use the DROP VIEW keywords followed by the view name, daily_revenue. Running this with Cmd+Enter, it completes quickly, and coming over here and pressing F5, we can see the view is no longer there. A very important note: deleting a view is permanent. You can't recover it once it's dropped, so make sure you really do want to delete it.

Now that we have the basics of views, let's get into creating the view we need to answer a few of the questions in our project. Once again, a reminder: we're only answering three questions for our project, and we'll work on the second one in the next lesson. You haven't created this yet, but here's a preview of what we're going to build.
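Before moving on, here is the full daily_revenue lifecycle from this lesson in one place. The column names (orderdate, quantity, netprice, exchangerate) follow the Contoso schema as shown in the video; treat the exact spellings as assumptions:

```sql
-- Create the view: one row per order date with its total revenue.
CREATE VIEW daily_revenue AS
SELECT
    orderdate,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY orderdate;

-- Query it like any table.
SELECT *
FROM daily_revenue;

-- Remove it when no longer needed (permanent: the stored query is gone).
DROP VIEW daily_revenue;
```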
We'll get to it eventually, and like I said, we're going to have our view in here, our different SQL files answering our three questions, and our README. This CREATE VIEW is what we start working on in this lesson; we're not necessarily going to finish it here (we'll finish it in the text cleanup lesson), but we'll get a bit of a start. So what the heck does this view actually provide that we're going to use? We'll shortly be diving into a more advanced cohort analysis than we've done previously, and we need a table, a view if you will, to help us out and speed up that analysis. Specifically, this table will be broken down and aggregated to provide key facts about each customer: when their orders were, how many orders they had, their first purchase date, which cohort they fall into, and additionally some customer information from the customer table. This is going to be super helpful, especially for something like total net revenue, which is quantity times net price times exchange rate: the calculation is just already there, so I don't have to worry about it.

So let's start building this view by sketching out the query, starting with the sales table and bringing in only the information we need. First I'll write SELECT * for now, FROM sales. We're actually going to involve multiple tables, so I'll go ahead and add the alias s (I can also press Tab and it fills that in). Running this, we can start picking out the columns we want from the table below: I know I want the customer key along with the order date, and as always the total net revenue, so quantity times net price times exchange rate, aliased as total_net_revenue.
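At this point, the in-progress query looks roughly like the following, including the GROUP BY the aggregation will require; column names are assumed from the Contoso schema:

```sql
-- First slice of the cohort view: revenue per customer per order date.
SELECT
    s.customerkey,
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue
FROM sales s
GROUP BY
    s.customerkey,
    s.orderdate;
```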
We'll do one more thing: get a count of the number of orders, using the order key. Because we did that aggregation, we need a GROUP BY, and since I'm lazy I'll just copy the columns from up here and paste them below. Alright, let's run this. Looks like I had a typo over here; look at the syntax highlighting helping me spot it. Running it now, bam, we've got the results we want below, and everything looks like it's aggregating correctly. Now, with this table I also want things like the first purchase date and the cohort year. That's going to take window functions, and remember, I don't want those in a GROUP BY, so we'll need to create a CTE first. But what I'm getting at is that, for now, I'm going to move over to the customer information and extract some key customer columns to put into our source table. We need those from the customer table, so I'll do a LEFT JOIN, which keeps all the rows from the sales table and attaches any related rows from the customer table. I'll give customer the alias c, and we'll link the tables on the customer keys of both. I'll run this just to make sure there are no issues; OK, it runs fine. So what information do we want to add? I'll do c.* so we add all of it, and then we'll refine it down. It's telling me I need c.customerkey in the GROUP BY; we won't keep it there, but it clears up the error for now. Let's try that again. Alright, scrolling over we can now see the customer information, and I want things like the country, the customer's age, the customer's given name, and their surname; that should be it. Now we need to put all of these in the GROUP BY, because remember, we're doing an
aggregation right here. So I'll come down and put those columns underneath, clean it all up, and then remove the duplicate customer key, since I don't believe we need it anymore. Now let's run the query and make sure everything looks good; we have all those columns in it.

OK, now what we need to do is extract, for each of these customers, their cohort year: the year of their first purchase. So I'm going to put all of this into a CTE. I'll indent it over and add space above so we can put in WITH, give it the alias customer_revenue, follow it with AS and an opening parenthesis, and finally add the closing parenthesis. Then, to make sure this is all correct, I'll just do SELECT * FROM customer_revenue; running it, we can see it provides exactly the same information we had before, so we're on a good path.

Now we need window functions to use the order date to get each customer's minimum order date, which assigns their cohort year. I want everything from the customer_revenue CTE, so I'll give it the alias cr and select cr.*. Then, to get the minimum order date, we use MIN on the order date; it's a window function, so we add OVER and PARTITION BY the customer key. So for a customer like 180, it looks across that customer's rows and finds the minimum, and we give it the alias first_purchase_date. Let's run this and see how it's doing: for rows two and three (I've got to expand this out), the order dates are 2018 and 2023, so the minimum, the first purchase date, should be 2018, which it is.

Now we can go ahead and build another column for the cohort year, and all this is going to be is
a copy, if you will, wrapped in EXTRACT: I'll copy the MIN ... OVER expression from above, paste it in, and give it the alias cohort_year. Let's run it. I was rushing too fast, I realize: I have to actually extract something from that window function, namely the year. Now running it, we can see that for customer 180, rows two and three do in fact get the cohort year 2018. This is good.

So now let's turn this into a view we can reuse. We add a line at the top and use the keywords CREATE VIEW, name it cohort_analysis, and once again use AS; we don't need to wrap the query in parentheses. OK, running this with Cmd+Enter, it ran super fast. If the view isn't appearing, remember to press F5; now it appears underneath, showing all of our columns in the ER diagram and on the Data tab.

So now, with this view, analysis becomes super simple. I'll open a new script; say I wanted to analyze something like the total revenue per cohort. That's now trivial: I select that cohort year and the sum of our total_net_revenue, from our actual view, cohort_analysis, and since we did an aggregation, we GROUP BY that cohort year. Running this, we see our different results; I didn't do an ORDER BY, so I'll just use the grid to order descending. And so now, in a super simple query, I get the answer at lightning speed, because I don't have to redo all the analysis I did before; it's already captured in the view.

Before we wrap up this lesson: this is future Luke, as you can tell from the different flannel. We made a little bit of a mistake
in our view, specifically with the column holding the number of orders: I didn't give it an alias. What do I mean by this? Going into cohort_analysis (any time I want to use it, I press F5 to make sure it's fully up to date), if we look at the Data tab, everything looks fine except this one column. It holds the number of orders, but we unfortunately left it named count, which isn't descriptive; we really need to change it to a more descriptive name. This problem actually comes up quite frequently, so it's a good use case to walk through. Remember that under Properties, then Source, we can see all the code. Notice the command there is CREATE OR REPLACE VIEW; previously we just saw CREATE VIEW. CREATE OR REPLACE lets us replace the view if it already exists. Unfortunately, I can't just edit this source to alias count here as num_orders (which is the alias I want) and alias the count down here as num_orders too. If I make that change and click Save down here, it asks whether I want to execute it, and then it says: cannot change the view column count to num_orders; instead you should use something like ALTER VIEW with RENAME COLUMN to change the column name. Now, ALTER VIEW is a great command to know: with it you can add columns, remove columns, and, in our case, rename columns. So we'll use that syntax to rename it, but that's only going to be a partial solution, as we'll see. We use the keywords ALTER VIEW, name the view itself, cohort_analysis, and then use RENAME COLUMN. The syntax highlighting flags this as wrong, saying a table reference is expected; don't worry about it, it's a false warning. And which column do we want to rename? We want to rename count, and we want to rename it to num_orders.

Now, running this with Cmd+Enter, it looks like it ran fine. When we come back over here, there's a star next to cohort_analysis, meaning it was updated, so we select it inside the Database Navigator and press F5 to make sure it refreshes. I'd actually recommend closing the old editor tab for the view, if you haven't already, because we've changed some of its properties and we don't want to mess with stale state; we want to see the newest version. It asks whether I want the changes persisted to the database, and I say no. Clicking cohort_analysis again to get it up to date, we can see that it didn't change count up here in the stored query, but it did change the output column down here to num_orders. I'm a perfectionist, and it's also just good practice in general: I want it changed in both locations. To do that, we actually need to drop this view, like dropping a table, and then create it again. So I'll copy all of this code (we don't need the ALTER VIEW script anymore) and paste it in here; remember, we want AS num_orders on the count up here, and down here, since the rename already applied, we can remove it. Before we run this CREATE OR REPLACE VIEW, we need to actually drop the view: this one is simply written DROP VIEW followed by the view name, cohort_analysis. I'll put a semicolon after it and execute the entire script with Option+X; it tells me it ran those two queries and finished. Once again, I'll close the cohort_analysis editor, make sure it's selected in here, press F5, and open cohort_analysis again. Looking inside the Source, we can see num_orders updated in both locations. So crisis averted: the column name is up to date in both places and our query stays concise.
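Putting the whole lesson together, the finished view with the corrected alias comes out roughly like this. It's a sketch against the Contoso schema (PostgreSQL-style EXTRACT and ALTER VIEW ... RENAME COLUMN), with column spellings assumed from the video:

```sql
-- Dropped first with DROP VIEW cohort_analysis, since
-- CREATE OR REPLACE cannot rename an existing output column.
CREATE OR REPLACE VIEW cohort_analysis AS
WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,  -- the alias that was originally missing
        c.countryfull,
        c.age,
        c.givenname,
        c.surname
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY
        s.customerkey, s.orderdate,
        c.countryfull, c.age, c.givenname, c.surname
)
SELECT
    cr.*,
    MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
    EXTRACT(YEAR FROM MIN(cr.orderdate)
            OVER (PARTITION BY cr.customerkey)) AS cohort_year
FROM customer_revenue cr;

-- Alternative (partial) fix: rename only the view's output column in place.
-- ALTER VIEW cohort_analysis RENAME COLUMN count TO num_orders;
```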
Alright: now we have a few examples for you to go through to get more familiar with creating views. We'll be using some of the examples from previous lessons to build views from, so you can reuse them. In the next lesson we'll build further on the view we just made in order to answer that second question in our project and further analyze the cohorts. Alright, with that, I'll see you there.

Welcome to the second lesson. In this one we're diving into a question for our project: how do customers in a particular group generate revenue? Regarding "a particular group," we've broken customers into groups before when doing cohort analysis, and that's what we're continuing from here. Spoiler alert: in this analysis we're going to look at the different cohort years and, at the customer level, see how customers spend money over time. Generally it's good for customers to spend more, because it means more money, and so we'd expect that over time a company learns to extract more value from its customers. Unfortunately, we find just the opposite.

So let's quickly re-examine what we previously did with cohort analysis. I'm not going to walk through this entire query; we built it inside our window functions chapter. With its results we were able to plot the impact of each cohort on future years' total revenue. As expected, net revenue goes up, and every year's net revenue has contributions from previous years' cohorts, since your cohort year is based on the year of your first purchase. Honestly, this didn't really uncover a lot for us; does it tell us that much? We went even further and also analyzed the total number of customers, and we saw that it went up as well. Once again, not a lot of insight.

So what do we need to do? Using the view we created in the last lesson, we're now going to take it a step further: alongside total revenue and total customers, we'll finally get each individual customer's revenue, on average at least. So let's jump into building this query. We'll use cohort_analysis; I can dig into it and see that it contains all the same values as in the last lesson. From it, I want, per cohort year, the total number of customers (using that customer key) and the total net revenue for that cohort. I'll start a new script and go ahead and fill in the FROM first, specifically FROM cohort_analysis. I like doing this because when I then type a column like cohort_year, which is what we want, I can see the column does in fact exist, and the syntax highlighting works correctly as I go. Now we want the total customers per year, so we need a distinct count: COUNT(DISTINCT customerkey), aliased as total_customers. Then we need the sum of the revenue: SUM(total_net_revenue), aliased as just total_revenue. Alright, we did an aggregation, so we GROUP BY cohort_year. Let's run this and see what we have: we get back the total customers and the total revenue.

Now let's look at this visually, because I think it's important to understand why we're drilling down to the customer level. Here I've plotted it with bars for revenue (on the left axis) and a line for the count of total customers (on the right axis). As expected, the line correlates, to an extent, with the
size of the bars themselves. So, simply put, more customers equals more revenue, which is nothing new and not an insight you'd bring to your boss. We actually need to dive deeper into key characteristics of the customers to give real insight into what their spending habits may be like. So let's get this customer revenue: all we're going to do is take our total revenue up here and divide it by the total customers, and we'll give it the alias customer_revenue. I'll run the query, and now we have the customer revenue in the results. Quick shout-out to views here: look how simple this query is, now that the data lives in that view; it makes this super quick. Anyway, back to actually exploring this customer revenue over time: we can see that it's basically dropping. Let's look at it visually; with ChatGPT plotting this, it shows that early on these customers spend quite a bit, around $3,000 per customer, but then it starts to go down; that's with an exponential trend line that I had ChatGPT put on there. Anyway, this is concerning: per cohort year, customer revenue is dropping year after year, when I'd expect, like I mentioned at the beginning, that it either remains the same or goes up over time. That's not a good sign. Now, I will say this: remember that older cohorts, say cohort 2016, have all the subsequent years contributing to their totals. You could be part of cohort 2016 and also buy something in 2024, and so I'd expect, in general, that earlier cohorts would have a higher customer revenue. So we need to adjust our query to account for this. But you may wonder how the heck we do this: do we use some sort of window function and limit the time window for each cohort? And what would that window be, one day or one year, that they're in their
cohort during which purchases are attributed to that cohort's revenue? Well, I actually did some further analysis on this; you don't need to run this query yourself. What does it show? We're not going to walk through all the parts of the query step by step, because that's not important; the main thing is what it provides. What I had it do is calculate how much of the total revenue comes in based on the number of days since the first purchase. In total, about $127 million was spent on day zero, i.e., the day of the first purchase, and after that it drops significantly, to around $31,000, $51,000, and so on. With this total revenue I went a step further and converted it to percentages, and we can see it goes from 61% down to less than a percent; I also plotted it for the more visual types. So what we can extract from this is that, in general (or on average), a customer spends about 60% of their total revenue on the first day and minimal amounts after that.

So we'll go back in and adjust our query to account for this: for a cohort year, we'll only count revenue from purchases completed on the customer's first day, and we won't take anything else into account, because the majority of purchases happen on the first day. How can we do this? Conveniently, in the view we created we have not only the order date but also the first purchase date, so we can match those two dates and keep only the rows where they're equal. I'll put a WHERE clause in, setting the order date equal to the first purchase date, and that's really all we have to do. Now, pressing Cmd+Enter, we have some updated results, and it looks like our customer revenue dropped slightly; plotting it, it now sits slightly below $3,000, whereas before it was right around $3,000.
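The final query from this lesson, with the first-purchase-day filter, comes out to roughly the following; column names are assumed from the cohort_analysis view built in the previous lesson:

```sql
-- Average revenue per customer per cohort year,
-- counting only purchases made on each customer's first day.
SELECT
    cohort_year,
    COUNT(DISTINCT customerkey) AS total_customers,
    SUM(total_net_revenue) AS total_revenue,
    SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS customer_revenue
FROM cohort_analysis
WHERE orderdate = first_purchase_date
GROUP BY cohort_year;
```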
anyway the main thing here is now whenever i have this exponential trend line which i thought that you thought like with removing those previous years that had more spending actually it’s more pronounced that the future years such as 2022 and 2023 spend even less so this is a pretty big breakthrough that we’ve come to basically uncover in this and could lead to especially at the trend that we’re going at right now this could have serious implications on the business and would be a great insight to bring up to our superiors or to our stakeholders so that’s the end of what we’re doing for answering this question on analyzing customer groups as i feel like we’ve found a pretty significant insight with that what we’re going to be doing in the practice problems is actually going in and doing an analysis of the revenue and the total customer count but looking at it over time on a monthly basis to get to find out why do we have certain years lower than others and to uncover other insights that i’ll detail more in the beginning of the next video after you get done with those practice problems we’ll be jumping into installing vs code which we’re going to be using to document the insights for our project specifically in the next lesson we’re going to be documenting what we learned from this question specifically with that see you there welcome to this last lesson as we’re going through this chapter on an intro into our project specifically for this we’re going to be going through and setting up vs code now as a refresher on why we’re doing this and not using something like dbver dbaver is great at actually going through writing sql queries analyzing them and improving them but when it comes to actually sharing it and collaborating with others using things like github or even documentation tools like markdowns it gets quite hard so for both my workflow and kelly we like to use this in cooperation with vs code and this code editor is going to allow us to do two major things 
for this the first is we’ll be able to build a readme or a markdown file that will document all of the different analysis that we’ve done whenever we want to go and share this and the second it makes it super easy to push this up to github and share it with others to see the work from our readme or markdown file now if you took my basic sql tutorial you probably already have vs code so you can skip that portion of the lesson but we will be going on to how to actually build out that readme specifically for question two which we answered in the last lesson but before we jump into that we’re going to quickly go over the analysis you did in the practice problems where we uncovered even further insights in the last lesson we went through and evaluated how different customer groups generate revenue specifically we broke it down by cohort year and we found out what is the average customer revenue per cohort year at this macro level we were able to see a callout specifically that there’s a general trend going down for the per customer revenue which is not good and so as your practice problem assigned you went further into analyzing why we have these dips was there something deeper going on in the data set besides just customers spending less so to catch people up that didn’t do the practice problem we went through and analyzed customer revenue and total amount of customers on a monthly basis we got this final table which has that total revenue total customers and then the customer revenue let’s start with the customer revenue first because we’ve just been talking about that as we saw in that last lesson it’s slowly going down over time so analyzing at the monthly basis not really helping out that much now if we look at something like the total revenue and the total customers and we plot this we get something like this where the blue bars are the monthly revenue and the line chart here is the total number of customers so looking at general trends overall
if we actually plot a line of best fit we would say or we would think that our revenue is going up over time or our net revenue is with the exception of a pretty big dip down in 2020 probably due to some sort of pandemic that happened during that time period and then it rose after that and then it was slightly less down in 2023 anyway the major insights that i think we are applicable to us from that last analysis is if you actually look at it how we said previously you know more customers equals more revenue it does match up but then when we get to 2022 and 2023 we can actually see that there’s pretty large gaps in between here there’s a lot of customers but the revenue is not matching which helps us explain even further what’s going on in this graph basically yeah we’re getting higher number of customers but customers are spending less anyway pretty interesting insight on this let’s get into installing vs code if you navigate over to the link on the screen you’ll get directed to the download page for visual studio code more recently microsoft has been advertising this with github copilot so it has this hey it’s redefined with ai and they’re really pushing that we’re not going to go too much into ai features we’re just going to be downloading this code editor you should download some sort of file click it get it launching and in the case of mac it unzips this file and it’s automatically the visual studio code app which i can just take and drag and put into my application folder so it’s in a much safer more secure location if you’re on a windows machine it’s going to walk you through an installer so quite a bit more steps but it’ll actually direct you on where you could actually put this vs code and if you want an icon anyway regardless you get a system message asking if you want to install this app that’s installed from the internet yeah you’re fine with it open it upon launch you’ll get this welcome message that will actually guide you through a step-by-step 
process that if you want to do you can do but we’ll be covering all the key features you need to know for this so don’t feel like you have to do this so let’s briefly explore visual studio code before we actually get into installing or setting up our projects folder that we’re working with over here on the left-hand side is our activity bar and whenever we press our activity bar a sidebar slides in or out depending on if we want it there this first one’s an explorer we don’t have a folder open yet i’m just going to open a dummy folder you don’t need to do this and anytime you’re opening any of these it asks if you want to trust the authors i’m opening from my own computer i trust myself i think so at least anyway this basically shows a file breakdown of what’s inside of this folder and then folders themselves have these carets that you can drop down or open if you want to see inside of it if i actually want to see these file locations you can just rightclick it and then on mac it’s reveal in finder on windows it’s going to be reveal in file explorer anyway you can see that the structure of this is the same as what we’re seeing over here in that file explorer all right other things in the activity bar we have a search functionality so if i want to search sql all the different occurrences will pop up here and i can go to one and it will take me right to it next is source control which controls how we’re going to get this onto github we’ll be covering more of this and interactions with github in the last chapter so don’t worry about this too much oh and i guess what i forgot to mention previously whenever this popped up over on this right hand side so if i close this sidebar right here this is our code editor itself so if we actually open back up sorry and open that second query i’m going to close this now i’m going to actually expand this by pressing command plus or control plus on windows we can see that we
have that sql query right inside there and if i needed to add anything like i did want to use an alias right here i could just type it in and then now we’re noticing that up here in the top there’s this white dot appearing that means we have changes that are not saved you can just save it by pressing command s or control s now what we’re going to be doing later on is actually going into the readme and building this readme out and what i really like about vs code is as you can see we have all this fancy dancy markdown language typed into here and if i wanted to see what it actually looks like with the readme selected i could select this right here and it allows me to preview the readme with all the different images and whatnot right next to it as i’m scrolling through so one of the main benefits why we’re using vs code all right i’m going to go ahead and close all this out also going to zoom back out so we can see everything okay last two things they have a run and debug section we’re not going to be using that and then finally extensions it’s really popular if you’re using this for a particular programming language like python or whatnot to have the appropriate extension installed so in this case you’d install the python extension to use python overall though there’s not a lot of extensions or really any that i think you need to install for this if you do want to install one just to see what it’s like i recommend this one on code spell checker when i click it it opens right up next door and if i want to install it i just click install it asks if i trust this publisher yes i do and now i have this spell checker inside of here so if i actually went back to that sql file we had previously it will now go through and flag some of these keywords that don’t have an underscore and it calls it an unknown word you can actually go through and try to do a quick fix with it but those are the column values that came with the database so we’re not going to change
them at all mainly i find it useful for if say i need to create a new alias and i’m going through it and i wanted to sum something like total customers and i assign the alias if i were to assign an alias with a misspelling in it like this butchering customers it’s actually going to call it out and so i know that i misspelled it there anyway that is extensions no i don’t want to save any of this the other two things in the activity bar to be aware of are your account right here and then any settings specifically what i find myself gravitating towards using a lot is this command palette which has the shortcut of command shift p or control shift p that’s the one shortcut for vs code i would highly recommend having memorized when i do this this search bar comes up at the top and then i can search any type of settings i want to change in vs code so say i wanted to change maybe the color theme of this i would type in something like color oh i can see that i have preferences color themes and then it allows me to go through a menu and select a host of different options in here the last thing to note is the status bar down at the bottom we won’t be using it too much like in our case right now i’m zoomed in and i could go back and reset it you’ll also have information down in like the left-hand corner if we’re using git and then if there’s any issues going along with it let’s now get into setting our projects folder up that we’re going to be eventually pushing to github we’re going to be setting it up building out our readme and also adding all those sql files or the last sql file from the last lesson for this we want to open a folder that has our project in it but we need to create a folder if you will now back in dbeaver if you remember we created a project folder already and it has bookmarks dashboard diagram and even all of our scripts in it i’m not about reinventing the wheel i think we should just use this project right here as our project folder itself so we can do this a
couple different ways i want to find the location of this so i’m going to rightclick it and i’m going to say hey show resources in explorer now this is the projects folder i’m actually going to back out just one location so that we can see okay so this is the folder itself of this intermediate sql project along with those folders underneath it i want the file path location to this specifically i want to go to this folder location when we go to open this in vs code so i’m going to open up a folder in vs code on macs unfortunately the folder location that this is within is hidden so i’m going to hit a shortcut of command shift period and i know in my home folder of luke barousse that it’s in the library folder and then from there i can navigate to the dbeaver specific folder going into my workspace i then see the project itself and then i can open it it’s going to ask if you trust the authors of the files in this folder i do i’m also going to just enable this to trust the files and all folders within here so now we have inside of our explorer right here a few different folders if you will i’ll be honest we’re not going to use any of these at all actually one thing to call out is you may visually only see the bookmarks diagram and scripts but then we also have these other ones they’re called dot files and once again if i press command shift period we can see these dot files they’re actually just hidden files i’m going to keep them hidden by pressing command shift period again anyway key thing here is those folders and files aren’t important along with that we can be selective on what we actually put into github and we will need to be because we don’t really want to put these up there anyway let’s make our first file we come up here to the top and we select new file i’m going to give this the name of two i like to do two to basically designate hey this is the second question and call this cohort analysis.sql and as you notice as soon as i name that sql file
i got this new icon right there that shows me it’s a sql file when i press enter it automatically opens in the text editor on the right hand side what i can do is now copy that query that we did previously then inside of vs code paste it all in i have this white dot saying that it’s not saved so i can press command s or control s and it’s now saved in there so i’m going to go ahead and just close this all out one thing to note inside of dbeaver underneath that projects folder itself we’re not seeing any of the different files pop up like we just created that sql file that’s because we haven’t refreshed it if i actually rightclick select refresh the query now will appear inside of this project folder if you’re clicking refresh and it’s not refreshing or showing i actually had to just restart dbeaver to get this to work so yeah just word of warning anyway this sql file is now here and so i’m actually going to close out this script and this one so if i wanted to i don’t necessarily have to go back into vs code if i want to edit it i could edit it from right here say in this case i call this ca and then i save this pressing command s whenever i come back in here and actually check this sql file i can see that the alias got added i don’t want it so i’ll remove it and press command s so let’s get into building our readme file as a reminder that’s going to be basically the front page of our project detailing all the different analysis we did breaking down each of the three questions that we’ve gone through or will go through now key things to note for github we want this file to be called readme.md and that’s because github will specifically pick up on this naming convention and then display this below here so we’ll get into creating a file i’ll call it readme in all caps then this icon changes to that readme icon and for the file it’s a markdown file so i’m going to give it .md then go ahead and press enter and it’s open up right next to it first thing i’m going to do is just
start by giving it a title remember we can do different headings depending on how many hashtags we have i’m going to give it this one of intermediate sql sales analysis but what the heck does this actually look like we can click this icon right here for it to appear right on that right hand side so as we go through and type different things we can see how it is actually formatted as we go through this now what sections are we going to be putting into this well really it’s up to you you don’t have to follow all or even any of the things that i’m going to put into here but i’m going to recommend these major sections first we’re going to have a short little overview then from there we’ll get into our three business questions just giving a short description and then from there getting into the analysis approach breaking down each one of those we’re going to be doing question two on the cohort analysis and i’m going to walk you through that shortly now below these three questions in the analysis approach i only included one example right here we’re going to have our ending which has things like our strategic recommendations what we got out of this and any technical details of what we actually used to build this so let’s start going through and filling this in we’re going to start with business questions here i’m going to put one two and then three for the second question we did a cohort analysis and with it we were asking how do different customer groups generate revenue now i’m not really liking how this is formatted so i’m going to use some extra markdown in here putting double asterisks before and after cohort analysis and then it bolds it makes it stand out more so now let’s go into filling in the analysis approach for that second question i’m going to go ahead and just copy this one right here paste it below for this i’m going to title this section cohort analysis next we need to put in the analysis approach that we actually
used here so i put some short bullet points in here of how we track revenue and customer counts per cohort what a cohort is is that we’re grouping it by year of first purchase and we analyze customer retention at the cohort level the next thing i like to include is the query itself now instead of doing a link which we’re going to go over shortly you could put in a code block so i’m going to just do three back ticks in this case you can find it up here at the top of your keyboard anyway i could just copy this query right here and then put it into our readme and it’s displayed right here i could also format it as sql by putting sql after those ticks and then it’s getting color-coded like this oh this is all smushed i’ll be honest i’m a fan of dry or do not repeat yourself we already have this code somewhere so i’m actually not going to put that right here instead what i want to do is put a link to this sql file and we can do this by putting square brackets and in square brackets is what the text is going to be i’m just going to name it the name of that file and then in parentheses is the actual file location on a mac i’m going to press backslash i think on windows you press forward slash and then all the different things that i have access to are going to appear right here i’m going to select that first one of the sql file and then yeah now over here on the right hand side i can see whenever i click it oh the file itself actually pops up so i know the link is working properly and these links are also going to work on github when we get there all right the next section i have is on visualizations and that’s if you’ve generated any images you don’t have to do this per se but in my case i really like doing this so what i’m going to do is i’m going to come over here and i’m going to create a new folder and call it images i like to organize all my images in one location so i’m going to take that image it’s on my desktop i’m going to drag it over into the folder
itself it’s right here conveniently it’s just named image i’m actually going to change that by right-clicking it and selecting rename and call it this of two_cohort_analysis okay now going back into the readme itself the image name is just the alt text you can put with it mainly we need to be more pertinent about what the actual image name is once again i’m going to press backslash and then from there i want to go into the images folder and i want to select two cohort analysis oh it’s popping up right next to it no i’m good after this we’re going to dive into key findings and i’m going to summarize this calling out the main points that revenue per customer shows an alarming decreasing trend over time i call out specifically that 2022 and further years are just declining over time although net revenue is increasing that’s likely due to a larger customer base which we found out when we did deeper analysis and this finally brings us into the final section of what are the business insights and so for this i have the following that the value extracted from customers is decreasing over time and needs further investigation we need to find out what is the root cause of this in 2023 we also saw a drop in the number of customers and so we also saw a drop in revenue because of these two facts alone the company is facing a potential revenue decline or actually as we saw in 2023 an actual one so overall this is a good step in the right direction on what we need to recommend and where we need to go all right it’s your turn to now go through and build out that readme document and hopefully you’ve been following along with installing vs code and whatnot we do have a few practice problems for you to go through and get more familiar with vs code if you want that practice along with that we’re going to have that template for you available in order to build out this question number two all right in the next chapter we’re going to be getting into data cleaning my favorite part of data analysis so
i’ll see you there welcome to this chapter on data cleaning and in this we have three lessons we’re going to be covering for this in the first two we’re going to be covering some core concepts you need to know about data cleaning specifically this lesson we’ll be going over conditional expressions for handling nulls things like coalesce and nullif in the next lesson we’re going to be going over strings because from time to time you’re going to be dealing with strings and you’ll need to clean them up and maybe put them together or even separate them at the end of that lesson we’ll be applying all the concepts we’ve learned in order to further refine our view on cohort analysis finally in the third lesson of this chapter we’re going to be getting into answering question one from our project which focuses on customer segmentation now our project consists of three questions and in the previous chapter we focused on that second question on cohort analysis in this one we’re going to be using customer segmentation in order to find out who are our most valuable customers and we’re not only going to be using that cleaned up view of cohort analysis to help answer this but also some functions we learned earlier on statistics so we’re focused on two functions for this lesson and you may be like luke how the heck did you pick those out well if we go into the postgres documentation underneath the sql language we can see that underneath the functions and operators there’s a host of different ones that we’ve covered if you’ve followed along since the basics course we’ve touched on a lot of these and we have actually covered conditional expressions navigating under this we can see that for postgres there’s four main types and the main one is case which we covered back in basics but there’s two more that we need to cover around coalesce and nullif now postgres has this one on greatest and least these functions have different capabilities depending on which database
you’re working in also kelly and i don’t really use these that much so we’re not covering greatest and least anyway let’s get into how we can actually use coalesce and nullif in a very simple example you don’t have to follow along with this i’m just doing this for demo purposes so the easy way to demonstrate this is with a fake table i’m creating here i’m calling this a data jobs table it has three columns in it technically four i guess if you count the id and what does it contain well let’s just run this query to actually see we get this table and in it we have things like a job title a column on whether it’s a real job and then a final column on salary notice inside of here that there’s some null values we’re going to be using coalesce and nullif in order to clean these values up depending on what we want so let’s say for this column on is real job we wanted to fill in null values specifically let’s just assume that the database administrator treated all null values as no but we need to make it explicitly no well we can use the coalesce function and it returns the first non-null value from a list of expressions right now we’re just going to use one expression we’ll move on to two after this but we can provide a default value in this case of no and ultimately in our case this is going to be used to replace a null value with a default value so here is a query that returns back our original table let’s modify this to fill in null with no so i call that coalesce function for expression one i leave it as the column of is real job and then for the default value or the last one we’re just going to put in no let’s go ahead and run this bad boy oops forgot to put a comma run it again okay we can see now we have this column called coalesce after the function and it’s filled in yes no kind of a better practice would be to actually assign this an alias when done so that way we can actually see it and bam we have the updated column title now what’s going on here with that
coalesce function where we have this second expression well let’s say we wanted to fill in this null value for salary but we didn’t want to use a default value we wanted to just fill it in with if it’s null maybe just put in something like the job title specifically depending on where it matches up you would fill it in for the appropriate row that it comes from let me demonstrate it okay so we’re going to use that coalesce function again we’re going to leave salary in there and then for the second column we’re going to specify job title i’m going to leave the default value blank for right now finally i’ll give it an alias of salary pressing command enter now whenever i run this i’m going to make this a little bit bigger it says error coalesce types integer and character varying cannot be matched the problem is salary is an integer and job title is a string so anytime you’re using this to have one column replace another they have to be the same data type in this case we’d have to cast salary as text or varchar in order for it to match job title now when i go ahead and run this it actually works below and we have in fact filled in the appropriate value from that job title into salary you could put a default value in here i’ll just name it default value but in our case when running it it’s not going to come up so let’s reset this back so we can get into nullif now with our original table back say we had a scenario where we knew certain values weren’t correct or we didn’t want them in there and we wanted to make them into a null and like in this column of is real job kind of isn’t really an answer maybe we want to now make this into something like null now with nullif this returns null if the two expressions are equal otherwise it returns the first expression and this one’s even more simple in that it can have either expression one or expression two be either columns or single values let’s jump into it so let’s say we wanted to replace this kind of with
null we call our nullif function is real job would be expression one and then expression two would be that kind of as usual i’m going to give this the alias of is real job okay let’s go ahead and run this okay we in fact replaced that kind of with null now you don’t also have to just do a single value i could do like i said an expression so i could do another column in this case i could do salary once again i got an error message and it revolves around having a mismatch between the data types i can just fix this by casting salary as text running command enter and bam anyway that’s the point of nullif as it compares these none of them match so it doesn’t convert any of the values to null and this value was already null anyway let’s jump into some real world practice problems now previously whenever we’ve been doing any of our analysis all of our customer keys have conveniently always had some sort of purchase associated with them what we’re going to demonstrate is that all the customers in that customer table don’t necessarily have an associated purchase and whenever we merge them together they can actually have null values now if we were to run an average to find out what is the average net revenue per customer whenever we just have these null values they’re not going to be counted but say we do want to count them because hey they are customers and we want them to be zero instead that’s going to affect the average overall and we’ll actually get to demonstrating how much it’s going to change the average revenue per customer quite a bit now let’s get into combining our customer keys with net revenue to show those customers that don’t have any purchases previously we’ve gone through and in our sales table got the customer key and got the net revenue by multiplying quantity times net price times exchange rate then obviously we’re doing an aggregation and so we need to group by customer key now
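the join-and-average comparison this section builds toward can be sketched as one query; the table and column names (sales, customer, customerkey, quantity, netprice, exchangerate) follow the Contoso-style schema used elsewhere in the course, so treat this as a sketch under those assumptions rather than the exact query from the lesson:

```sql
-- average net revenue per customer two ways:
-- over spending customers only (nulls ignored by AVG) vs over all
-- customers (nulls filled with 0 via COALESCE before averaging)
-- Contoso-style names assumed: sales, customer, customerkey,
-- quantity, netprice, exchangerate
WITH sales_data AS (
    SELECT
        customerkey,
        SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY customerkey
)
SELECT
    AVG(s.net_revenue) AS spending_customers_avg_net_revenue,
    AVG(COALESCE(s.net_revenue, 0)) AS all_customers_avg_net_revenue
FROM customer c
LEFT JOIN sales_data s
    ON c.customerkey = s.customerkey;
```

the left join is what exposes customers with no purchases: their net_revenue comes back null, AVG silently skips them, and COALESCE is what pulls them back into the denominator as zeros.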
running this we have net revenue for all these different values in here if i try to filter to find any net revenues that are null and go to run this we’ll see that down below there’s no values in there so we hadn’t been seeing this previously but what we can do is with all of these revenues that we have right here we could merge this onto our customers table and then this will expose customers that don’t have a net revenue so what i’m going to do is convert this into a cte use a with statement call this sales_data and then assign it in parentheses as always i like to make sure that this works so i’m just going to select star from sales_data and go ahead and run this yep working below now what we want to do and i’m actually going to go into the contoso erd is take that customer table that we have here and merge onto it that sales table so we’re going to make the customer table our a table and then the sales table our b table so what we want to do is using our customer table use a left join to join on our sales table so we’ll move the sales data down we’ll say that this is going to be the left join and we’ll give it the alias of s for the sales data and then for the from we’re going to be doing from customer with the alias of c let’s go ahead and just run this to see if it’s working and i got this error message saying syntax error at end of input basically i didn’t say where or on what we’re going to actually merge specifically we’re going to be merging on the customer key of both of these different tables okay let’s try to run this now all right we have the customer keys and all the information from the customers table we don’t need all this information per se we just want to make sure we have all the customer keys along with all the different net revenues and as you can see there are now null values in here because there’s customers that don’t have net revenue so let’s modify what we’re actually bringing in here we’re bringing in from the customer table the customer key
and then from that sales_data cte above we’re bringing in net revenue running this boom simplified version of actually being able to view this so first let’s fill in these null values with a zero just to demonstrate it in a new column we’re going to call that coalesce function running that on net revenue and we want to replace those nulls with a zero running command enter bam we got this over here so not bad now what we want to do to show the difference between these is run an average on only net revenue and an average on the net revenue with zeros filled in so basically all customers i’m going to remove that customer key because we want to do an average over all of that so i’ll call the average function on that first column and an average on that other second column with zeros filled in for the null values running this we can see that the averages are quite different right so the first one is around 4,000 and the second one is less than 2,000 now these names for columns aren’t that descriptive so i’m going to name the first one as spending customers average net revenue because they’ve spent money so those are the only customers that we use for this and the next one is all customers average net revenue now running it more descriptive titles for this and viewing it visually we can see that when we look at all customers the average net revenue is actually less now this was mainly done for demonstration purposes cuz there may be situations where you do want to consider all customers in our case we are going to just consider only the spending customers in our analysis and not necessarily all customers so that’s coalesce we’re not going to do a real world example of nullif because it’s going to be frankly very similar except opposite if you will but what i have are practice problems for you to go through now and get familiar with both of these options in the next lesson we’re going to be jumping into understanding all the
different functions for formatting strings so with that see you there and in this lesson we’re going to be going further on data cleanup specifically around strings and how to format them we’re going to be covering four key functions that i find myself using from time to time and then from there going into modifying our view that we created on cohort analysis specifically we have columns on a first name and last name we’re going to combine them into one let’s get into it now in the last section we were looking at functions and operators specifically going down here we were looking at conditional expressions in this one we’re going to be going back up into this section on string functions and operators now inside of here there’s a host of different functions and operators that we can use on strings and the first one we’re going to be jumping into is this one here on lower how to convert something to lowercase and with any of these functions they’re going to take string values so in that case i’m going to do that lower function and i’m just going to put a string in there and we’ll just put my name in all uppercase we’ll go ahead and run this we can see it outputs it below in all lowercase if we have lower we probably also have something like upper running this we can see that it’s all upper this would usually be for the case where you have some lowercase values in there and it raises them all up the next is the trim function we’re focusing on the one up here this one down here is a non-standard syntax so we’re not going to use it and from this it removes the longest string containing only the given characters and by default that’s a space let’s actually just look at this real quick to understand what’s going on so in the case of our example if we’re using this trim right now whenever i run this with command enter there’s really no change in this whatsoever now let’s say that there was a space at the beginning and we’ll do a space at the end
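the three functions covered so far can be sketched in one small query; the literals are just placeholder strings, and the output comments assume standard postgres behavior:

```sql
-- lower/upper change case; trim strips the padding characters
-- (spaces by default) from both ends of a string
SELECT
    LOWER('LUKE')    AS lowered,        -- 'luke'
    UPPER('luke')    AS uppered,        -- 'LUKE'
    TRIM('  luke  ') AS spaces_removed; -- 'luke'
```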
Running command-enter, you can see there are no spaces, whereas if I remove the TRIM function and run it again, the spaces do show up in the output when I check it. Using this function is very important when you're working with databases full of dirty data and need to strip stray spaces. Now say we had symbols in there, dirty data with some symbols surrounding the value we want to remove; in this case I have two ampersands around each end, and running command-enter we can see them, but we want them gone. Going back to the definition, we can specify whether to trim LEADING, TRAILING, or BOTH, meaning the front of the string, the back, or both ends, with BOTH as the default; then we specify the characters to remove, then FROM and the string. So I can write BOTH, the ampersand character, FROM, and the string. Running this removes the ampersands from the front and the back (I also noticed the 'e' was missing from my name; adding it back and running again, boom, that's what we want). So what are we cleaning up in our view? Opening cohort analysis back up, under the Data tab we want to focus on the given name and surname, i.e. first and last name. We want to combine them into a single column; we don't need them separated for our analysis. That means we have to update our view, removing two columns and adding a new one, so we can't just run CREATE OR REPLACE VIEW; since we're altering columns we'd need something like ALTER VIEW, but
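The TRIM variants described above look roughly like this in Postgres (the `&&` padding is the dirty-data example from the lesson):

```sql
-- Each statement trims a different part of the string.
SELECT TRIM('   Luke   ');                 -- 'Luke'   (BOTH + space is the default)
SELECT TRIM(BOTH '&' FROM '&&Luke&&');     -- 'Luke'
SELECT TRIM(LEADING '&' FROM '&&Luke&&');  -- 'Luke&&'
SELECT TRIM(TRAILING '&' FROM '&&Luke&&'); -- '&&Luke'
```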
that gets complicated, so I recommend we just start over with this query and drop the view. I'll copy the query with command-C and paste it in here. Remember, we want to combine given name and surname. I'll run the query first to show what's going on; pressing command-enter, I see a stray AS that doesn't belong there, so I move it up top and run again. Okay, we have the given name and surname as before, and we want to combine the two. Going back to the documentation on string functions and operators and scrolling down to the other string functions, there's CONCAT, which concatenates the text representations of all its arguments, ignoring NULL arguments. So for given name and surname, I'll type concat with opening and closing parentheses around the two columns and alias the result as the cleaned name. Running this, the names are now combined. Not too bad. One thing to note: we need a space in there, and sometimes, especially with text columns, there may be extra spaces too. First the space I'm talking about: a single quote, a space, a single quote, and a comma between the arguments. Running command-enter, there's now a space in between. But as I said, names may have spaces around them, so as a good measure I'll wrap TRIM around both the given name and the surname. Running this, all the values are cleaned up, with some protection built in. Now we need to actually update the cohort analysis view; if we just tried to run CREATE OR REPLACE VIEW cohort_analysis right now, we're not going to
get it to work, because of the column issues we addressed before, so we need to drop the view first: we call DROP VIEW on cohort_analysis and then run everything underneath it. I'll run it all by pressing option-X, and it looks like both queries completed. As always, I'll close out to make sure I have the most up-to-date version, click inside, press F5 to refresh, and open up cohort analysis; scrolling over, we now have that cleaned name in there, so our view is good to go for the project. All right, you've got some practice problems now; go through and get more familiar with these text formatting functions. In the next lesson we're jumping to another project question on customer segmentation; looking forward to it, see you there. Welcome to this third and final lesson in this chapter on data cleaning. We're going to focus on question one of the project, which builds further on the analysis we did earlier on segmenting customers. Specifically, we're trying to find out who our most valuable customers are. We'll break our customers up into tiers using percentiles: high-value, mid-value, and low-value. This is a very typical business process you'd use to target certain customers and then distribute marketing that fits their needs, and shout out to Kelly for coming up with this example, because I feel it's a really good demonstration of what you'll find yourself doing as a data analyst. As always, I like to start with the final data set we'll be building: based on customer key and the cleaned name we made in this chapter, we'll determine each customer's total LTV, their lifetime value or net revenue if you will, and then based on these values we're
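Putting the view update together, it might look like the sketch below. The table and column names (`sales`, `customer`, `givenname`, `surname`, and the revenue expression) are assumptions based on the Contoso-style schema discussed earlier, not confirmed definitions:

```sql
-- Columns are changing, so recreate rather than CREATE OR REPLACE.
DROP VIEW IF EXISTS cohort_analysis;

CREATE VIEW cohort_analysis AS
SELECT
    c.customerkey,
    -- TRIM guards against stray spaces around either name part.
    CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS cleaned_name,
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue
FROM sales s
LEFT JOIN customer c ON c.customerkey = s.customerkey
GROUP BY c.customerkey, cleaned_name, s.orderdate;
```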
going to use percentiles to categorize customers into low-value, mid-value, or high-value. We'll also take this a step further and analyze the segments themselves, so that marketing not only has the names to target but we also understand the values: what percentage each segment represents, how much each contributes, and so on. Let's start a new SQL script documenting this analysis. Similar to before, when we had our own script for SQL, I'll start a new SQL script, go up top, and rename it to 1_customer_segmentation.sql. One thing to note: this lands inside our scripts folder, but we don't necessarily want it there. If I right-click it and show it in the file explorer, I can see it under scripts, and I actually want it higher up, so I'll move it out (on Windows you'd do something similar with File Explorer). Down here it isn't showing up yet; pressing F5, the old entry disappears and the one we want appears (the readme pops up too; I guess I hadn't refreshed). Opening it back up, this is the SQL file we want to work with. First, the three columns of interest: the customer key, the cleaned name, and the revenue for each customer, their total lifetime value, so we use SUM on total net revenue and alias it as total_ltv. As a reminder, going back to cohort analysis, we could have multiple entries per customer; customer 180, for example, made multiple purchases on different days, each with its own total net revenue. That's why we're naming this new column total_ltv: for 180 it will now hold the total lifetime value. We select this from our cohort analysis view, and since we're doing an
aggregation, we need a GROUP BY on customer key and also cleaned name. Let's run this and see what we have. I had no active connection; as a reminder, if you were already connected, reconnect to your database, and I just need to reselect it up here. Everything's looking good, so let's run it again. All right, bam, this is what we want: cleaned name, customer key, and the total LTV for each. I can even ORDER BY ascending and see that customer 180 is now combined into one row. Looking good. Now that we have total LTV, we can bucket these customers into high-value, low-value, and, what's the other one, mid-value. We'll do this with percentiles, using the 25th percentile and the 75th percentile. Because we're running a percentile over this aggregation, I'll put it into a CTE called customer_ltv, in parentheses. From that CTE we run the PERCENTILE_CONT function; remember, we want the 25th and 75th percentiles, and everything between them is our middle tier. So I'll write 0.25, then the WITHIN GROUP syntax, ordering by the total_ltv column so we pull the correct value as it sorts through, and give it the alias ltv_25th_percentile. Let's make sure this is right: FROM customer_ltv, command-enter, boom, the 25th percentile is at 843. I've done this already, so I know it's correct. Now for the 75th percentile, I'll copy the expression, since it just takes a few changes, change the 25 to 75 in each location, and run this bad boy. Boom. Just so you
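The percentile step can be sketched like this; `cohort_analysis` and `total_net_revenue` follow the lesson's naming, but treat them as assumptions:

```sql
WITH customer_ltv AS (
    SELECT
        customerkey,
        cleaned_name,
        SUM(total_net_revenue) AS total_ltv
    FROM cohort_analysis
    GROUP BY customerkey, cleaned_name
)
-- PERCENTILE_CONT is an ordered-set aggregate: it interpolates the value
-- at the given fraction of the sorted total_ltv distribution.
SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS ltv_25th_percentile,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS ltv_75th_percentile
FROM customer_ltv;
```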
understand what's going on here: customers who spend around $843 are at the 25th percentile, so if you spend less than that you're below the 25th percentile, whereas the 75th percentile sits around $5,500 of spending; spend more than that and you're above the 75th percentile. Those will be our high-value customers, and those below the 25th will be our low-value. Going back, I'll run just the query up top. Because we have those percentiles, we can use the customer_ltv CTE, convert the percentiles query into a CTE as well, and categorize or bucket customers with a CASE WHEN statement as high-value, low-value, or mid-value. So I'll tab the percentile query over, add a comma, and name it customer_segments, then AS and once again opening and closing parentheses so it's a CTE. From this we SELECT all the customers from customer_ltv; I'll give it the alias c and do c.*, with FROM customer_ltv c down below, which clears the little syntax error. Checking that everything appears correctly down here, I have a trailing comma to remove, and now it displays. Next we build the CASE WHEN to bucket all of these into their different tiers (I'll move the panel down some so it stops cutting things off). We write CASE, then WHEN, and the first tier to categorize is everything below the 25th percentile as low-value: WHEN total_ltv is less than ltv_25th_percentile, which I'm realizing we haven't brought in yet. We don't necessarily need a JOIN for this; since we're not joining on the data, I can just list it
as customer_segments with the alias cs, so cs.ltv_25th_percentile is now available. THEN we assign the label '1 - Low-Value'; we stick a 1, 2, 3 at the front of each label just to make sorting easier, because if you use 'Low-Value' alone and try to sort the names alphabetically, you can't get the order you want. On to the next one; I'll just copy this line since a lot of it is repetitive. For mid-value we want everything less than or equal to the 75th percentile: that mid range encapsulates everything from the 25th percentile up to and including the 75th. I'll change the label to '2 - Mid-Value'. Since we've now covered everything up to the 75th percentile, the ELSE categorizes everything else as high-value. Removing a stray space, let's go ahead and run this, fingers crossed. I see my issue: a syntax error at or near WHEN, because I'm missing a comma after here, and I also never gave this CASE statement a name; we want to call it customer_segment. Running this bad boy again, it works. Now let's make sure it categorizes correctly. I try selecting just part of the query and pressing command-enter, but I realize it won't let me run it like that, so I'll copy it into a new script, paste it in, and run it, remembering (silly me) to bring the other CTE along. You don't have to do this; the main purpose is just to see what these numbers are. Remember, it was 843 for the 25th percentile and 5,500 for the 75th. Running the complete query, do these numbers make sense? Yes: 5,500 and above should be high-value, and it's greater
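Assembled, the segmentation query looks roughly like this (column names assumed as before; the thresholds CTE produces a single row, so the comma join simply attaches it to every customer):

```sql
WITH customer_ltv AS (
    SELECT customerkey, cleaned_name, SUM(total_net_revenue) AS total_ltv
    FROM cohort_analysis
    GROUP BY customerkey, cleaned_name
), customer_segments AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS ltv_25th_percentile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS ltv_75th_percentile
    FROM customer_ltv
)
SELECT
    c.*,
    -- The numeric prefixes make the labels sort in tier order.
    CASE
        WHEN c.total_ltv < cs.ltv_25th_percentile  THEN '1 - Low-Value'
        WHEN c.total_ltv <= cs.ltv_75th_percentile THEN '2 - Mid-Value'
        ELSE '3 - High-Value'
    END AS customer_segment
FROM customer_ltv c, customer_segments cs;  -- cs is one row of thresholds
```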
than that for this second entry, and looking at these others, they fall in between. Let's look for a low-value: we've got one down here at $23, falling into place, so the data looks like it's calculating correctly. This table right here would be great to export and send to our business colleagues to find the people we want to target. (I don't know why this window renders so wide; I'm on Mac, so let me know if you're having the same problem, because it shouldn't be this big.) I'm not going to export it right now, but it would be great for colleagues to send targeted campaigns to these individuals depending on our strategy and how they're segmented. First, though, let's dive a little further and analyze how much these different customer segments are contributing: not only what individual and average customers are spending, but also their total revenue. So I'll convert this into a CTE as well, CTEs on CTEs on CTEs, named segment_values, then AS and once again opening and closing parentheses. With this, we first select the customer_segment column from that new CTE of segment_values. Let's make sure that's working correctly, and we can see it displaying down here. The first thing I'll do is get a SUM of all the different revenues: SUM on total_ltv, conveniently named total_ltv since it's a summation, and because we're aggregating we also need a GROUP BY on customer_segment. Running this: not too bad. It tells us the totals, and our high-value segment is at 135 million, whereas our low-value customers have contributed only 4 million. If we plotted this on a pie chart to see how much
they actually contributed (by the way, I only recommend pie charts for three or fewer values), we can see that low-value is tiny: oh my gosh, we've got to do something here and target these low-value customers better. This is great analysis to surface: mid-value is around 33%, which you'd expect, about a third, and high-value is almost two thirds, which is really high. Based on this little piece of data alone, there's evidence enough that we need different marketing strategies, especially for our low-value customers. Because we put those numbers at the front of the customer_segment labels, we can also ORDER BY customer_segment in descending order, and running this we get high-value at the top, then mid-value, then low. That's exactly why we prefixed the numbers: so the column sorts easily. Let's calculate a couple of other things. Mainly, I want the number of customers in each segment, so we can then find the average LTV, the lifetime value, for a customer in a given segment. For the count we'll use COUNT on the customer key, aliased as customer_count. Running this, we get the counts, and as expected the first and third tiers should be roughly equal, since each holds 25%, while the middle one should be roughly double, because it spans the 50% in between. Now that we have total_ltv and customer_count, we can divide them to get the average customer value: I'll take the total and divide it by the count, and this will be our avg_ltv. Bam. Closing this panel, this is pretty interesting: our high-value customers are spending on average around $11,000, whereas our low-value are
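Wrapped into the final `segment_values` CTE, the summary step might look like this sketch (building on the same assumed names; the CTE stack repeats the segmentation from earlier so the query is self-contained):

```sql
WITH customer_ltv AS (
    SELECT customerkey, SUM(total_net_revenue) AS total_ltv
    FROM cohort_analysis
    GROUP BY customerkey
), customer_segments AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS ltv_25th_percentile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS ltv_75th_percentile
    FROM customer_ltv
), segment_values AS (
    SELECT
        c.*,
        CASE
            WHEN c.total_ltv < cs.ltv_25th_percentile  THEN '1 - Low-Value'
            WHEN c.total_ltv <= cs.ltv_75th_percentile THEN '2 - Mid-Value'
            ELSE '3 - High-Value'
        END AS customer_segment
    FROM customer_ltv c, customer_segments cs
)
SELECT
    customer_segment,
    SUM(total_ltv)                      AS total_ltv,       -- segment revenue
    COUNT(customerkey)                  AS customer_count,  -- segment size
    SUM(total_ltv) / COUNT(customerkey) AS avg_ltv          -- average per customer
FROM segment_values
GROUP BY customer_segment
ORDER BY customer_segment DESC;  -- '3 - High-Value' first, thanks to the prefixes
```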
only around $350. Pretty substantial, and probably why those low-value customers contributed only around 2 or 3% of the total revenue. Now, there's a host of different marketing strategies you could pursue from here; feel free to pause the screen and look into each one. This isn't a master class on marketing strategy, it's on data analytics, so we won't spend too much time on it, but I did want you to see some of the capabilities this powerful data gives you. There are no practice problems for this lesson, but I do expect you to go through and update your project README. Specifically, I added a short description of what we're doing for all the different segments, a link to our SQL script along with the visualization I showed you earlier (since that was the key insight we got from this), a breakdown of the statistics for the high-value, mid-value, and low-value segments, how much each contributes and the big disparity in revenue contribution, and I wrapped it up with business insights on what we could potentially do to target high-value, mid-value, and low-value customers. All right, it's your turn now to go through and update all that. We're jumping into the next chapter on query optimization, where we'll not only cover query optimization but also answer our third and final project question. See you there. Welcome to this chapter on query optimization. We have three lessons: the first is focused on understanding how to use the EXPLAIN keyword along with some query optimization basics, the second jumps into more intermediate and advanced techniques, and the third wraps it all up with the final problem for our project. In this video and the next we're going over query optimization techniques, and I have a list
here of beginner, intermediate, and advanced techniques. You should already be familiar with the beginner ones, but we'll do a refresh in this video, and in the next lesson we'll jump into intermediate and advanced, all while using EXPLAIN and EXPLAIN ANALYZE to break each one down. The third lesson is conveniently on the third and last question in our project: retention analysis, analyzing who hasn't purchased recently. We'll build a visualization similar to this one, where we break customers down by cohort year and see how many are active and how many have churned, i.e. didn't purchase anything recently. This is a super common business concept to understand, and coincidentally Kelly was just telling me she was implementing it in her job today. So let's get into breaking this down. We'll be using the keywords EXPLAIN and EXPLAIN ANALYZE; each just goes at the beginning of whatever SQL command you're using. What's the difference between the two? EXPLAIN shows the execution plan without actually executing the query, whereas EXPLAIN ANALYZE actually does execute it, so we can see the execution times. Say we have this simple query, SELECT * FROM sales: with EXPLAIN at the beginning, running command-enter tells me the query plan (I'll break that down in a second), and it's one row of output. When I run EXPLAIN ANALYZE instead, we get two more rows; notably, it has not only the planning time but also the execution time. Now you may be thinking, Luke, when the heck would I use EXPLAIN rather than EXPLAIN ANALYZE, if it doesn't even tell me the execution time? Well, let's say
you're working with an extremely large database, millions or even billions of rows. Actually running the query could be extremely cumbersome and cost a lot, not only time but also money, so there may be cases where you'd only want to use EXPLAIN. Since the database we're working with is so small, we'll always run EXPLAIN ANALYZE with our queries, so we can also see the execution time. Let's break down what this output actually provides. The first row says it does a Seq Scan, meaning a sequential scan, basically going row by row by row, and it specifies that it's doing this on sales. Then we have three values inside the parentheses. Cost is an arbitrary unit assigned by Postgres; to be real about it, it's just made up, but quantitatively it remains consistent. The one thing to remember is that if one query has a cost of 500 and another a cost of 1,000, the second isn't necessarily going to take double the time, just longer. The cost has the syntax of a starting value, two dots, and then the next value: the startup cost and then the total cost, which in the example shown is 18.5. Next is rows, the estimated number of rows, and finally width, which is the row size in bytes. Going back to our original query, we can see a final cost of about 4,500, almost 200,000 rows, and a width of 68 bytes. Ultimately this query took 30 milliseconds to run, and the planning time, working out how it was going to execute the query, took less than a millisecond. All the times we'll be dealing with here are in milliseconds, so you may be thinking, Luke, why does this even matter, we're talking about
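In code, the two forms differ only in the keyword in front of the statement; the plan line has the shape described above (the numbers below are placeholders, not real measurements):

```sql
-- Plan only: the query is not executed, no timings reported.
EXPLAIN
SELECT * FROM sales;

-- Plan plus actual execution: adds "Planning Time" and "Execution Time" rows.
EXPLAIN ANALYZE
SELECT * FROM sales;

-- A plan node reads like:
--   Seq Scan on sales  (cost=<startup>..<total> rows=<estimate> width=<bytes>)
```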
30 milliseconds. Well, once again, this comes into play when you're dealing with databases of millions and billions of rows; those milliseconds aren't going to stay milliseconds, and I've had queries run as long as an hour, so query optimization is a must for you to understand. Expanding the output, there are some other things to cover here. Along with the execution time, we have the actual time, the rows, and loops. Rows remain the same in this case; loops is a more complicated topic that mostly comes into play with repeated inner loops, especially when we do joins, so we won't worry about it too much. What I do want to focus on is the actual time: it tells us this step started at 0.17 milliseconds and ended at 14.8 milliseconds, and the execution time after that is the time it took to display the results and so on. It's important to understand that with plain EXPLAIN, when I run it, it does not have those extra actual-time parameters (I didn't have it pulled up at the moment), but we still get the same cost, rows, and width. Okay, let's build on this by calculating the net revenue per customer and seeing how the query plan changes. For this we need the customer key, and as usual a SUM, summing up the quantity times net price times the exchange rate, aliased as net_revenue. Let's run this bad boy... I got ahead of myself: we need a GROUP BY any time we do an aggregation, so I'll specify the customer key and run it again. Now there are actually two steps here, each denoted by an arrow: this is one step, and the three rows above it are another step. The first step is somewhat counterintuitive,
but that first step is the sequential scan on sales, the most indented one: we scan the sales table, getting all 199,000 rows we need, and we can see that took about 9 milliseconds. The next step up performs the HashAggregate, using a hashing system to perform the aggregation (we won't go into hashing right now; the important thing is to understand that the SUM function is an aggregation). It tells us this runs from 54 to 56 milliseconds, so about 2 milliseconds of work, and it's done on only 49,000 rows, because the GROUP BY on customer key condenses the row count. Underneath it there's other information like the group key and how much memory was used. Ultimately this query took slightly longer than the previous one; we're up to 57 milliseconds total. Let's add just one more thing, a filter: we want orders only from 2024, so order date greater than or equal to January 1st, 2024. Running this, we get an error. Why? Because I have the clauses out of order, and notably the execution works the same way: we filter the sales table to 2024 first, then perform our GROUP BY and aggregation. We can actually prove this with the execution plan: in the first step, the sequential scan, it goes through and filters by those 2024 dates, after which we're down to 10,000 rows, and only then does it send them into the aggregate for the GROUP BY and SUM, which runs from about 27 to 28, so well under a millisecond of work. Ultimately, because of the WHERE clause, we have a much shorter execution time. So not only are we learning how to read query plans, we're understanding why
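The filtered version of the query, with the WHERE placed before GROUP BY as the error reminded us, would be along these lines (column names are assumed, Contoso-style):

```sql
-- The filter is applied during the sequential scan, so far fewer rows
-- ever reach the HashAggregate step.
EXPLAIN ANALYZE
SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
WHERE orderdate >= '2024-01-01'
GROUP BY customerkey;
```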
the keywords such as WHERE and GROUP BY come in the order they do. Anyway, it's worth noting that while we've been typing EXPLAIN and EXPLAIN ANALYZE, over here DBeaver has an Explain Execution Plan feature. If I try to run it with EXPLAIN ANALYZE still at the top of the query and click OK, it gives me an error, because EXPLAIN is already in there; so it's important to select just the statement you want and then use Explain Execution Plan, or the shortcut command-shift-E. A dialog pops up asking what you want to include; if you don't want ANALYZE, leave it unchecked, but I usually keep everything checked, including Analyze, and click OK. I find this output slightly less descriptive but more ordered in the information it provides: it's still in the same order, the sequential scan as the first step and the aggregate next, reading from the bottom up, but the times and costs are in a more readable format than the raw execution plan. It's there if you want to use it that way; we're going to continue using EXPLAIN ANALYZE throughout the remainder of this video and the next, because I find that output useful. Last thing to note: if you encounter any keywords you don't know, the best thing to do is copy the whole plan into your favorite chatbot, paste it in, and have it explain what's going on step by step. In the remainder of this video we'll go over some beginner optimization techniques that we've touched on briefly throughout this course, but actually use EXPLAIN ANALYZE to prove why you should be doing them. In the next lesson we'll go into more intermediate techniques and briefly cover some advanced ones to further level up your optimization skills. For this one on basics we'll cover three, and we're going to
be going over examples for the first two, why we avoid SELECT * and why we use LIMIT, and for the third we'll just briefly discuss using WHERE instead of HAVING. We'll start with the easiest one to prove efficient: LIMIT. If I just run SELECT * FROM sales on the entire sales table, it takes around 18 milliseconds, and you'll notice if I run it a few times that the time jumps around; on average it looks like around 24 or so. I have this in a 'not optimized' SQL query, so I'll come over to the optimized query so we can compare before and after, and we'll add a LIMIT statement, say 10 rows, and with this bad boy we get it in 0.3 milliseconds. Running it a few times, it stays pretty consistent around 0.3, compared to our previous of almost 21. So there's your actual proof that LIMIT statements are very helpful in minimizing the amount of data and saving you time. Next is SELECT *. You've heard me say time and time again, "hey, I don't recommend using SELECT * to select all the columns of a table." So let's try to optimize this by listing just one column, customer key. When I run this (a few times, actually), it comes in over 30 milliseconds, whereas the unoptimized SELECT * version runs in under 30. What the heck is going on? Well, in some cases like this one, Postgres makes it super efficient to retrieve data with the SELECT * nomenclature, so yes, SELECT * is sometimes more efficient. I'm still sticking with my recommendation, though: especially as you get into bigger databases with millions and billions of rows, list only the columns you need for your analysis rather than using SELECT *. Now the last one to look at
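The before/after pair for the LIMIT test, and the single-column variant from the SELECT * comparison, are simply:

```sql
-- Not optimized: every row, every column.
EXPLAIN ANALYZE SELECT * FROM sales;

-- Optimized: cap the rows returned.
EXPLAIN ANALYZE SELECT * FROM sales LIMIT 10;

-- Single column instead of * (customerkey is an assumed column name).
EXPLAIN ANALYZE SELECT customerkey FROM sales;
```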
is using WHERE instead of HAVING, though unfortunately you won't always have control over just easily switching from one to the other. Say we have this query getting customer keys along with all their net revenue. If we wanted to filter this data based on the net revenue, getting values higher or lower than some threshold, we'd use a HAVING clause, say HAVING greater than a thousand. And remember from order of operations: the sequential scan runs first, pulling all 199,000 rows, and only then, during the aggregate step, is the filter applied, over all 199,000 rows. Going back to the original query to demonstrate the WHERE clause: unfortunately HAVING's benefit is filtering on the aggregation, but there may be a case where we can alter the condition to apply to the raw column instead, say customer keys less than 100, a filter on customer key rather than on the aggregate. The main point is that this filtering happens in the sequential scan at the very beginning, and as we can see, because it limits how many rows flow through, the execution time is a lot shorter. Now you may say, Luke, you're comparing WHERE customer_key < 100 against HAVING on an aggregation greater than a thousand; yes, I know these aren't necessarily comparable, but there may be a situation where you know which net revenue values you care about and can translate that into a filter on particular customer key values instead (customer keys, sorry, not order keys). Mainly what I'm getting at is: if you have a choice where you could modify the query to use WHERE instead, take advantage of it. All right, you now have some practice problems to go through and get more familiar
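Side by side, the two filters discussed above look like this (thresholds from the lesson; the revenue expression and column names are assumptions):

```sql
-- HAVING: the filter runs after aggregation, over every scanned row.
SELECT customerkey,
       SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY customerkey
HAVING SUM(quantity * netprice * exchangerate) > 1000;

-- WHERE: the filter runs during the scan, before aggregation,
-- so fewer rows ever reach the aggregate step.
SELECT customerkey,
       SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
WHERE customerkey < 100
GROUP BY customerkey;
```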
with using EXPLAIN, EXPLAIN ANALYZE, and the explain feature inside of DBeaver, along with testing out some of those basic techniques we just went over. In the next lesson we're going to be going over some intermediate and also advanced techniques, along with a real-world practice problem. All right, with that, see you there. Welcome to this lesson on optimization techniques. We're starting by picking back up where we left off and jumping into intermediate techniques. We'll also briefly cover advanced techniques, but overall they're outside the scope of this course, and you'll see why. Then at the end of this, we're going to optimize the query that we built in the last chapter on data cleaning, basically applying all these techniques we've learned to make a query run faster. So what are we covering for these intermediate techniques? We have four, but the first we've really been covering in the last lesson and in this one: using query execution plans — basically using things like EXPLAIN, EXPLAIN ANALYZE, or even DBeaver's built-in options. We're going to go over three other scenarios besides this: minimizing GROUP BYs, reducing joins when possible, and optimizing ORDER BYs. So let's say we have this query where we're getting things like the customer key, order date, order key, and line number, and also an aggregation of the net revenue — which I need to give an alias of net_revenue. Okay, let's run this to see what's going on. We can see that there are two main steps: a sequential scan, then the aggregation for our GROUP BY. But our execution time, even running this a few times, is pretty high — sometimes as much as 100 milliseconds. Once again, we're dealing with milliseconds, but if you have databases with millions and billions of rows, this can easily turn
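The first intermediate technique, minimizing the GROUP BY, looks like this in practice. A sketch — the column names follow the lesson's Contoso schema, and `quantity * net_price` is an assumed stand-in for the net revenue expression:

```sql
-- Not optimized: grouping by every selected column,
-- including the very granular line_number
EXPLAIN ANALYZE
SELECT customer_key,
       order_date,
       order_key,
       line_number,
       SUM(quantity * net_price) AS net_revenue
FROM sales
GROUP BY customer_key, order_date, order_key, line_number;

-- Optimized: drop line_number from both the SELECT list
-- and the GROUP BY if the analysis doesn't need it
EXPLAIN ANALYZE
SELECT customer_key,
       order_date,
       order_key,
       SUM(quantity * net_price) AS net_revenue
FROM sales
GROUP BY customer_key, order_date, order_key;
```

Fewer grouping columns means fewer, larger groups for the aggregate step to maintain, which is where the lesson's roughly halved execution time comes from.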
from milliseconds to seconds, and as soon as queries get longer than something like one or two seconds, they get annoying. Anyway, with this query itself — I'm just going to select it and run it — we may not necessarily need every individual line number. So let's remove line_number: I'm going to move the query over to this optimized area, take it out of the SELECT and out of the GROUP BY, and run it (oops, got a little typo; try again). Okay, we can see that we now have a much lower execution time, around 67 milliseconds, and consistently less than 100. That was just by removing one GROUP BY column. An important concept to understand: is it really necessary to do all those GROUP BYs? It can get costly, and in this case we're saving almost half the time just by removing one column from the GROUP BY. The next concept is minimizing the number, and also the types, of joins when we're writing a query. In this case, let's say we're pulling multiple tables into our sales table — the customer, product, and date tables. I'll run just the query itself, and we can see we have a lot of information here. Running the full query with EXPLAIN ANALYZE, we can see it takes over 100 milliseconds; running it a few times, it runs around 80 milliseconds. Pretty intensive. If we go back to the original query, we can see that we're pulling in the year from the date table. What happens if we remove this inner join and instead add a way to extract the year out of the order date? I'm going to copy this all and paste it into the optimized section — I went ahead and already moved the inner join down here — and then we'll write an EXTRACT function, taking YEAR FROM s.order_date, and give it the alias of year. Okay, and we can clearly see, whenever we run this, that it's
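Replacing the join with a function call can be sketched like this. This assumes a date dimension table joined on the order date, as in the lesson; the exact join key in the Contoso schema may differ:

```sql
-- Not optimized: joining the date dimension table
-- just to read off the year
SELECT s.customer_key,
       d.year,
       SUM(s.quantity * s.net_price) AS net_revenue
FROM sales s
INNER JOIN date d ON s.order_date = d.date
GROUP BY s.customer_key, d.year;

-- Optimized: derive the year from a column we already have,
-- dropping the join entirely
SELECT s.customer_key,
       EXTRACT(YEAR FROM s.order_date) AS year,
       SUM(s.quantity * s.net_price) AS net_revenue
FROM sales s
GROUP BY s.customer_key, EXTRACT(YEAR FROM s.order_date);
```

One fewer table scan and one fewer join node in the plan is what produces the roughly 10% gain the lesson measures.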
providing us the exact same information for the year. Running the query completely — I need to move it up some — it looks like it's maintained around 70 milliseconds. The other one, I already forgot what it was — it was running around 80 or 90. So we're getting almost a 10% gain in performance just by removing an inner join and doing a function instead. It's also important to note not just the number of joins but also the type of joins, and we'll get into that with the practice problem coming up in a little bit. The last major concept to cover is optimizing your ORDER BYs. Sometimes ORDER BYs are not something you can just negate, but if you can, they will save you some time. Let's just print out this query — pressing command-enter, I'm getting the customer key, order date, order key, and net revenue — and in our ORDER BY we can see we order by net revenue first, followed by customer key, then order date, and then order key. Rarely do I find ordering by all columns really necessary. There are a few different ways we can optimize an ORDER BY. The first, and easiest, is just limiting the number of columns in it. The second is to avoid sorting on computed columns or function calls. The third, probably the most intuitive, is to place the most selective columns first — basically, if a column is going to narrow down the most rows, you want it first. And finally, use indexed columns for sorting to leverage existing database indexes. Unfortunately we don't have control over indexing — usually a database administrator does — but if you did, you'd want to use it. Anyway, let's go with one of the recommendations: removing function calls. We have EXPLAIN ANALYZE up at the top; I'm going to run it, and I'm seeing around 90 milliseconds or so. Now I'm going to take this
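The ORDER BY cleanup described above can be sketched like this, again using assumed Contoso column names and `quantity * net_price` as a stand-in for the net revenue expression:

```sql
-- Not optimized: sorting on a computed aggregate
-- plus every other column
SELECT customer_key,
       order_date,
       order_key,
       SUM(quantity * net_price) AS net_revenue
FROM sales
GROUP BY customer_key, order_date, order_key
ORDER BY SUM(quantity * net_price), customer_key, order_date, order_key;

-- Optimized: drop the computed expression from the sort,
-- and trim the trailing columns if they aren't needed
SELECT customer_key,
       order_date,
       order_key,
       SUM(quantity * net_price) AS net_revenue
FROM sales
GROUP BY customer_key, order_date, order_key
ORDER BY customer_key;
```

Sorting on a plain column avoids re-evaluating the expression during the sort, which is where the lesson's 10–20% saving comes from.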
exact query, put it over here, and remove the net revenue aggregation — the SUM — from the ORDER BY. Let's run this, and I'm seeing a lot less, usually around the 70-to-80-millisecond range, so we just cut off about 10 to 20% doing that alone. Now maybe we find we can trim the ORDER BY even more — say we strip it all the way down to just the customer key and run it. In this case I'm not finding that much of a difference, even seeing it get as high as 90, so this change isn't as great as removing the aggregation, but overall it does have an appreciable impact compared to our not-optimized query. So let's get into optimizing the view we previously worked with: the cohort analysis. We can find it under Databases, underneath our dataset, under the public schema, and then our views: cohort_analysis. We can look at the query itself under Source, and we have it all here. I'm going to copy this bad boy and put it into this "not optimized" script. I don't actually want to create or replace any view, so I'll remove that part and instead put EXPLAIN ANALYZE up at the top. I'm not liking how this is formatted, so I'll highlight it all, go to Format → Format SQL, and it breaks everything out more like how I like it. Now let's run this EXPLAIN ANALYZE to see what we're working with for our current execution time — and I have a typo, because I didn't get rid of the AS at the front; running again. Okay, boom, we have it all, and wow, that's a pretty hefty query plan. We can see this is one of our highest execution times so far, around 200 milliseconds to get this bad boy done. It looks like it has a total of six different steps, and that looks about right with the CTEs and all the different GROUP BYs we have
in here. Now, recalling what we just covered about improving a query: looking at this, we can see the CTE has a bunch of GROUP BYs and also a join, so that may be able to be optimized. Looking down at the bottom, at the main query itself, there aren't a lot of techniques I feel I can apply, as it's just doing a simple SELECT and FROM. So primarily we're going to focus on the CTE, customer_revenue, and the first thing we'll focus on is the join. Previously we discussed minimizing joins, but just as important is understanding when you should be using which type of join. A lot of this course and the previous course used either left joins or inner joins. Left joins are specifically used when, for table A, you want to keep all of its rows — if some rows have no match in table B, you fill those in with nulls rather than removing any of A's rows. But if we know that what we're matching on has all contents in both the A table and the B table — so there are no nulls — then an inner join is slightly more efficient, because we no longer have to do that null handling as part of the join. So I'm going to copy this query, put it over into the optimized section, and change the join here from LEFT to INNER. The first thing I want to show, though, is the actual query output. Pressing command-enter, I can check the total row count down here — around 83,099. Going back to the one with the left join and running it — I'll select it all and press command-enter — we can see its row count is also 83,099. The same. So they're still doing the same thing, where this one has a left join and this one has an inner join. Now the
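The left-versus-inner comparison above looks like this in miniature. A sketch assuming a `customer` dimension keyed on `customer_key`, per the Contoso schema:

```sql
-- LEFT JOIN: keeps every sales row; where no customer matches,
-- the customer columns come back as NULL
SELECT s.customer_key, c.givenname
FROM sales s
LEFT JOIN customer c ON s.customer_key = c.customer_key;

-- INNER JOIN: produces identical output when every sales row
-- has a matching customer, and skips the null-handling work
SELECT s.customer_key, c.givenname
FROM sales s
INNER JOIN customer c ON s.customer_key = c.customer_key;
```

Verifying that both versions return the same row count, as the lesson does, is the practical check that the switch is safe.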
question is: when we run EXPLAIN ANALYZE on the not-optimized and the optimized versions, is it going to be quicker? Well, the not-optimized one with the left join is around 200 milliseconds, and the optimized one is... around the same thing, around 200 milliseconds. So although it didn't further optimize our query in this case — it basically breaks even — there are cases where using an inner join instead of a left join, when appropriate, can save you time in your query execution. All right, we talked about joins. The last thing we can do to optimize this query has to do with the GROUP BY. Look at this: we have a bunch of GROUP BY columns, and when we look at what we're aggregating by, we have a lot of repeating values — country_full, age, given_name, and surname. What do we mean by this? Well, let's look at the query output for, say, customer 180. Yes, the order date is going to change; yes, the total net revenue and number of orders change. But things like their country, their age, their clean name, their first purchase date, or even their cohort year are not going to change. So why are we grouping by things that aren't going to change? We really only care about grouping by the customer key and the order date. Like we said, we want to minimize those GROUP BY columns — but the query isn't going to work anymore if we just remove them and run it. What we can do instead is wrap each of those columns in an aggregation function. We just need the MAX value from each (you could also use MIN; it's very popular to just use MAX). I'm doing this for age, given_name, and surname, and it's important to give each one back its alias — so I'll do that for country_full, age, given_name, and surname. First, let's run this query and make sure it's working properly. Scrolling over, we can see
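The MAX() trick for shrinking the GROUP BY can be sketched like this. A simplified sketch of the CTE, with the net revenue expression assumed:

```sql
-- Before: grouping by columns that are constant per customer
SELECT customer_key,
       order_date,
       country_full,
       age,
       given_name,
       surname,
       SUM(quantity * net_price) AS total_net_revenue
FROM sales_joined
GROUP BY customer_key, order_date,
         country_full, age, given_name, surname;

-- After: group only by the keys that actually vary, and recover
-- each per-customer constant with MAX(), keeping its alias
SELECT customer_key,
       order_date,
       MAX(country_full) AS country_full,
       MAX(age)          AS age,
       MAX(given_name)   AS given_name,
       MAX(surname)      AS surname,
       SUM(quantity * net_price) AS total_net_revenue
FROM sales_joined
GROUP BY customer_key, order_date;
```

Because those columns hold one value per customer, MAX() (or MIN()) returns exactly that value — the output is unchanged, but the grouping key the sort/hash step has to manage is much smaller.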
that everything's remained the same for country_full, age, and the given name — that final clean name we built. I think I called out first_purchase_year and cohort_year when I first talked about this; they don't have anything to do with the GROUP BY, as they're computed lower down. Anyway, let's run this query and see how long it takes — and we've now got the execution time down to 160 milliseconds, where previously it was around 200 milliseconds. Now that we have this optimized query taking less time, we can update our view using CREATE OR REPLACE VIEW cohort_analysis AS. We're not changing any columns with this, so technically it should work without doing a drop. I'll run it — and it's now telling me that it's changing a data type, so in fact we do need to drop the view first: DROP VIEW cohort_analysis. We'll now execute the entire SQL script so it runs both statements, come over here, press F5 to make sure it's refreshed, and open up cohort_analysis. I can see that we have those MAX values, we've minimized the GROUP BY, and we've changed our join — it actually saved it as a simpler JOIN now, since we were using an inner join and JOIN alone defaults to inner, so that makes sense. Now, the last thing to cover are these advanced optimization techniques. We're not going to walk through any of these, but they are ones you should be aware of. There are three major ones: using proper data types, so basically preferring integers or numbers over something like a string; using indexing to speed up your queries, basically relying on columns with indexes built in so they can be sorted more quickly; and, for large tables, partitioning to improve their performance. All three of these are controlled
by database administrators: specifically, they control the data types, whether columns have indexes, and how data is partitioned. We're not going to go into this, because it's really going to be specific to whether your database administrator has done it. If you ever run into a situation where queries are running excessively long, and you plug them into ChatGPT and can't find any fix except something around indexing, partitioning, or data types, that's when you're going to have to go to your database administrator and ask them to make changes to your table so you can get more efficient queries. Hopefully you have a good database administrator — I have in the past, and being able to go directly to them and get what I needed saved me a lot of time in the long run. All right, you have some practice problems to go through now to get more familiar with those intermediate techniques and optimizing your query, along with using EXPLAIN and EXPLAIN ANALYZE again. All right, with that, I'll see you in the next one, where we're getting into our third and final problem in this project. See you there. Welcome to this last lesson in the chapter. For this one, we're going to tackle our third and final question: performing retention analysis, specifically looking into who hasn't purchased recently. We're going to use terms such as active and churned customers; we're going to look at it overall, and then break it down into the different cohort years to see how it's trending over the years across these cohorts. So we're trying to identify which customers haven't purchased recently, and the technical business term for this is that we're trying to identify active versus churned customers. For us, active will be those that have made purchases within the last 6 months, whereas churned are those that haven't made a purchase in over 6 months. Now, 6 months isn't
necessarily something you're always going to use as the hard-and-fast line between active and churned — it's really going to depend on your industry, and maybe other factors. As a general rule of thumb, I have these four different areas, and Contoso falls into e-commerce: typically we'd see a 6-to-12-month period since last purchase used there. Something like a mobile app is going to have much quicker turnover, so they'd use something like 7 to 30 days since last session to identify active versus churned customers. Now, you may be wondering why the heck this even matters. Well, we can send off the data we end up calculating — whether a customer is active or churned — and run specific targeted marketing campaigns to get them to re-engage. Also, when we look at this holistically toward the end, getting these percentages per cohort, we can understand the effectiveness of previous campaigns in maintaining activity and preventing churn. Overall, this deals with tracking our customer retention and engagement, which is necessary because we know we have customers that have bought from us before and are likely to do it again — we need to make use of that. So what are we working toward? We want to build this table here, which will have information like our customer key and clean name, along with calculating things like their last purchase date, whether that was within the last 6 months, and a classification of either churned or active. To make this easier, we're going to use the cohort_analysis view we've been using, because it has all the information we need to extract this. Now, like our last two problems, I want to work in a script that we'll save as our final script to upload into the project we put on GitHub. Right now we have our question one and question
two. What I'm going to do is go to VS Code, and inside of here I want to create a third file, which I'll name retention_analysis — remember this is a SQL file, so I add .sql and press enter. We're not going to edit it inside of here; I just want to create it. Now, going back inside of DBeaver and clicking in here, I press F5 to refresh, and we now have our SQL file right here that we'll be working in. So what are we going to be querying? Let's go back to that cohort_analysis view and open it up. Things we definitely want are the customer key and the clean name we saw; additionally, we're going to use the order date, and we can also use the first purchase date, which will be used for some filtering that I'll explain later. Let's start defining all this. We'll start with a SELECT statement: the customer key, the clean name, the order date, and that first purchase date, all from cohort_analysis. Okay, let's run this — looks like I don't have any active connections, so we'll update it real quick to select the right data source and run. All right, so now, targeting specifically this customer 180, we have these two purchases. We already have a column for the first purchase date, but really we want to know when their last purchase was made, to understand whether they bought within that six-month period. So we need a way to identify, in a numerical way, which purchase is the most recent — and we can do this using ROW_NUMBER and partitioning. Right after order date, I'll enter ROW_NUMBER. We want to do a partition, so we use the keyword OVER, then inside parentheses PARTITION BY — and specifically we want to partition by that customer key. Then, we don't want it assigning numbers willy-nilly; we want to specify the ordering depending on the order date, so we'll do an
ORDER BY specifying order date, and we'll alias this as row number, which you'll typically see written as rn. Okay, let's run this — not bad. Looking at customer 180 again, we can see that their most recent purchase is actually number two — it goes one, two — and we want it the opposite, so their most recent purchase is numbered one. I can change this ORDER BY to descending, run again with command-enter, and now we have it in that manner, and it does order it. We can double-check some others — here at 387, everything's looking good. So we're almost to what we need out of this. Mainly, I don't want any more duplicate entries; I just want the most recent purchase, and that can be done by filtering for row number equal to one. So what I'm going to do is put this all into a CTE and then pull out what I actually need. I'll tab it over, give it the name customer_last_purchase, wrap it in opening and closing parentheses, and then do a SELECT statement: we want that customer key, clean name, and order date, and we're getting this from customer_last_purchase — remember, we want to filter WHERE that row number is equal to one. Run it — little typo in here, put something I didn't need — rerun it. All right, looking good. Now, this order date is technically an order date, but at this point it's actually the last purchase date, so I'm going to rename it with an alias of last_purchase_date. Looking good. We now need to get into classifying each of these customers as either active or churned. But I need to show you something real quick first: underneath this, in another query, we're going to look at the order date column from the sales table. I'll run just this query, and then I'm going to sort it in
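Putting the CTE and the row-number filter together, the query so far can be sketched like this. Column names such as `cleaned_name` are assumptions based on how the lesson describes the view:

```sql
-- Number each customer's purchases, most recent first,
-- then keep only the latest one per customer
WITH customer_last_purchase AS (
    SELECT customer_key,
           cleaned_name,
           order_date,
           first_purchase_date,
           ROW_NUMBER() OVER (
               PARTITION BY customer_key
               ORDER BY order_date DESC
           ) AS rn
    FROM cohort_analysis
)
SELECT customer_key,
       cleaned_name,
       order_date AS last_purchase_date
FROM customer_last_purchase
WHERE rn = 1;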
descending order. What we can see is that this data actually ends on the 20th of April, 2024. I can also just query this by doing a MAX of the order date — run with control-enter — and it's still that 4/20. So why am I telling you this? Well, as of filming, it is March of 2025, so we're almost a year ahead. If we went back six months from my time now, there isn't any data in the system within that six-month period, and so we'd have a 0% active rate. The point is, we have to do this from our last data point: 6 months before that 4/20. So here's what we're going to do — let's go back up to our query. We want to categorize these using a CASE statement, looking back to see whether each customer is within 6 months of April 20th, 2024; if so, classify them as active, otherwise classify them as churned. In our main query down at the bottom, I'm going to put in a CASE, then a WHEN using the order date column: WHEN order_date is less than April 20th, 2024 minus 6 months — so we'll do an INTERVAL specifying 6 months — THEN classify them as churned. Better said, with this calculation, if their last purchase was before October 20th, 2023, we classify them as churned; ELSE we mark them as active; and then we END the case statement. I also want to give it an alias of customer_status. Okay, let's check this out — I'll remove this extra line right here and run command-enter. We got this error message: invalid input syntax for type interval. Specifically, we're trying to do a comparison for order_date against '2024-04-20', but Postgres sees this as a string — we need to cast it as a date using the :: operator.
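The CASE statement described above, with the cast applied, can be sketched like this:

```sql
-- Classify each customer by comparing their last purchase
-- against six months before the dataset's final order date
SELECT customer_key,
       cleaned_name,
       order_date AS last_purchase_date,
       CASE
           WHEN order_date < '2024-04-20'::date - INTERVAL '6 months'
               THEN 'Churned'
           ELSE 'Active'
       END AS customer_status
FROM customer_last_purchase
WHERE rn = 1;
```

Without the `::date` cast, Postgres treats `'2024-04-20'` as ambiguous text in the subtraction and raises the interval syntax error the lesson hits.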
Go ahead and run command-enter, and now it's working. We have this extra column, and we can double-check it: November 2023 is active, December 2023 is active — it looks like it's matching up pretty well. Now there's one other minor detail we need to filter correctly for, to make sure we're getting the right calculation. I'm going to filter this data to show what I mean. All right, these are the last purchase dates — actually, this isn't showing what I want to show; we want the first purchase date. I'll show it right next to this, so I'm going to add first_purchase_date right here — remember, we have it up in the CTE above, so I can just reference it — and whenever I run this, it's now here. Let me filter by this. Okay, this is what I'm trying to show: these customers' first purchase date was April 20th, and they're active. If we keep scrolling back, all of these for a period are active — they haven't been a customer for six months yet, so they've never even qualified to become churned. I would argue that this would bias our numbers upward, especially in 2024, making it look like all of our customers are active. And that's what will happen in 2024: if I scroll all the way through to the beginning of the year, all of these customers in 2024 will remain active, and when I run a percentage on this, it's going to say 100% active in 2024, which is completely useless. We need to go back and remove everybody whose first purchase comes after 10/20, because that's where it stops being all active — this is when we actually start getting churned customers, customers who have been in the system for greater than 6 months. So all I'm going to do is modify this query with an AND on the first purchase date, basically all
of this, to require that their first purchase falls on or before October 20th, 2023 — I'll paste it right here. Now let's run this, and once again filter by that first purchase date: we don't have any first purchase dates after October 20th now, which is when we actually start having both churned and active customers. So we're going to have much better key statistics that actually match up with the data. Now that we have this cleaned up, we actually don't need the first purchase date column anymore — that's not something marketing necessarily needs — and we now have that final table I was getting at: customer key, clean name, last purchase date, and whether they are active or churned. One minor note on this query: I am not a fan of hard-coding values into a query, because if we got a data dump with more recent data files, that date may change in the system. Instead, we can use a subquery: I'll put the MAX(order_date) query within parentheses, cut it with command-X, and place it right here, and then also right here. Now whenever we run this, we have exactly the same results as previously, and this is much cleaner — that way, if those dates ever change or we get new data into the system, it will automatically update. So we have the table we need for marketing, but now I want to take it a step further and actually perform an analysis on this. First, we'll look holistically at the overall active and churned rate for everybody, and then we'll break it down by cohort year to see how the cohorts are trending over time. So let's get a percentage for active and churned. I'm going to make this into another CTE — I love me some CTEs — and we'll call this one churned_customers, within opening and closing parentheses. Then we're going to start simple: first I just
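With the first-purchase filter and the subquery replacing the hard-coded date, the classification step can be sketched like this:

```sql
-- Use the data's own latest order date instead of a hard-coded
-- literal, so the cutoff updates automatically with new data
SELECT customer_key,
       cleaned_name,
       order_date AS last_purchase_date,
       CASE
           WHEN order_date <
                (SELECT MAX(order_date) FROM sales) - INTERVAL '6 months'
               THEN 'Churned'
           ELSE 'Active'
       END AS customer_status
FROM customer_last_purchase
WHERE rn = 1
  -- Exclude customers too new to have ever qualified as churned
  AND first_purchase_date <
      (SELECT MAX(order_date) FROM sales) - INTERVAL '6 months';
```

The same `(SELECT MAX(order_date) FROM sales)` expression appears in both places, so a fresher data load moves both cutoffs at once.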
want the customer status, and a count of active versus churned for each. So I'll do a simple COUNT of the customer keys and alias it as num_customers. We're getting that from the churned_customers CTE, and since we did an aggregation we also need a GROUP BY on customer_status. Okay, let's run this. All right, not bad — looking like we have around 4,400 active and 42,000 churned. I prefer percentages, so we're going to move forward to calculating that. One thing real quick: we don't need to run DISTINCT on this — if I run command-enter, it's still the same values, because in here we already filtered down to only where row number equals one, so there should technically be only one row per customer key. I think that's just a little unnecessary, so we're not going to include it. So now we need another column, if you will, of total customers — basically adding these together, which should be around 46,800. But if you see, we're doing a GROUP BY here — well, we can use window functions to expand beyond it, because window functions are evaluated after the aggregation. What we can do is take this same COUNT of the keys and then do a SUM of it using a window function with OVER — but we're not going to PARTITION BY anything, because we want it over all of the rows — and we'll alias this as total_customers. Okay, let's run this. All right — I was a little off on the math; it's actually 46,913. Now, we need both of these pieces: if I just did SUM and tried to run this, we'd get an error, and if I just ran COUNT, we'd also get an error. We have to do the COUNT and then SUM over it for it to work. So now what we can do is divide these two values to get our percentage: I'll take this first value here and paste it down
right here, and we're going to divide by total_customers — which is this expression, command-C, command-V — and we'll alias it as status_percentage. Okay, let's run this — not too bad, but there are a lot of decimal places, so I'll just wrap this in a ROUND function, since I only really care about two decimal places. Now it's down to 9% and 91%. Comparing this to the industry — I'm using Perplexity, a chatbot that searches the internet, to get some values — I asked it what a typical churn rate is for an e-commerce company, and it said a churn rate under 5% is considered good, while the average churn rate for the e-commerce industry is around 22%. So our company's retention is well below industry standards. All right, let's take this one step further: now we're going to find this active-versus-churned rate for our cohort years and see how it progresses over the years. All we need for this is to add another column, cohort_year — but the problem is we actually need to import it higher up. Specifically, it's inside our cohort_analysis view; if I look inside, you can see cohort_year is there. So after that first purchase date, I'm going to add in cohort_year, and in our second CTE I'm also going to add it in. Now, because we added these extra columns up here, we need to add cohort_year into our GROUP BY to make sure it's working — and I actually want cohort_year before customer_status. Okay, let's run this. It isn't going to be the correct calculation just yet: we do have that cohort year in here, we have the active versus churned, and we have our number of customers, but our total customers is 46,000 the entire time — basically, that's all of our total customers — and it's driving our percentages down. So for 2015 we
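The overall percentage query built up over the last few steps can be sketched like this:

```sql
-- Count customers per status, then use an un-partitioned window
-- SUM over those counts (windows run after aggregation) to get
-- the grand total and each status's share of it
SELECT customer_status,
       COUNT(customer_key) AS num_customers,
       SUM(COUNT(customer_key)) OVER () AS total_customers,
       ROUND(
           COUNT(customer_key) / SUM(COUNT(customer_key)) OVER (),
           2
       ) AS status_percentage
FROM churned_customers
GROUP BY customer_status;
```

The `SUM(COUNT(...)) OVER ()` nesting is the key move: the inner COUNT is the per-group aggregate, and the outer SUM is a window over the grouped rows — either function alone, outside this combination, raises an error as the lesson shows.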
have 1% and 6% — and these two rows should together equal 100%. The problem is we're dividing this 237 by 46,000. We don't want the overall total customers; we want the total customers of 2015. Conveniently, all we have to do is add a PARTITION BY inside our window function, and we want to partition by cohort year. So we add in cohort_year, run command-enter — and I need to learn how to spell "partition"; okay, now we've got good syntax highlighting — and it looks like it adds up to the correct amounts. All right, looking good. I'm also going to take this PARTITION BY and throw it into our status percentage below, so we have the correct status percentage calculated, and now whenever I add up the two values per cohort, they equal 100%. As we can see, the churn percentage goes from around 8% in the early cohorts up to around 10% in more recent years — graphing it visually, we can also see this trend slowly going up over time from 8% to 10%. All right, only one thing left to do now: update our README. We already have our third SQL file in there — actually, I need to make sure it's saved, so I'll command-S it, and now when I go inside, I can see we have the entire SQL file. Next is our README. I'm going to close out of this and make it more viewable. So what did I add to the README? First I attached the README link — which, apparently, not correctly; it looks like I spelled "analysis" wrong, so always double-check your spelling when updating. Now whenever I click on it, it directs right to the file — it's always good to go through and click any hyperlinks or links you attach. Next, I attached a visualization. This one I had ChatGPT generate; I just copied and pasted it in, and it puts in the graph we previously had. From there I moved into the key findings, talking about how our churn rate stabilized around 90%
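The final per-cohort version adds `PARTITION BY cohort_year` to both window expressions, so each cohort's percentages sum to 100%:

```sql
-- Totals and percentages computed within each cohort year
-- rather than across the whole customer base
SELECT cohort_year,
       customer_status,
       COUNT(customer_key) AS num_customers,
       SUM(COUNT(customer_key))
           OVER (PARTITION BY cohort_year) AS total_customers,
       ROUND(
           COUNT(customer_key)
           / SUM(COUNT(customer_key)) OVER (PARTITION BY cohort_year),
           2
       ) AS status_percentage
FROM churned_customers
GROUP BY cohort_year, customer_status;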
for the last 2 to 3 years and then studying the fact of that retention rates are consistently low 8 to 10% way less than what the industry normal is and then finally cap it off with that newer cohorts are showing similar churn directories and basically we need to take action now to start improving these churn rates so what can we do with this data well we can work in the future to basically target those within the first year or two to improve that active rate from churned we can also combine this with other analysis and re-engage not only our churn customers but also our highv value churn customers so we can be really specific with our targeting taking this a step further we could use this analysis in predicting future churn rates and how a customer may act that goes more into data science and machine learning we’re not going to go into that but it is something that we could take away from this analysis now that we have our third and final question done we now need to get into finalizing our readme packaging it all up and putting it on github and then finally sharing on linkedin which conveniently we doing in the next chapter with that i’ll see you there and we don’t have any more practice problems for the remainder of this course so congratulations to everybody that’s been doing those practice problems with that see you in the next one welcome to this final chapter where we’re now going to go into sharing our project and first of all i want to congratulate you for making it this far and getting through this entire project it’s been quite an accomplishment thus far now this chapter only has two short lessons the first lesson which this one right here is going to be about how we can create our github repo and then our next lesson will be in actually sharing this github repo onto platform like linkedin so dialing into this lesson we’re going to be focusing on two core technologies that you may or may not be familiar with the first is git git is a version control 
system similar to like track changes in microsoft word anyway it tracks our changes and you can install on your computer and use it to track changes within files we’ll go more into depth of this as we go through this video but we’re going to be using git to create a repo or repository and we’re going to be pushing it into github github is an online platform that allows you to share remote repositories and remote being you can access it from anywhere and what’s great about this is it allows us to then share our project so what are the steps we’re going to be going through in this video well first thing we need to do is actually clean up that readme after that we’re going to do a deeper dive if you’re unfamiliar with repos git and github we’ll do an explanation of all of this thirdly we’ll move into installing git on your computer and getting this repo set up for those then to put onto github and then we’ll in the fourth and final one we’ll be synced between the two and i’ll show you how you can manage it so if you’ve been keeping up with your readme so far with all of the different analysis that we’ve done since the beginning there’s not a lot that we need to do to update this specifically we need to fill in an overview business questions and then finally any strategic recommendations we have from this the overview i just have this one sentence of hey it’s analysis of customer behavior retention and lifetime value for e-commerce company to improve customer retention and maximize revenue for three questions we bucket them into these three that we’ve gone into inant detail on each one of these not going to rehash it feel free to pause this video now copy whatever you want off of it after this you should have that analysis approach that we’ve gone through after each one of the questions and actually updated it to include everything we need and then finally our strategic recommendations i went through and bucket these based on our three different questions and i’ve 
outlined a lot of the key tactics we can take away within each one. I'm not going to rehash them here either, but I highly encourage you to brainstorm the strategic recommendations you would make and put them into this section. The final section, technical details, just lists what technologies were used, so people know that yes, I used PostgreSQL for this and ChatGPT for the visualizations. This is looking good, so I'll press Command+S (or Ctrl+S) to save it.

Before we initialize a repo or publish to GitHub, I want to cover some background. The first concept to understand is a repository, or as I'll call it going forward, a repo: a personal library for your project that lets you keep, manage, and record every change to all the files within it. As I hinted before, it's like Track Changes in Word, except we're using a version-control system, Git, to manage those changes, which means we can go back and revisit previous versions if we need to. To create this repo we use Git: a free and open-source distributed version-control system designed for everything from small to large projects. I use it all the time for my own version control, and we'll be installing it in a second.

So what exactly is going on, and how does a folder become a Git repository? Here are the files that Kelly and I worked on together to build this course, with all our different lesson plans in it; we actually used Git for version control while building the course. On the surface it just has some folders and files, nothing special. But when I unhide hidden files on a Mac (Command+Shift+Period), it shows there are other files in here, specifically a .git folder. The dot at the front means the folder is hidden; that's why you can't normally see it and why I unhid it. This folder tracks all the different changes going on inside my project. We never need to touch it to make adjustments; Git updates it automatically as we make changes. I just wanted to show it so you understand what's going on, and now I'll hide it again.

Now that we understand Git is the version-control system used to create a repository, note that there are two main types of repository. A local repository is, as the name suggests, stored on your computer. What I just showed you is in fact a local repository: no internet connection needed, it's fast, and it's very common to use one for initial development. A remote repository is stored on a server; because it's on a server and not local you need internet, but being in a remote location means it lets you collaborate with others. So I have that local repository here, and the same repo is also on GitHub as a remote repository, which lets Kelly and me work back and forth on files as contributors. Frankly, Git is more than just version control; it's also great for collaborating with others.

That brings us to GitHub, one of the most popular tools built around Git: it stores Git repos and lets you share them with the world. By the end of this video we'll publish your project to GitHub so it's publicly accessible. To summarize: Git is the version-control system; it maintains the local repo and has command-line tools you can type into (out of scope for this video), and it's open source and free, which is why we're using it. GitHub hosts these repositories behind a web-browser interface that we can access remotely, which is how we collaborate with others.

Let's now actually set up this repo and share it with the world. Four steps: first, install Git if you don't have it; second, create the repo within our project folder; third, set up a GitHub profile if you don't have one; fourth, share it to GitHub. To check whether Git is already installed, open a terminal (that's what you'll use on a Mac; on Windows you can use the terminal or a command prompt) and type `git --version`, then press Enter. In my case it's installed, so it prints the version and I don't need to install it; if it's not installed, you'll likely get an error message.

To install it, navigate to git-scm.com and download it for your appropriate operating system. On Windows you just run the setup; most modern computers are 64-bit, so you should be fine with that installer on a newer machine. It walks you through a GUI; leave all the defaults and click OK all the way through. On a Mac there are a couple of install options, all through the command line or terminal; my recommended option is Homebrew. If you don't have Homebrew, not a big deal: go to the linked page, copy the install command, paste it into the terminal, and run it. I'm not
going to run it because I already have Homebrew installed. Once Homebrew is installed, copy the command `brew install git` and execute it in the terminal, and you'll have Git installed. Then verify the install by running `git --version`; it should print the version. Some of y'all may get an error at this point saying something like "please tell me who you are," with instructions to run `git config` commands to provide your email and your name. Just copy those commands, paste them into your terminal, replace the placeholder name and email address with your actual name and email, and press Enter. You're done.

Now we move on to the next steps of creating our repo, creating our profile, and sharing to GitHub. Conveniently, steps two and four can actually be done together. In VS Code, go to the Source Control tab; it offers two options for your project: Initialize Repository or Publish to GitHub. Publish to GitHub actually initializes the repository and then publishes it, so that's what we'll use. The important thing is that you need a GitHub profile before you click it: navigate to github.com, enter your email on the homepage, and go through the sign-up process to create a GitHub account. Once your account is set up, go back into VS Code and click Publish to GitHub. It prompts that the GitHub extension wants to sign in using GitHub; allow it, select the account you want to be associated with, and you're taken back to VS Code.

First it asks whether you want a private or public repository; I want a public one. The next thing, and it's important to get this right on the first try, is selecting which files should be included in the repo. I don't care about .DS_Store or any of the hidden files (anything with a dot at the front). I do want my SQL files in there. I don't need the bookmarks folder, the diagrams folder, or the scripts folder; those are just throwaway scripts I'm not maintaining in version control, though you could include them if you wanted. I do need my images, because they're the images for my analysis. With five files selected in my case, I click OK, it goes through publishing, and it reports the project was successfully published to my GitHub.

Now I'll navigate back to GitHub and go to my repository; you can get to your repositories or your profile by clicking your name at the top of the page. One thing to note: if you haven't already, I recommend filling out the social information, adding your name, and adding a picture so your profile looks legitimate. Under the Repositories tab is this project, which I titled "intermediate sql project." Clicking into it, there's the images folder with all three of my images; the .gitignore, which lists all the files we told Git not to import (that's why it's there and you didn't see those files before); our three SQL files; and finally our readme, which renders below on the repository homepage with all of our analysis. It's great that it's all there; this is actually the URL I'd share with other people to showcase my work.

There's one last core concept before we conclude with GitHub: how to keep your project in sync with GitHub as you go. Say we make a change inside the readme: under analysis tools, I use not only PostgreSQL but also tools like DBeaver and pgAdmin. With these new changes in place, I press Command+S to save. I can now see an "M" next to the file, meaning it's modified, and the Source Control icon shows a notification of one change, telling me the readme is in fact modified. Just to demonstrate, if I refresh the GitHub page, those tools are not present there yet; we want them updated in the technical details. What we need to do, as you can see under "Changes," is stage these changes in our local repository and give a message describing what we did; you can keep it short, so I'll write "updated tools" and click Commit. It says there are no staged changes to commit and asks whether I'd like to stage all changes and commit them directly; previously we'd stage the changes and then commit them as separate steps, but this combines the two, so I click Yes. These changes are still not on GitHub, only committed locally to our repo, so now we push them to our GitHub repo by clicking Sync Changes. It warns that this action will pull and push commits from origin main, which is the branch we want updated; I click OK, and since I don't want to see this prompt again, OK, Don't Show Again. Now we can see
that we went from our first commit to a second commit, "updated tools," and if I navigate over to the readme and refresh the page, the change has been added. That was an example of pushing changes. We can also pull changes: if there are changes in the remote repo, like when Kelly and I work together and she pushes changes up, I want to pull them down. To demonstrate, we'll make a change directly on GitHub and then pull it. Inside GitHub I can edit the readme file. Let me show what I'm editing: I have these images at around 50% width, and this one is just too big, so I want to set it to 50% as well. I click edit and go to the code. This is HTML that I'm using, an image tag; you don't necessarily need to do that, all you need to understand is that I want to update the percentage. This tag is actually the image for question three, but I want question two, which isn't in an image tag. So I copy the tag from below, point its source at the image I want, set the alt text (the name) to "cohort analysis," and set width 50%, height auto. Again, you don't need this HTML formatting for images; it's just a fancy way to get that 50% width. With the changes made in the readme, I commit them, changing the commit message to something more descriptive, "update second image size" (you can add an extended description if you want), and commit directly to the main branch. Scrolling down the readme, the image is now formatted correctly, so our remote repository is updated.

But if I view the readme from our local repository, the question-three image is still small, the second image is still big, and it's still using the plain Markdown format for the images. So over in Source Control we want to pull these changes down from GitHub: select More Actions, then Pull. Below, we now have not only "first commit" and "updated tools" but also "update second image size," and closing this out, all the image sizes are now formatted correctly and the updated code is here locally.

So remember the two main concepts we went over: pushing changes sends our local repository changes up to the remote repository on GitHub, and pulling changes brings any changes on the remote repository down into the local repository so it stays up to date. That was just a brief intro to Git and GitHub; if you're new to this and want additional resources, I have an entire YouTube video on it, linked above, that goes through the ins and outs of Git in even more detail. In the next lesson, now that we have this public repository on GitHub, we'll go forward with sharing it on LinkedIn. See you there.

Welcome to this very last lesson, and once again congratulations for finishing the project. It's now time to share that GitHub repo on LinkedIn, and we'll also go through how those who purchased the course perks can upload their course certificate, which you'll receive after completing the end-of-course survey. Navigate over to LinkedIn; you should have a profile, and if you don't I highly recommend creating one, because this is where employers are, and this is where they're checking your work. On my profile there are sections like About, Featured, and Activity, but what I care about is the Licenses & Certifications section; this is where we'll upload the course certificate. Remember, once you complete the end-of-course survey I'll email it to you; you'll not only get a link, you can also download the certificate itself. If you're not seeing Licenses & Certifications, go all the way to the top of your profile, click Add Profile Section, and under Recommended click Add Licenses & Certifications.

Let's add the certificate by clicking the plus icon and filling everything in. I entered "Intermediate SQL for Data Analytics," put myself as the issuing organization, and set the issue date to March 2025; the certificate never expires, so leave the expiration date blank. There's a credential ID located on the certificate, so enter that along with the URL. Under skills you can list up to five; I'd recommend PostgreSQL, SQL, Git, GitHub, and DBeaver. Finally, I also like including an image of the certificate itself: select Add Media, attach the file, give it an appropriate title, and click Apply. Once everything is in, all you have to do is click Save and it's there.

The next thing is updating the Projects section, because along with the certificate this course produced a project. Under Projects I click the add icon, give it a name of "Intermediate SQL Sales Analysis," add a short description that I borrowed from our readme, and fill in the same five core skills from the certificate. Next is the media: I like to include a link, specifically a link to the GitHub repo, so grab that repository URL, copy it, paste it in, and click Add. I'm liking everything it shows, so I click Apply. With the media in, set the dates: I started this back in October 2024 and just finished it this month, March 2025. If you worked with somebody, the way I worked with Kelly, you can add them as a contributor, and you can associate other projects; I don't have any. Then click Save, and the project is updated on my LinkedIn.

The last thing I recommend is sharing a post on LinkedIn about this project completion, letting everybody know you've completed the course and the project. Don't forget to tag me and Kelly; it's super awesome, and I love going through these posts, seeing all the different work, and commenting on it. Once again, congratulations on all the work you've put into completing this course and this project. What are the next steps? In the next few months we'll be releasing an Advanced SQL for Data Analytics course, which I'll link somewhere on here if you're interested. With that, don't forget to follow me on LinkedIn and smash that like button. See you in the next one!
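The cohort retention fix described at the start of this section hinges on one idea: PARTITION BY cohort_year makes the window total a per-cohort total, so each status percentage is divided by the right denominator. Below is a minimal sketch of that pattern using Python's built-in sqlite3 (window functions need SQLite 3.25+, which ships with modern Python builds); the table name, column names, and counts are invented for illustration, not taken from the actual project data:

```python
import sqlite3

# Hypothetical cohort data: (cohort_year, status, customer_count).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cohorts (cohort_year INT, status TEXT, customer_count INT)")
conn.executemany(
    "INSERT INTO cohorts VALUES (?, ?, ?)",
    [(2015, "active", 237), (2015, "churned", 2763),
     (2016, "active", 300), (2016, "churned", 2700)],
)

# PARTITION BY cohort_year scopes the SUM to each cohort,
# so active% + churned% adds up to 100% within every cohort year.
rows = conn.execute("""
    SELECT cohort_year,
           status,
           ROUND(100.0 * customer_count /
                 SUM(customer_count) OVER (PARTITION BY cohort_year), 1) AS pct
    FROM cohorts
    ORDER BY cohort_year, status
""").fetchall()
for r in rows:
    print(r)  # e.g. (2015, 'active', 7.9)
```

Remove the PARTITION BY and each row is divided by the grand total across all cohorts instead, which is exactly the bug being fixed in the lesson.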

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Exploring SQL for Database Management and Data Analysis

    Exploring SQL for Database Management and Data Analysis

    The provided texts offer a comprehensive introduction to databases and SQL, covering fundamental concepts like tables, columns, and records, alongside essential SQL commands for data manipulation and querying. They further explore the role of SQL in data analysis, outlining necessary skills, qualifications, project work, portfolios, and internships for aspiring data analysts. Advanced SQL topics such as joins, subqueries, stored procedures, triggers, views, and window functions are examined in detail through explanations and practical examples using MySQL. Finally, the material transitions to PostgreSQL, demonstrating similar SQL functionalities and introducing more advanced features like case statements, aggregate functions, and user-defined functions, while also discussing the importance and top certifications in the field of data analytics.

    SQL Fundamentals Study Guide

    Quiz

    1. What is the purpose of the GROUP BY clause in SQL? Provide a brief example of its syntax. The GROUP BY clause in SQL is used to group rows that have the same values in one or more columns into summary rows. It is often used with aggregate functions to calculate metrics for each group. For example: SELECT department, COUNT(*) FROM employees GROUP BY department;
    2. Explain the difference between the WHERE clause and the HAVING clause in SQL. When would you use each? The WHERE clause filters individual rows based on a specified condition before any grouping occurs. The HAVING clause filters groups based on a specified condition after grouping has been performed by the GROUP BY clause. You use WHERE to filter individual records and HAVING to filter groups of records.
    3. Describe the main categories of SQL data types discussed in the source material. Give one example for each category. The source material outlines several main categories of SQL data types: exact numeric (e.g., INTEGER), approximate numeric (e.g., FLOAT), date and time (e.g., DATE), string (e.g., VARCHAR), and binary (e.g., BINARY).
    4. List three types of SQL operators and provide a brief explanation of what each type is used for. Three types of SQL operators are: arithmetic operators (used for mathematical calculations like addition: +), logical operators (used to combine or modify conditions, like AND), and comparison operators (used to compare values, like equal to: =).
    5. What are SQL joins used for? Briefly explain the purpose of an INNER JOIN. SQL joins are used to combine rows from two or more tables based on a related column between them. An INNER JOIN returns only the rows where there is a match in both tables based on the join condition; rows with no match in either table are excluded.
    6. What is a subquery in SQL? Provide a simple example of how a subquery might be used. A subquery is a query nested inside another SQL query (such as SELECT, FROM, or WHERE). It is often used to retrieve data that will be used in the main query’s conditions. For example: SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);
    7. Explain the concept of a stored procedure in SQL. What are some potential benefits of using stored procedures? A stored procedure is a set of SQL statements with an assigned name, which is stored in the database. Benefits include reusability of code, improved performance (as they are pre-compiled), and enhanced security by granting access only to the procedure rather than the underlying tables.
    8. What is a trigger in SQL? Describe a scenario where a trigger might be useful. A trigger is a stored program that automatically executes in response to certain events (e.g., INSERT, UPDATE, DELETE) on a particular table. A trigger could be useful for automatically updating a timestamp field whenever a row in a table is modified, ensuring data integrity or auditing changes.
    9. Describe what a view is in SQL. How does it differ from a regular table? A view is a virtual table based on the result of an SQL statement. Unlike regular tables, views do not store data themselves; instead, they provide a customized perspective of data from one or more underlying tables. Changes made through a simple view might affect the base tables, but complex views are often read-only.
    10. What is the purpose of the ORDER BY clause in SQL? Explain how to sort results in descending order. The ORDER BY clause is used to sort the result set of a SQL query based on one or more columns. To sort results in descending order, you specify the column(s) to sort by and append the DESC keyword after the column name(s). For example: SELECT * FROM products ORDER BY price DESC;
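Several of the quiz answers above (GROUP BY, WHERE vs. HAVING) can be checked in a single runnable query. This is a hedged sketch using Python's built-in sqlite3 with an invented employees table; the names, salaries, and thresholds are illustrative only:

```python
import sqlite3

# Toy employees table to contrast WHERE (row filter) with HAVING (group filter).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ann", "Sales", 50000), ("Bob", "Sales", 60000),
    ("Cat", "IT", 70000), ("Dan", "IT", 40000), ("Eve", "HR", 45000),
])

# WHERE removes Dan's row (salary < 45000) before grouping;
# HAVING then keeps only departments with more than one remaining employee.
result = conn.execute("""
    SELECT department, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary >= 45000
    GROUP BY department
    HAVING COUNT(*) > 1
    ORDER BY department
""").fetchall()
print(result)  # [('Sales', 2, 55000.0)] — IT and HR each have one row left
```

Swapping the HAVING condition into the WHERE clause would be a syntax error here, since COUNT(*) only exists after grouping, which is the core of the WHERE/HAVING distinction.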

    Answer Key

    1. The GROUP BY clause in SQL groups rows with the same values in specified columns, often used with aggregate functions for summarized data. Example: SELECT department, COUNT(*) FROM employees GROUP BY department;
    2. WHERE filters rows before grouping, while HAVING filters groups after GROUP BY. Use WHERE for record-level conditions and HAVING for group-level conditions on aggregated results.
    3. The main categories are exact numeric (e.g., INTEGER), approximate numeric (e.g., FLOAT), date and time (e.g., DATE), string (e.g., VARCHAR), and binary (e.g., BINARY).
    4. Arithmetic operators perform calculations (+, -, *, /, MOD). Logical operators combine conditions (AND, OR, NOT). Comparison operators evaluate relationships between values (=, <>, >, <, >=, <=).
    5. SQL joins combine rows from multiple tables based on related columns. INNER JOIN returns only matching rows from both tables based on the join condition.
    6. A subquery is a query nested within another query, often used to provide values for conditions in the outer query. Example: SELECT * FROM products WHERE price > (SELECT AVG(price) FROM products WHERE category = 'Electronics');
    7. A stored procedure is a pre-compiled set of SQL statements stored in the database, offering benefits like code reuse, improved performance, and enhanced security.
    8. A trigger is a database object that automatically executes SQL code in response to specific events on a table. Useful for auditing changes by logging every update to a separate history table.
    9. A view is a virtual table based on the result of a query, providing a specific perspective on the data without storing it directly. It differs from a regular table by not holding persistent data.
    10. The ORDER BY clause sorts the query result set. To sort in descending order, use the DESC keyword after the column name in the ORDER BY clause (e.g., ORDER BY salary DESC).
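The distinction in answer 5 between join types is easiest to see by running both against the same data. A minimal sketch with Python's sqlite3 and invented customers/orders tables (the names and amounts are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INT, name TEXT);
    CREATE TABLE orders (customer_id INT, amount INT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    INSERT INTO orders VALUES (1, 100), (1, 50), (2, 75);
""")

# INNER JOIN: only customers with at least one matching order row.
inner = conn.execute("""
    SELECT c.name, o.amount FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN: every customer; Cat has no orders, so her amount is NULL (None).
left = conn.execute("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()
print(inner)  # [('Ann', 50), ('Ann', 100), ('Bob', 75)]
print(left)   # same three rows plus ('Cat', None)
```

The extra ('Cat', None) row in the LEFT JOIN result is the NULL-padding behavior described in the outer-join glossary entry.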

    Essay Format Questions

    1. Discuss the importance of data types in SQL. Explain how choosing the appropriate data type for a column can impact database performance and data integrity. Provide specific examples of scenarios where different data types would be most suitable.
    2. Elaborate on the different types of SQL joins (INNER, LEFT, RIGHT, FULL). Explain the conditions under which each type of join is most useful and provide conceptual examples illustrating the results of each join type using sample tables.
    3. Analyze the benefits and drawbacks of using stored procedures and triggers in SQL database design. Consider aspects such as performance, maintainability, security, and complexity. Provide scenarios where each would be a particularly advantageous or disadvantageous choice.
    4. Explain the concept and benefits of using views in SQL. Discuss how views can contribute to data security, query simplification, and data abstraction. Describe different types of views and their specific use cases.
    5. Compare and contrast the use of subqueries and joins in SQL for retrieving data from multiple tables. Discuss the scenarios where one approach might be preferred over the other, considering factors such as readability, performance, and the complexity of the relationships between tables.
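For essay question 5, it can help to see the same question answered both ways. The sketch below (sqlite3, invented products table) returns products priced above the Electronics average, once with a scalar subquery and once by joining against a derived table; both yield the same result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT, price INT);
    INSERT INTO products VALUES
      ('Laptop', 'Electronics', 1200), ('Phone', 'Electronics', 800),
      ('Desk', 'Furniture', 300), ('Monitor', 'Electronics', 400);
""")

# Subquery form: the inner SELECT produces one scalar (the Electronics average).
sub = conn.execute("""
    SELECT name FROM products
    WHERE price > (SELECT AVG(price) FROM products WHERE category = 'Electronics')
""").fetchall()

# Join form: the same average becomes a one-row derived table joined on a condition.
join = conn.execute("""
    SELECT p.name
    FROM products p
    JOIN (SELECT AVG(price) AS avg_price FROM products
          WHERE category = 'Electronics') a
      ON p.price > a.avg_price
""").fetchall()
print(sub, join)  # [('Laptop',)] [('Laptop',)] — average is 800, only Laptop exceeds it
```

The subquery reads more directly here; the join form can pay off when the aggregate must be reused per group rather than once for the whole table.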

    Glossary of Key Terms

    • Clause: A component of an SQL statement that performs a specific function (e.g., SELECT, FROM, WHERE, GROUP BY, ORDER BY).
    • Data Type: The attribute that specifies the type of data that a column can hold (e.g., numeric, string, date).
    • Operator: Symbols or keywords used to perform operations in SQL expressions (e.g., arithmetic, logical, comparison).
    • Join: An SQL operation that combines rows from two or more tables based on a related column.
    • Inner Join: Returns rows only when there is a match in both tables based on the join condition.
    • Outer Join (Left, Right, Full): Returns all rows from one table and the matching rows from the other; if no match, NULLs are used for the non-matching table.
    • Subquery (Nested Query): A query embedded inside another SQL query.
    • Stored Procedure: A pre-compiled collection of SQL statements stored in the database.
    • Trigger: A database object that automatically executes a block of SQL code in response to certain events on a table.
    • View: A virtual table based on the result of an SQL SELECT statement.
    • Aggregate Function: A function that performs a calculation on a set of values and returns a single summary value (e.g., COUNT, SUM, AVG, MIN, MAX).
    • GROUP BY Clause: Groups rows with the same values in one or more columns.
    • HAVING Clause: Filters the results of a GROUP BY clause based on specified conditions.
    • WHERE Clause: Filters rows based on specified conditions before grouping.
    • ORDER BY Clause: Sorts the result set of a query based on specified columns.
    • DESC: Keyword used with ORDER BY to sort in descending order.
    • ASC: Keyword used with ORDER BY to sort in ascending order (default).
    • Alias: A temporary name given to a table or column in a SQL query for brevity or clarity.

    Briefing Document: Review of SQL Concepts and MySQL/PostgreSQL Usage

    This briefing document summarizes the main themes, important ideas, and facts presented across the provided sources, which primarily focus on introducing and demonstrating various aspects of SQL using MySQL and PostgreSQL.

    Main Themes:

    • Fundamentals of SQL: The sources cover core SQL concepts, including data manipulation language (DML) commands (SELECT, INSERT, UPDATE, DELETE), data definition language (DDL) commands (CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, CREATE VIEW, DROP VIEW), clauses (WHERE, GROUP BY, HAVING, ORDER BY, JOIN, LIMIT), data types, operators, and basic SQL functions.
    • Database Management Systems: The documents illustrate the practical application of SQL within two popular database management systems: MySQL and PostgreSQL. This includes installation (for MySQL), connecting to servers, and executing SQL commands within their respective interfaces (MySQL Workbench, command-line interface, and online compilers for PostgreSQL).
    • Data Filtering and Sorting: A significant portion of the content focuses on how to effectively filter data using the WHERE and HAVING clauses and how to sort results using the ORDER BY clause. The use of comparison operators, logical operators (AND, OR, BETWEEN, LIKE, NOT LIKE), and pattern matching is highlighted.
    • Data Aggregation: The GROUP BY and HAVING clauses are explained and demonstrated for summarizing data based on groups, along with aggregate functions like COUNT, SUM, AVG, MAX, and MIN.
    • Joining Tables: The concept of joining data from multiple tables is introduced, with a focus on INNER JOIN and the importance of common fields for linking tables.
    • Advanced SQL Concepts: The sources delve into more advanced topics such as subqueries (nested queries), views (virtual tables), stored procedures (reusable SQL code), triggers (actions performed automatically in response to database events), Common Table Expressions (CTEs/WITH expressions), and window functions (for analytical queries).
    • SQL Functions: Various built-in SQL functions are explained and demonstrated, including mathematical functions (ABS, GREATEST, LEAST, MOD, POWER, SQRT, SIN, COS, TAN, CEILING, FLOOR) and string functions (CHARACTER_LENGTH, CONCAT, LEFT, RIGHT, SUBSTRING/MID, REPEAT, REVERSE, LTRIM, RTRIM, TRIM, POSITION, ASCII).
    • Practical Application and Examples: The sources heavily rely on practical examples and demonstrations within MySQL Workbench and online PostgreSQL environments to illustrate the usage and benefits of different SQL concepts and commands.
    • Database Connectivity with Python: One source provides a basic introduction to connecting to a MySQL database using Python, creating databases and tables, inserting data, and executing queries.
    • Common Interview Questions: One section focuses on typical SQL interview questions, covering topics like INDEX, GROUP BY, ALIAS, ORDER BY, differences between WHERE and HAVING, VIEW, and STORED PROCEDURE.

    Most Important Ideas and Facts (with Quotes):

    • SQL Clauses for Data Manipulation: “WHERE condition one, condition two, and so on; then we have the GROUP BY clause that takes various column names, so you can write GROUP BY column1, column2, and so on; next we have the HAVING clause to filter out tables based on groups; finally we have the ORDER BY clause to sort the result in ascending or descending order” (01.pdf) – This outlines the basic structure and purpose of key SQL clauses.
    • The WHERE clause filters rows before grouping, while the HAVING clause filters groups after they are formed.
    • SQL Data Types: The document lists various SQL data types, categorizing them as exact numeric (integer, small int, bit, decimal), approximate numeric (float, real), date and time (date, time, timestamp), string (char, varchar, text), and binary (binary, varbinary, image).
    • SQL Operators: Basic arithmetic, logical (all, and, any, or, between, exists), and comparison operators (=, !=, >, <, >=, <=, NOT <, NOT >) are fundamental for constructing SQL queries.
    • MySQL Workbench Installation: The source provides a step-by-step guide to installing MySQL Workbench on Windows, including downloading the installer from the official Oracle website (mysql.com), choosing a custom setup, and selecting components like MySQL Server, MySQL Shell, and MySQL Workbench. The importance of setting a password for the root user is emphasized: “now here, set the password for your root user; by the way, root is the default user, and this user will have access to everything” (01.pdf).
    • Basic MySQL Commands: Commands like SHOW DATABASES, USE <database_name>, SHOW TABLES, SELECT * FROM <table_name>, and DESCRIBE <table_name> are introduced as essential for navigating and inspecting database structures.
    • Creating Tables: The CREATE TABLE command syntax is explained, including defining column names and their data types, and specifying constraints like PRIMARY KEY and NOT NULL.
    • Inserting Data: The INSERT INTO command is used to add new rows into a table, specifying the table name and the values for each column.
    • String Functions: “there’s also a function called POSITION in MySQL; the POSITION function returns the position of the first occurrence of a substring in a string” (01.pdf)
    • “the ASCII function returns the ASCII value for a specific character” (01.pdf)
    • PostgreSQL’s string functions like CHARACTER_LENGTH, CONCAT, LEFT, RIGHT, REPEAT, and REVERSE provide powerful text manipulation capabilities.
    • GROUP BY and Aggregate Functions: The GROUP BY clause groups rows with the same values in specified columns, allowing the application of aggregate functions to each group.
    • HAVING Clause for Filtering Groups: “the HAVING clause works like the WHERE clause; the difference is that the WHERE clause cannot be used with aggregate functions. The HAVING clause is used with a GROUP BY clause to return those rows that meet a condition” (Source 17.pdf).
    • JOIN Operations: SQL joins (INNER JOIN is primarily discussed) are used to combine rows from two or more tables based on related columns.
    • Subqueries (Nested Queries): A subquery is a query embedded within another SQL query, used to retrieve data that will be used in the main query’s conditions.
    • Views (Virtual Tables): “views are actually virtual tables that do not store any data of their own but display data stored in other tables; views are created by joining one or more tables” (01.pdf).
    • Views simplify complex queries and can enhance data security. The CREATE VIEW, RENAME TABLE (for renaming views), and DROP VIEW commands are used to manage views.
    • Stored Procedures:“a stored procedure is an SQL code that you can save so that the code can be reused over and over again” (01.pdf).
    • Stored procedures can take input parameters (IN parameters) and help in encapsulating and reusing SQL logic.
    • Triggers: Triggers are SQL code that automatically executes in response to certain events (e.g., BEFORE INSERT, AFTER UPDATE) on a table.
    • Window Functions: Introduced in MySQL 8.0, window functions perform calculations across a set of table rows that are related to the current row, allowing for analytical queries (e.g., calculating total salary per department using SUM() OVER (PARTITION BY)). The RANK(), DENSE_RANK(), and FIRST_VALUE() functions are examples of window functions.
    • Common Table Expressions (CTEs): CTEs, defined using the WITH keyword, are temporary, named result sets defined within the scope of a single query, improving readability and allowing for recursive queries.
    • Database Connectivity with Python: The mysql.connector library in Python can be used to connect to MySQL databases, execute SQL queries, and retrieve results. The basic steps involve creating a server connection, creating databases, connecting to specific databases, and executing queries using cursors.
    • PostgreSQL Specifics: The sources also demonstrate SQL concepts within a PostgreSQL environment using online compilers, highlighting similar SQL syntax and the availability of functions like BETWEEN, LIKE for pattern matching (% for any sequence of characters, _ for a single character), and various mathematical and string functions. The ALTER TABLE … RENAME COLUMN command is shown for modifying table schema. The LIMIT clause in PostgreSQL restricts the number of rows returned by a query.
    • SQL Interview Preparedness: The final source provides insights into common SQL interview questions, emphasizing understanding of fundamental concepts and practical application.
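
    Several of the ideas above (CTEs and window functions in particular) can be sketched together. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL 8.0, since SQLite 3.25+ also supports WITH expressions and window functions; the emp table and its rows are invented for illustration:

```python
import sqlite3

# In-memory SQLite database; emp is a made-up table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("A", "Sales", 100), ("B", "Sales", 200), ("C", "IT", 300)])

# A CTE feeding a window function: total salary per department appears on
# every row, without collapsing rows the way GROUP BY would.
rows = conn.execute("""
    WITH staff AS (SELECT name, dept, salary FROM emp)
    SELECT name, dept, SUM(salary) OVER (PARTITION BY dept) AS dept_total
    FROM staff ORDER BY name
""").fetchall()
print(rows)  # [('A', 'Sales', 300), ('B', 'Sales', 300), ('C', 'IT', 300)]
```

    The same query text would run unchanged on MySQL 8.0 or PostgreSQL.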

    Overall Significance:

    The provided sources offer a comprehensive introduction to fundamental and advanced SQL concepts, demonstrating their application in both MySQL and PostgreSQL. They emphasize practical learning through examples and hands-on exercises, making them valuable resources for individuals learning SQL or preparing for database-related tasks and interviews. The inclusion of database connectivity with Python further highlights the role of SQL in broader data management and application development contexts.

    Understanding Fundamental SQL Concepts and Operations

    1. What are the fundamental components of a SQL query?

    A fundamental SQL query typically involves the SELECT statement to specify the columns you want to retrieve, the FROM clause to indicate the table(s) you are querying, and optionally, the WHERE clause to filter rows based on specific conditions. Additionally, you might use GROUP BY to group rows with the same values, HAVING to filter groups, and ORDER BY to sort the result set in ascending (ASC) or descending (DESC) order.
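
    The components above can be seen end to end in a minimal, hypothetical example. Python's built-in sqlite3 module is used here as a convenient stand-in for MySQL/PostgreSQL; the employees table and its rows are invented for illustration:

```python
import sqlite3

# In-memory SQLite database with a made-up employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Sales", 52000), ("Ben", "Sales", 48000), ("Chen", "IT", 61000)],
)

# SELECT ... FROM ... WHERE ... ORDER BY, exactly as described above.
rows = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE department = 'Sales' ORDER BY salary DESC"
).fetchall()
print(rows)  # [('Asha', 52000.0), ('Ben', 48000.0)]
```

    The same SELECT statement would run unchanged on MySQL or PostgreSQL.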

    2. What are the common data types available in SQL?

    SQL supports various data types to define the kind of data a column can hold. These include exact numeric types like INT, SMALLINT, BIT, and DECIMAL; approximate numeric types such as FLOAT and REAL; date and time types like DATE, DATETIME, and TIMESTAMP; string data types including CHAR, VARCHAR, and TEXT; and binary data types such as BINARY, VARBINARY, and IMAGE.

    3. What are the different categories of operators used in SQL?

    SQL uses several categories of operators. Arithmetic operators perform mathematical operations (+, -, *, /, MOD). Logical operators (AND, OR, NOT, ALL, ANY, BETWEEN, EXISTS, etc.) are used to combine or negate conditions. Comparison operators (=, !=, >, <, >=, <=, NOT <, NOT >) are used to compare values.

    4. How can you set up and connect to a MySQL database using MySQL Workbench and the command line?

    To set up MySQL, you typically download the MySQL Installer from the official Oracle website. During the installation, you can choose to install MySQL Server, MySQL Shell, and MySQL Workbench. You’ll need to configure the server instance, set a password for the root user, and execute the configuration.

    To connect via MySQL Workbench, you open the application, click on the local instance connection, and enter your root password.

    To connect via the command line, you need to navigate to the bin directory of your MySQL installation using the cd command in the command prompt. Then, you can use the command mysql -u root -p, and upon entering your password, you’ll be connected to the MySQL server.

    5. What are some basic SQL commands for database and table manipulation?

    Some basic SQL commands include:

    • SHOW DATABASES; to list the existing databases.
    • USE database_name; to select a specific database to work with.
    • SHOW TABLES; to list the tables within the selected database.
    • SELECT * FROM table_name; to view all rows and columns in a table.
    • DESCRIBE table_name; or DESC table_name; to show the structure of a table (column names, data types, etc.).
    • CREATE DATABASE database_name; to create a new database.
    • CREATE TABLE table_name (column1 datatype, column2 datatype, …); to create a new table with specified columns and data types.
    • DROP TABLE table_name; to delete a table.
    • DROP DATABASE database_name; to delete a database.
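
    The workflow these commands describe can be made concrete with Python's sqlite3 module as a stand-in. SQLite has no SHOW TABLES or DESCRIBE, so its closest equivalents (querying sqlite_master and PRAGMA table_info) are used instead; all names here are invented:

```python
import sqlite3

# In-memory SQLite database; "customers" is a made-up example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sqlite_master plays the role of SHOW TABLES.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['customers']

# PRAGMA table_info plays the role of DESCRIBE table_name.
cols = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(customers)")]
print(cols)  # [('id', 'INTEGER'), ('name', 'TEXT')]

# DROP TABLE works the same as in MySQL/PostgreSQL.
conn.execute("DROP TABLE customers")
```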

    6. How do GROUP BY and HAVING clauses work in SQL?

    The GROUP BY clause in SQL is used to group rows in a table that have the same values in one or more columns into summary rows. It is often used with aggregate functions (like COUNT, MAX, MIN, AVG, SUM) to compute values for each group.

    The HAVING clause is used to filter the results of a GROUP BY clause. It allows you to specify conditions that must be met by the groups. The key difference from the WHERE clause is that WHERE filters individual rows before grouping, while HAVING filters groups after they have been formed.
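
    The WHERE-before-grouping versus HAVING-after-grouping distinction can be demonstrated concretely. A small sketch via Python's sqlite3 module; the orders table and its rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("East", 50), ("East", 70), ("West", 30), ("West", 10)])

# WHERE filters individual rows before grouping;
# HAVING filters the groups after aggregation.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount > 20          -- drops the 10-unit West order first
    GROUP BY region
    HAVING SUM(amount) > 100   -- then keeps only the large groups
""").fetchall()
print(rows)  # [('East', 120)]
```

    East survives (50 + 70 = 120 > 100) while West is eliminated in two stages: WHERE removes its 10-unit row, then HAVING rejects the remaining 30-unit group.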

    7. What are SQL JOINs and what are some common types?

    SQL JOINs are used to combine rows from two or more tables based on a related column between them. This allows you to retrieve data from multiple tables in a single query. Common types of JOINs include:

    • INNER JOIN: Returns rows only when there is a match in both tables.
    • LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table and the matching rows from the right table. If there’s no match in the right table, NULLs are used for the right table’s columns.
    • RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table and the matching rows from the left table. If there’s no match in the left table, NULLs are used for the left table’s columns.
    • FULL OUTER JOIN: Returns all rows when there is a match in either the left or right table. If there is no match in one of the tables, NULLs are used for the columns of the table without a match. (Note: MySQL does not directly support FULL OUTER JOIN, but it can be simulated by combining a LEFT JOIN and a RIGHT JOIN with UNION.)

    JOIN conditions are typically specified using the ON keyword, indicating which columns should be compared for equality.
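
    The join types can be compared side by side. A sketch using Python's sqlite3 module (which supports INNER and LEFT JOIN; RIGHT and FULL OUTER JOIN only arrived in SQLite 3.39); the customers and orders tables are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 40);
""")

# INNER JOIN: only rows with a match on both sides survive.
inner = conn.execute("""
    SELECT c.name, o.total FROM customers c
    INNER JOIN orders o ON c.id = o.customer_id
""").fetchall()
print(inner)  # [('Asha', 40)]

# LEFT JOIN: every customer appears; NULL (None) where no order matches.
left = conn.execute("""
    SELECT c.name, o.total FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    ORDER BY c.id
""").fetchall()
print(left)  # [('Asha', 40), ('Ben', None)]
```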

    8. What are subqueries and stored procedures in SQL?

    A subquery (or inner query) is a query nested inside another SQL query. Subqueries can be used in the SELECT, FROM, WHERE, and HAVING clauses. They are often used to retrieve data that will be used in the conditions or selections of the outer query. Subqueries can return single values, lists of values, or even entire tables.

    A stored procedure is a set of SQL statements with an assigned name, which is stored in the database. Stored procedures can be executed by calling their name. They offer several benefits, such as code reusability, improved performance (as the code is pre-compiled and stored on the server), and enhanced security by granting execute permissions without direct table access. Stored procedures can also accept input parameters and return output parameters.
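
    The subquery half of this answer can be shown directly (SQLite has no stored procedures, so that half is omitted here). A hypothetical sketch via Python's sqlite3 module, with an invented emp table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("A", 100), ("B", 250), ("C", 400)])

# The inner query computes a single value (the average salary);
# the outer query uses that value in its WHERE condition.
rows = conn.execute(
    "SELECT name FROM emp WHERE salary > (SELECT AVG(salary) FROM emp)"
).fetchall()
print(rows)  # [('C',)]
```

    The average is 250, so only C (salary 400) is strictly above it.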

    Understanding Relational Database Tables

    In relational databases, data is stored in the form of tables. These tables are the fundamental structure for organizing and managing data. You can think of a table as a grid composed of rows and columns.

    Here’s a breakdown of the structure of a database table:

    • Table Name: Each table has a name that identifies the data it holds, for example, “players”, “employees”, “customers”, or “orders”.
    • Columns (or Fields or Attributes):
    • Columns are the vertical structures in a table.
    • Each column represents a specific attribute or category of information about the items stored in the table.
    • At the top of each column is a column name (also known as a field name) that describes the data in that column, such as “player ID”, “player name”, “country”, and “goals scored” in a “players” table. Other examples include “employee_ID”, “employee_name”, “age”, “gender”, “date of join”, “department”, “city”, and “salary” in an “employees” table.
    • Each column is associated with a specific data type that defines the kind of values it can hold. Examples of data types in SQL include integer, smallint, decimal, float, real, date, time, varchar, char, text, binary, etc. The data type ensures that all values stored in a specific column are of the same type or domain.
    • Columns are also sometimes referred to as fields in a database.
    • Rows (or Records or Tuples):
    • Rows are the horizontal structures in a table.
    • Each row represents a single instance or record (also called a tuple) of the entity that the table describes.
    • For example, in a “players” table, each row would contain the information for one specific player. In an “employees” table, each row would contain the details of a single employee.
    • Cells: The intersection of a row and a column forms a cell, which holds a single piece of data. Each cell holds exactly one atomic value, a rule required by the first normal form (1NF) of normalization.
    • Primary Key: A primary key is a special column or a set of columns that uniquely identifies each row in a table. It ensures that no two rows have the same primary key value, and it cannot contain null or empty values. Primary keys are crucial for linking tables together and maintaining data integrity. For instance, “employee_ID” could serve as a primary key in an “employees” table.
    • Index: Tables can be indexed on one or more columns to speed up the process of finding relevant information. An index creates a sorted structure that allows the database to locate specific rows more efficiently without having to scan the entire table.
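
    The primary-key and index behaviors described above can be exercised in a few lines. A sketch with Python's sqlite3 module; the employees schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
        employee_name TEXT NOT NULL,
        department TEXT
    )
""")
# An index on department lets the database find rows by that column
# without scanning the entire table.
conn.execute("CREATE INDEX idx_emp_dept ON employees(department)")

conn.execute("INSERT INTO employees VALUES (1, 'Asha', 'Sales')")

duplicate_rejected = False
try:
    # A second row with the same primary key value must be rejected.
    conn.execute("INSERT INTO employees VALUES (1, 'Ben', 'IT')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```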

    SQL (Structured Query Language) commands are used to interact with these tables. You can use SQL to query (retrieve), update, insert, and delete records in a table. The SELECT statement is used to retrieve data by specifying the columns you want to see and optionally filtering the rows based on certain conditions using the WHERE clause. INSERT is used to add new rows to a table, UPDATE to modify existing rows, and DELETE to remove rows.
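
    The query/insert/update/delete cycle looks like this in practice. A minimal sketch using Python's sqlite3 module, with the invented players table from the examples above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, goals INTEGER)")

conn.execute("INSERT INTO players VALUES ('Mia', 3)")            # INSERT adds a row
conn.execute("UPDATE players SET goals = 5 WHERE name = 'Mia'")  # UPDATE modifies it
goals = conn.execute("SELECT goals FROM players").fetchone()[0]  # SELECT retrieves it
print(goals)  # 5

conn.execute("DELETE FROM players WHERE name = 'Mia'")           # DELETE removes it
remaining = conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
print(remaining)  # 0
```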

    The logical structure of a database, including its tables and their relationships, can be visually represented using an Entity-Relationship (ER) diagram. An ER diagram shows entities (which often correspond to tables) and their attributes (which correspond to columns) and the relationships between these entities. This helps in understanding the information to be stored in a database and serves as a blueprint for database design.

    Understanding SQL: Core Concepts and Commands

    SQL (Structured Query Language) is a domain-specific language that serves as the backbone of data management and analysis for relational databases. It is the standard language used by most databases to communicate with and manipulate data. Initially developed by IBM, SQL allows users to interact with databases to store, process, analyze, and manage data effectively. As businesses become increasingly data-driven, proficiency in SQL is a crucial skill for data analysts, developers, and database administrators.

    Here are key aspects of the SQL query language based on the sources:

    • Core Functionality: SQL queries enable you to access any information stored in a relational database. This includes retrieving specific data, updating existing records, inserting new data, and deleting unwanted information.
    • Efficiency: SQL is designed to extract data from databases in a very efficient way. By specifying precisely what data you need and the conditions it must meet, you can minimize the amount of data processed and transferred.
    • Compatibility: The Structured Query Language is compatible with all major database systems, ranging from Oracle and IBM to Microsoft SQL Server and open-source options like MySQL and PostgreSQL.
    • Ease of Use: SQL is designed to manage databases without requiring extensive coding. Its syntax is relatively straightforward, focusing on declarative statements that specify what data should be retrieved or modified, rather than how to perform the operation.
    • Applications of SQL: SQL has a wide range of applications, including:
    • Creating databases and defining their structure (e.g., creating tables with specific columns and data types).
    • Implementing and maintaining existing databases.
    • Entering, modifying, and extracting data within a database. For instance, you can use INSERT to add new records, UPDATE to change existing ones, and SELECT to retrieve data.
    • Serving as a client-server language to connect the front-end of applications with the back-end databases that store the application’s data.
    • Protecting databases from unauthorized access when deployed as Data Control Language (DCL).
    • Types of SQL Commands: SQL commands are broadly categorized into four main types:
    • Data Definition Language (DDL): These commands are used to change the structure of the database objects such as tables. Examples include CREATE (to create tables), ALTER (to modify table structure), DROP (to delete tables), and TRUNCATE (to remove all rows from a table). DDL commands are auto-committed, meaning changes are permanently saved.
    • Data Manipulation Language (DML): These commands are used to modify the data within the database. Examples include SELECT (to retrieve data), INSERT (to add new rows), UPDATE (to modify existing rows), and DELETE (to remove rows). DML commands are not auto-committed, allowing for rollback of changes. The SELECT command is also referred to as Data Query Language (DQL).
    • Data Control Language (DCL): These commands control access to data within the database, managing user privileges and permissions. Examples include GRANT (to give users access rights) and REVOKE (to remove access rights).
    • Transaction Control Language (TCL): These commands manage database transactions. Examples include COMMIT (to save changes permanently) and ROLLBACK (to undo changes).
    • Basic SQL Command Structure: A typical SQL query follows a basic structure:
    • SELECT column1, column2, …
    • FROM table_name
    • WHERE condition(s)
    • GROUP BY column(s)
    • HAVING group_condition(s)
    • ORDER BY column(s) ASC|DESC;
    • The SELECT statement specifies the columns you want to retrieve. You can use SELECT * to select all columns.
    • The FROM statement indicates the table from which to retrieve the data.
    • The optional WHERE clause filters rows based on specified conditions. You can use comparison operators (e.g., >, =, <), logical operators (AND, OR, NOT), BETWEEN to select within a range, and IN to specify multiple values.
    • The optional GROUP BY clause groups rows that have the same values in one or more columns into summary rows, often used with aggregate functions.
    • The optional HAVING clause filters groups based on specified conditions (used with GROUP BY).
    • The optional ORDER BY clause sorts the result set in ascending (ASC) or descending (DESC) order based on one or more columns.
    • Data Types: SQL supports various data types to define the kind of data each column can hold, including exact numeric (integer, smallint, decimal), approximate numeric (float, real), date and time (date, time, timestamp), string (char, varchar, text), and binary data types (binary, varbinary, image).
    • Operators: SQL uses different types of operators to perform operations in queries, such as arithmetic operators (+, -, *, /), logical operators (ALL, ANY, BETWEEN, EXISTS, IN, LIKE, NOT, OR), and comparison operators (=, !=, >, <, >=, <=).
    • Functions: SQL provides built-in functions to perform various operations on data, including:
    • Aggregate functions: Calculate a single value from a set of rows (e.g., COUNT, SUM, AVG, MIN, MAX).
    • String functions: Manipulate text data (e.g., LENGTH, UPPER, LOWER, SUBSTRING, CONCAT, TRIM, POSITION, LEFT, RIGHT, REPEAT, REVERSE).
    • Date and time functions: Work with date and time values (e.g., CURDATE, DAY, NOW).
    • Mathematical functions: Perform mathematical calculations (e.g., ABS, GREATEST, LEAST, ROUND).
    • Joins: SQL allows you to combine data from two or more tables based on a related column. Different types of joins include INNER JOIN (returns rows only when there is a match in both tables), LEFT JOIN (returns all rows from the left table and matching rows from the right), RIGHT JOIN (returns all rows from the right table and matching rows from the left), and FULL OUTER JOIN (returns all rows when there is a match in either the left or right table). The UNION operator can also be used to combine the result sets of two or more SELECT statements.
    • Subqueries: A subquery (or inner query) is a query nested inside another SQL query. Subqueries can be used in the WHERE, SELECT, and FROM clauses to retrieve data that will be used by the outer query.
    • Stored Procedures: These are pre-compiled SQL statements that can be executed as a single unit. They can take parameters and return values, helping to encapsulate business logic and improve performance.
    • Triggers: Triggers are special types of stored procedures that automatically run when a specific event occurs in the database server (e.g., before or after an INSERT, UPDATE, or DELETE operation on a table).
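
    Triggers in particular are easy to demonstrate, since SQLite (used here via Python's sqlite3 module) supports AFTER INSERT triggers with syntax close to MySQL's. The accounts/audit_log schema is a made-up example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, balance INTEGER);
    CREATE TABLE audit_log (account_id INTEGER, note TEXT);

    -- The trigger fires automatically after every INSERT on accounts.
    CREATE TRIGGER log_new_account AFTER INSERT ON accounts
    BEGIN
        INSERT INTO audit_log VALUES (NEW.id, 'account created');
    END;
""")

conn.execute("INSERT INTO accounts VALUES (1, 500)")
log = conn.execute("SELECT * FROM audit_log").fetchall()
print(log)  # [(1, 'account created')]
```

    No explicit call writes to audit_log; the row appears as a side effect of the INSERT, which is exactly the automatic-execution behavior described above.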

    In summary, SQL is a powerful and versatile language essential for interacting with relational databases. It provides a structured way to define, manipulate, and retrieve data, making it a cornerstone of modern data management and analysis.

    Essential Skills for Aspiring Data Analysts

    Based on the sources, becoming a data analyst requires a combination of technical and soft skills. The document “01.pdf” outlines several key skill areas for aspiring data analysts.

    According to the source, the steps to become a data analyst include focusing on skills as the first crucial step. These skills are categorized into six main areas:

    • Microsoft Excel Proficiency: While advanced tools exist, proficiency in Excel remains vital for data analysts. Its versatility in data manipulation, visualization, and modeling is unmatched, making it a foundational tool for initial data exploration and basic analysis.
    • Data Management and Database Management Skills: These are indispensable for data analysts as the volume of data grows, because efficient management and retrieval from databases are critical. Proficiency in DBMS systems and querying languages like SQL ensures analysts can access and manipulate data seamlessly. SQL is the backbone of data management and analysis: it allows data analysts to access any information stored in a relational database, including writing queries, joining tables, and using subqueries.
    • Statistical Analysis: This skill allows analysts to uncover hidden trends, patterns, and correlations within data, facilitating evidence-based decision-making. It empowers analysts to identify the significance of findings, validate hypotheses, and make reliable predictions.
    • Programming Languages (e.g., Python, R): Proficiency in programming languages like Python is essential for data analysis. These languages enable data manipulation, advanced statistical analysis, and machine learning implementations. The source also mentions R programming language as one of the tools a data analyst should be familiar with.
    • Data Storytelling and Data Visualization: This skill is paramount for data analysts. Data storytelling bridges the gap between data analysis and actionable insights, ensuring that the value of data is fully realized. The ability to present insights clearly and persuasively is crucial as data complexity grows. Tools like Tableau and Power BI are mentioned as data visualization tools.
    • Problem Solving and Soft Skills: Strong problem-solving skills are important for data analysts when dealing with complex data challenges and evolving analytical methodologies. Analysts must excel in identifying issues, formulating hypotheses, and devising innovative solutions. In addition to technical skills, data analysts in 2025 will require strong soft skills to excel. These include:
    • Communication: Data analysts must effectively communicate their findings to both technical and non-technical stakeholders, presenting complex data in a clear and understandable manner.
    • Teamwork and Collaboration: Data analysts often work with multidisciplinary teams alongside data scientists, data engineers, and business professionals. Collaborative skills are essential for sharing insights, brainstorming solutions, and working cohesively towards common goals.
    • Domain Knowledge: Knowledge of the domain in which the analyst works (e.g., pharmaceutical, banking, automotive) is essential. Without foundational domain knowledge, it can be difficult to provide accurate results.

    In summary, a data analyst needs a blend of technical skills in data manipulation (including SQL and Excel), statistical analysis, programming, and data visualization, along with crucial soft skills in communication, teamwork, and problem-solving, complemented by domain knowledge. SQL plays a fundamental role in a data analyst’s toolkit for interacting with databases.

    Understanding Database Management and SQL

    Based on the sources, database management encompasses the organized collection of structured information or data, typically stored electronically in a computer system. This data is managed using a Database Management System (DBMS), which acts as a storage system for the collection of data.

    Here are key aspects of database management as discussed in the sources:

    • Role of a DBMS: A DBMS is crucial for controlling and managing databases. It provides the necessary tools and functionalities to ensure data is easily retrieved, managed, and updated.
    • Relational Databases: A significant aspect of database management discussed in the source is relational databases. These systems store data in the form of tables. This tabular structure allows for organizing data into tables, rows (records or tuples), and columns (fields).
    • Organization and Indexing: In relational databases, data can be organized into tables with specific structures. Furthermore, data can be indexed to make it easier to find relevant information. An index helps speed up data retrieval operations. A table consists of:
    • Column Names (Fields): These are the attributes of the data stored in the table (e.g., player ID, player name, country, goals scored). Each column should have a unique name. All values within a specific column should be of the same data type or domain.
    • Rows (Records or Tuples): Each row represents a single instance of the entity being described by the table (e.g., information about a specific player).
    • SQL for Database Management: As highlighted earlier, SQL (Structured Query Language) is a domain-specific language used to communicate with databases [1]. It plays a vital role in database management by allowing users to:
    • Query databases to retrieve specific information.
    • Update databases to modify existing data.
    • Insert records to add new data.
    • Perform many other tasks related to managing and manipulating data.
    • Store, process, analyze, and manipulate databases.
    • Create a database and define its structure.
    • Maintain an already existing database.
    • Popular Databases: The source lists several popular database systems, including:
    • MySQL.
    • Oracle Database.
    • MongoDB (a NoSQL database).
    • Microsoft SQL Server.
    • Apache Cassandra (a free and open-source NoSQL database).
    • PostgreSQL.
    • Database Management Skills for Data Analysts: The earlier discussion of data analyst skills emphasized that data management and database management skills are indispensable for data analysts [3]. The increasing volume of data necessitates efficient management and retrieval from databases, making proficiency in DBMSs and querying languages like SQL critical. Data analysts need to be able to access and manipulate data seamlessly using SQL.
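    The relational concepts above (tables, rows/records, columns/fields, and indexing) can be made concrete with a short script. This is a minimal sketch using Python’s built-in sqlite3 module; the players table mirrors the example described later in the source (player ID, player name, country, goals scored), and every row other than player 103 is made-up sample data.

    ```python
    import sqlite3

    # Minimal sketch of the relational concepts above using an in-memory database.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Each column (field) holds a single data type; each row is one record (tuple).
    cur.execute("""
        CREATE TABLE players (
            player_id   INTEGER PRIMARY KEY,
            player_name TEXT,
            country     TEXT,
            goals       INTEGER
        )
    """)

    # Player 103 comes from the source's example; the other rows are sample data.
    cur.executemany(
        "INSERT INTO players VALUES (?, ?, ?, ?)",
        [(101, "Alice", "Spain", 4),
         (102, "Bruno", "Brazil", 9),
         (103, "Daniel", "England", 7)],
    )

    # An index on country speeds up retrieval for queries that filter on it.
    cur.execute("CREATE INDEX idx_players_country ON players (country)")

    cur.execute("SELECT player_name, goals FROM players WHERE country = 'England'")
    rows = cur.fetchall()
    print(rows)  # [('Daniel', 7)]
    ```

    The same CREATE TABLE, INSERT, CREATE INDEX, and SELECT statements work (with minor dialect differences) in MySQL, PostgreSQL, and the other relational systems listed above.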

    In essence, database management involves the strategic organization, storage, retrieval, and manipulation of data using a DBMS. Relational databases, structured in tables, are a common model, and SQL is the primary language used to interact with these systems for various management tasks. These skills are fundamental for professionals like data analysts who work with data to derive insights and support decision-making.

    SQL for Data Analysis Functions

    Based on the sources and the conversation history, data analysis involves inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. SQL plays a crucial role in performing many of these functions when the data resides in relational databases [1].

    Here are some key data analysis functions that can be performed using SQL, as supported by the sources:

    • Data Retrieval and Selection: SQL’s SELECT statement is fundamental for retrieving specific data required for analysis. You can choose particular columns from one or more tables. For example, to analyze player performance, you might select player name and goals scored from a players table.
    • Filtering Data: To focus on relevant subsets of data, the WHERE clause in SQL allows you to filter records based on specified conditions. For instance, you might analyze data only for players from a specific country.
    • Sorting Data: The ORDER BY clause enables you to sort the retrieved data based on one or more columns, which can help in identifying trends or outliers. You could sort players by the number of goals scored in descending order to see the top performers.
    • Removing Duplicates: The DISTINCT keyword is used to retrieve only unique values from a column, which can be important for accurate analysis, such as finding the number of unique cities represented in a dataset.
    • Aggregation: SQL provides aggregate functions that perform calculations on a set of rows and return a single summary value. These are essential for summarizing data:
    • COUNT(): To count the number of rows or non-null values. For example, counting the total number of employees.
    • SUM(): To calculate the total sum of values in a column. For example, finding the total salary of all employees.
    • AVG(): To calculate the average of values in a column. For example, finding the average age of employees.
    • MIN(): To find the minimum value in a column. For example, identifying the lowest salary.
    • MAX(): To find the maximum value in a column. For example, determining the highest salary.
    • Grouping Data: The GROUP BY clause allows you to group rows that have the same values in one or more columns into summary rows. This is often used in conjunction with aggregate functions to perform analysis on different categories. For instance, finding the average salary for each department.
    • Filtering Groups: The HAVING clause is used to filter groups created by the GROUP BY clause based on specified conditions, often involving aggregate functions. For example, identifying countries where the average salary is greater than a certain threshold.
    • Joining Tables: When data for analysis is spread across multiple related tables, JOIN operations in SQL are used to combine data from these tables based on common columns. This allows you to bring together relevant information for a comprehensive analysis, such as combining customer information with their order details. As mentioned in the source, you can even join three or more tables.
    • Using Inbuilt Functions: SQL provides various inbuilt functions that can be used for data manipulation and analysis. These include:
    • Mathematical Functions: For performing calculations (e.g., ABS(), MOD(), SQRT(), POWER()).
    • String Functions: For manipulating text data (e.g., LENGTH(), CONCAT(), UPPER(), LOWER(), SUBSTRING(), REPLACE()).
    • Date and Time Functions: For working with temporal data (e.g., CURRENT_DATE(), NOW(), extracting day, year).
    • Creating Calculated Fields: Using SQL, you can create new columns based on existing data through calculations or conditional logic. The CASE statement allows you to define different values for a new column based on conditions evaluated on other columns, enabling the categorization of data (e.g., creating a salary range category based on salary values).
    • Subqueries (Nested Queries): SQL allows you to write queries within other queries, which can be used to perform more complex data retrieval and analysis. For example, selecting employees whose salary is greater than the average salary calculated by a subquery.
    • Views: Views are virtual tables based on the result of an SQL statement. They can simplify complex queries and provide a focused perspective on the data, making analysis easier by presenting a subset of data in a more manageable format.
    • Common Table Expressions (CTEs): CTEs are temporary, named result sets defined within the scope of a single query. They can break down complex analytical queries into smaller, more readable, and manageable parts.
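    To see several of these functions working together, here is a compact, self-contained sketch that runs the SQL through Python’s built-in sqlite3 module (any relational database would do). The table names, columns, and salary figures are illustrative, not taken from the source; the queries demonstrate JOIN, aggregation with GROUP BY and HAVING, a CASE calculated field driven by a subquery, and a CTE.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Illustrative schema and sample data (not from the source).
    cur.executescript("""
        CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
        CREATE TABLE employees (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT,
            dept_id INTEGER REFERENCES departments(dept_id),
            salary  REAL
        );
        INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
        INSERT INTO employees VALUES
            (1, 'Ana',  1, 95000), (2, 'Ben', 1, 72000),
            (3, 'Cara', 2, 48000), (4, 'Dev', 2, 51000);
    """)

    # JOIN + aggregation + GROUP BY + HAVING:
    # departments whose average salary exceeds 60,000.
    cur.execute("""
        SELECT d.dept_name, AVG(e.salary)
        FROM employees e
        JOIN departments d ON d.dept_id = e.dept_id
        GROUP BY d.dept_name
        HAVING AVG(e.salary) > 60000
    """)
    dept_avgs = cur.fetchall()
    print(dept_avgs)  # [('Engineering', 83500.0)]

    # CASE builds a calculated field; the subquery supplies the overall average.
    cur.execute("""
        SELECT name,
               CASE WHEN salary > (SELECT AVG(salary) FROM employees)
                    THEN 'above average' ELSE 'at or below' END
        FROM employees
        ORDER BY salary DESC
    """)
    bands = cur.fetchall()
    print(bands)

    # A CTE names an intermediate result set: here, employees paid above
    # their own department's average salary.
    cur.execute("""
        WITH dept_avg AS (
            SELECT dept_id, AVG(salary) AS avg_salary
            FROM employees
            GROUP BY dept_id
        )
        SELECT e.name
        FROM employees e
        JOIN dept_avg a ON a.dept_id = e.dept_id
        WHERE e.salary > a.avg_salary
        ORDER BY e.name
    """)
    above_dept_avg = [r[0] for r in cur.fetchall()]
    print(above_dept_avg)  # ['Ana', 'Dev']
    conn.close()
    ```

    The same pattern scales: each clause in the list above composes with the others, so a single statement can retrieve, filter, join, aggregate, and categorize in one pass.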

    These data analysis functions, facilitated by SQL, are crucial skills for a data analyst, as highlighted in the earlier discussion of the skills necessary for this role. Proficiency in these SQL features allows data analysts to effectively extract, manipulate, summarize, and analyze data stored in databases to derive meaningful insights.

    SQL Full Course 2025 | SQL Tutorial for Beginners | SQL Beginner to Advanced Training | Simplilearn

    The Original Text

    Hello everyone, and welcome to SQL Full Course by Simplilearn. Have you ever wondered how apps manage data or how businesses handle massive data sets? The answer lies in SQL. Structured Query Language is the backbone of data management and analysis, making it a must-have skill for data analysts, developers, and database administrators. As industries become more data-driven, the demand for SQL experts has skyrocketed, and by 2025 job opportunities in fields like SQL development and data analysis will surge, with starting salaries reaching around $50,000 in the US and around 4 to 8 lakh per annum in India; even experienced professionals earn around $100,000, or 20 lakh per annum in India. This course will take you from a beginner level to SQL expert: you'll learn how to write queries, join tables, use subqueries, and apply SQL for hands-on data analysis, and by the end you'll be equipped to manage and manipulate data like a pro. So let's get started. But before that, if you're interested in making a career in data analytics, check out Simplilearn's postgraduate program in data analytics. This comprehensive course is designed to transform you into a data analytics expert, covering essential skills such as data visualization, statistical analysis, and machine learning, using industry-leading tools and technologies like Excel, R, Python, and even Tableau. The course link is mentioned in the description box below and in the pinned comment, so hurry up and enroll now.

    In this session we are going to learn about databases, how data is stored in relational databases, and we'll also look at some of the popular databases; finally, we'll understand various SQL commands on MySQL Server. Now let's get started with what a database is. According to Oracle, a database is an organized collection of structured information or data that is typically stored electronically in a computer system. A database is usually controlled by a database management system, or DBMS, so it is a storage system that holds a collection of data. Relational databases store data in the form of tables that can be easily retrieved, managed, and updated; you can organize data into tables, rows, and columns, and index it to make it easier to find relevant information. Now, talking about some of the popular databases: we have the MySQL database; we also have Oracle Database; then we have MongoDB, which is a NoSQL database; next we have Microsoft SQL Server; next we have Apache Cassandra, which is a free and open-source NoSQL database; and finally we have PostgreSQL.

    Now let's learn what SQL is. SQL is a domain-specific language used to communicate with databases. SQL was initially developed by IBM, and most databases use Structured Query Language, or SQL, for writing and querying data. SQL commands help you store, process, analyze, and manipulate databases. With this, let's look at what a table is. This is how a table in a database looks: the name of the table is players. On the top you can see the column names: we have the player ID, the player name, the country to which the player belongs, and the goals scored by each of the players; these are also known as fields in a database. Each row represents a record, or a tuple. So if you take the player ID 103, the name of the player is Daniel, he is from England, and the number of goals he has scored is seven. You can use SQL commands to query, update, and insert records, and do a lot of other tasks.

    Now we'll see what the features of SQL are. SQL lets you access any information stored in a relational database, and with SQL queries, data is extracted from the database in a very efficient way. Structured Query Language is compatible with all database systems, from Oracle and IBM to Microsoft, and it doesn't require much coding to manage databases. Now we will see applications of SQL. SQL is used to create a database, define its structure, implement it, and lets you perform many functions. SQL is also used for maintaining an already existing database. SQL is a powerful language for
    entering data, modifying data, and extracting data in a database. SQL is extensively used as a client-server language to connect the front end with the back end, supporting the client-server architecture. SQL, when deployed as Data Control Language (DCL), helps protect your database from unauthorized access.

    If you categorize the steps to become a data analyst, these are the ones: firstly, you need to focus on skills; following that, you need to have a proper qualification; then test your skills by creating a personal, individual project; after that, you must focus on building your own portfolio to demonstrate your caliber to recruiters; and then target entry-level jobs or internships to get exposure to real-world data problems. So these are the five important steps. Now let's begin with step one, that is, skills. Skills are basically categorized into six parts: data cleaning, data analysis, data visualization, problem solving, soft skills, and domain knowledge. The tools are Excel, MySQL, the R programming language, the Python programming language, and some data visualization tools like Tableau, Looker, and Power BI. Next come the soft-skill parts: problem-solving skills; domain knowledge, meaning the domain in which you're working — maybe a pharma domain, maybe the banking sector, maybe the automobile domain, etc.; and lastly, you need to be a good team player so that you can actively work with the team and solve problems collaboratively.

    Now let's move ahead and discuss each and every one of these in a bit more detail, starting with Microsoft Excel. While advanced tools are prevalent, proficiency in Excel remains vital for data analysts. Excel's versatility in data manipulation, visualization, and modeling is unmatched; it serves as a foundational tool for initial data exploration and basic analysis. Next, data management: database management skill is indispensable for data analysts, because as data volumes soar, efficient management and retrieval from databases is critical. Proficiency in DBMS systems and querying languages like SQL ensures analysts can access and manipulate data seamlessly. Following that we have statistical analysis. Statistical analysis allows analysts to uncover hidden trends, patterns, and correlations within data, facilitating evidence-based decision making; it empowers analysts to identify the significance of findings, validate hypotheses, and make reliable predictions. After that we have programming languages. Proficiency in programming languages like Python is essential for data analysis; these languages enable data manipulation, advanced statistical analysis, and machine learning implementations. Next comes data storytelling, also known as data visualization. Data storytelling skill is paramount for data analysts: it bridges the gap between data analysis and actionable insights, ensuring that the value of data is fully realized. In a world where data-driven communication is central to business success, data visualization skill is a cornerstone for data analysts; as data complexity grows, the ability to present insights clearly and persuasively is paramount. Next is managing your customers and problem solving. Managing all your customers' data and company relationships is paramount, and strong problem-solving skills are important for data analysts: with complex data challenges and evolving analytical methodologies, analysts must excel in identifying issues, formulating hypotheses, and devising innovative solutions. In addition to the technical skills, data analysts in 2025 will require strong soft skills to excel in their roles; here are the top ones. Data analysts must effectively communicate their findings to both technical and non-technical stakeholders, which includes presenting complex data in a clear and understandable manner. The next soft skill is teamwork and collaboration: data analysts often work in multidisciplinary teams alongside data scientists, data engineers, and business professionals, so collaborative skills are essential for sharing insights, brainstorming solutions, and working cohesively towards common goals. And last but not least, domain knowledge: knowledge of the domain in which you're currently working is really important — it might be a pharmaceutical domain, an automobile domain, the banking sector, and much more; unless you have basic foundational domain knowledge, you cannot continue in that domain with accurate results.

    Now, the next step was about the qualification to become a data analyst. Master's courses, online courses, and boot camps provide strong, structured learning that helps you gain in-depth knowledge and specialized skills in data analysis. Master's programs offer comprehensive, academically rigorous training and often include research projects, making sure you're highly competitive in the job market; online courses allow flexibility to learn at your own pace while covering essential topics; and boot camps offer immersive, hands-on training in a short period, focusing on practical skills. All three paths enhance your credibility, keep you updated on industry trends, and make you more attractive to potential employers. If you are looking for a well-curated all-rounder, then we have got you covered: Simplilearn offers a wide range of courses on data science and data analytics, from master's and professional certifications to postgraduate programs and boot camps from globally reputed and recognized universities. For more details, check out the links in the description box below and the comment section.

    Now proceeding ahead, we have projects for data analysts. These projects demonstrate practical skills in data cleaning, visualization, and analysis; they help build a portfolio showcasing your expertise and problem-solving abilities. Projects provide hands-on experience, bridging the gap between theory and real-world application, and they show domain knowledge, making you more appealing to employers in specific industries. Projects also enhance your confidence and prepare you to discuss real-world challenges in interviews. Proceeding ahead, the next step is about the portfolio for data analysts. A portfolio is a testament that demonstrates your skill and expertise through real-world projects, showcasing your ability to analyze and interpret data effectively. It provides tangible proof of your capabilities, making you stand out to employers; additionally, it highlights your domain knowledge and problem-solving skills, giving you a competitive edge during job applications and interviews. Last but not least, data analyst internships. Internships provide hands-on experience with real-world datasets, tools, and workflows, bridging the gap between theoretical knowledge and practical application. They offer exposure to industry practices, helping you understand how data is used to drive decisions. Internships also build your professional network, enhance your resume, and improve your chances of securing a full-time data analyst role.

    So let's understand what an ER diagram is. An entity relationship diagram describes the relationships of entities that need to be stored in a database. An ER diagram is mainly a structural design for the database: it is a framework made using specialized symbols to define the relationships between entities. ER diagrams are created based on three main components: entities, attributes, and relationships. Let's understand the use of an ER diagram with the help of a real-world example. A school needs all its student records to be stored digitally, so they approach an IT company to do so. A person from the company will meet the school authorities, note all their requirements, describe them in the form of an ER diagram, and get it cross-checked by the school authorities. Once the school authorities approve the ER diagram, the database engineers carry out further implementation. Let's have a view of an ER diagram. The following diagram showcases two entities, student and course, and their relationship. The relationship described between student and course is many-to-many, as a course can be opted for by several students and a student can opt for more than one course. Here, student is an entity and it possesses the attributes student ID, student name, and student age, and the course entity has attributes such as course ID and course name.

    Now that we have an understanding of the ER diagram, let us see why it has been so popular. The logical structure of the database provided by the diagram communicates the landscape of the business to different teams in the company, which is eventually needed to support the business. An ER diagram is a GUI representation of the logical structure of a database, which gives a better understanding of the information to be stored in the database. Database designers can use ER diagrams as a blueprint, which reduces complexity and helps them save time to build databases quickly. ER diagrams help you identify the entities that exist in a system and the relationships between those entities. After knowing its uses, we should get familiar with the symbols used in an ER diagram: the rectangle symbol represents entities; the oval symbol represents attributes; a rectangle embedded in a rectangle represents a weak entity; a dashed oval represents a derived attribute; a diamond symbol represents a relationship among entities; and a double oval symbol represents multivalued attributes.

    Now we should dive in and learn about the components of an ER diagram. There are three main components: entity, attribute, and relationship. Entities include weak entities; attributes are further classified into key attributes, composite attributes, multivalued attributes, and derived attributes; and relationships are classified into one-to-one, one-to-many, many-to-one, and many-to-many relationships. Let's understand these components, starting with entities. An entity can be either a living or a non-living component, and an entity is showcased as a rectangle in an ER
    diagram. In the example, both student and course are in rectangular shape and are called entities, and they are linked by the relationship "study" in a diamond shape. Let's transition to the weak entity: an entity that relies on another entity is called a weak entity, and a weak entity is showcased as a double rectangle in an ER diagram. In the example below, the school is a strong entity because it has a primary key attribute, school number; unlike the school, the classroom is a weak entity because it does not have any primary key, and the room number attribute here acts only as a discriminator, not a primary key.

    Now let us learn about attributes. An attribute exhibits the properties of an entity, and an attribute is illustrated with an oval shape in an ER diagram. In the example below, student is an entity, and the properties of the student, such as address, age, name, and roll number, are called its attributes. Let's see the first classification under attribute, that is, the key attribute. The key attribute uniquely identifies an entity from an entity set, and the text of a key attribute is underlined. In the example below, we have a student entity with the attributes name, address, roll number, and age; here, roll number can uniquely identify a student from a set of students, which is why it is termed a key attribute. Now we will see the composite attribute. An attribute that is composed of several other attributes is known as a composite attribute; an oval showcases the composite attribute, and the composite attribute's oval is further connected with other ovals. In the example below, we can see an attribute, name, which can have further subparts such as first name, middle name, and last name; an attribute with such further classification is known as a composite attribute. Now let's have a look at the multivalued attribute. An attribute that can possess more than one value is called a multivalued attribute, and these are represented with a double oval shape. In the example below, the student entity has the attributes phone number, roll number, name, and age; out of these, phone number can have more than one entry, and an attribute with more than one value is called a multivalued attribute. Let's see the derived attribute. An attribute that can be derived from other attributes of the entity is known as a derived attribute, and in the ER diagram the derived attribute is represented by a dashed oval. In the example below, the student entity has both date of birth and age as attributes; here, age is a derived attribute, as it can be derived by subtracting the student's date of birth from the current date.

    Now, after knowing attributes, let's understand the relationship in an ER diagram. A relationship is showcased by the diamond shape in the ER diagram, and it depicts the relationship between two entities. In the example below, student studies course: here both student and course are entities, and "study" is the relationship between them. Now let's go through the types of relationships. First is the one-to-one relationship: when a single element of an entity is associated with a single element of another entity, this is called a one-to-one relationship. In the example below, we have student and identification card as entities; a student has only one identification card, and an identification card is given to one student, so it represents a one-to-one relationship. Let's see the second one, the one-to-many relationship: when a single element of an entity is associated with more than one element of another entity, it is called a one-to-many relationship. In the example below, a customer can place many orders, but a particular order cannot be placed by many customers. Now we will have a look at the many-to-one relationship: when more than one element of an entity is related to a single element of another entity, it is called a many-to-one relationship. For example, students have to opt for a single course, but a course can be opted for by a number of students. Let's see the many-to-many relationship: when more than one element of an entity is associated with more than one element of another entity, it is called a many-to-many relationship. For example, an employee can be assigned to many projects, and many employees can be assigned to a particular project.

    Now, after having an understanding of the ER diagram, let us note the points to keep in mind while creating one. First, identify all the entities in the system; embed all the entities in a rectangular shape and label them appropriately — this could be a customer, a manager, an order, an invoice, a schedule, etc. Identify the relationships between entities and connect them using a diamond in the middle illustrating the relationship; do not connect relationships to each other. Connect attributes with entities and label them appropriately; the attribute should be in an oval shape. Ensure that each entity appears only a single time, and eradicate any redundant entities or relationships in the ER diagram. Make sure your ER diagram supports all the data provided to design the database, and make effective use of colors to highlight key areas in your diagrams.

    There are mainly four types of SQL commands. First, we have Data Definition Language, or DDL. DDL commands change the structure of a table, like creating a table, deleting a table, or altering a table. All DDL commands are auto-committed, which means they permanently save all the changes in the database; we have CREATE, ALTER, DROP, and TRUNCATE as DDL commands. Next, we have Data Manipulation Language, or DML. DML commands are used to modify a database and are responsible for all forms of changes in the database. DML commands are not auto-committed, which means they don't permanently save all the changes in the database; we have SELECT, UPDATE, DELETE, and INSERT as DML commands, and the SELECT command is also referred to as DQL, or Data Query Language. Third, we have Data Control Language, or DCL. DCL commands allow you to control access to data within the database; these commands are normally used to create objects related to user access and also control the distribution of privileges among users. So we
have Grant and revok which are the examples of data control language finally we have something called as transaction control language or TCL so TCL commands allow the user to manage database transactions commit and roll back our example of TCL now let’s see the basic SQL command structure so first we have the select state stat M so here you specify the various column names that you want to fetch from the table we write the table name using the from statement next we have the we Clause to filter out our table based on some conditions so you can see here we condition one condition two and so on then we have the group by Clause that takes various column names so you can write Group by column 1 column 2 and so on next we have the having Clause to filter out tables based on groups finally we have the order by Clause to filter out the result in ascending or descending order now talking about the various data types in SQL so we have exact numeric which has integer small int bit and decimal then we have approximate numeric which are float and real then we have some date and time data types such as date time time stamp and others then we have string data type which includes car the varar car and text finally we have binary data types and binary data types have binary VAR binary and image now let’s see some of the various operators that are present in SQL so first we have our basic arithmetic operators so you have addition the substraction multiplication division and modulus then we have some logical operators like all and any or between exist and so on finally we have some comparison operators such as equal to not equal to that’s greater than less than greater than equal to or less than equal to not less than or not greater than now let me take you to my MySQL workbench where we will learn to write some of the important SQL commands use different statements functions data types and operators that we just learned in this session we will learn how to install MySQL workbench 
and then we will run some commands firstly we will visit the official Oracle website that is myql.com and now we’ll move to the downloads page now scroll down and click on my SQL GPL downloads now under Community downloads click on my SQL installer for Windows the current versions are available to download I will choose this installer and click the download button now here just click on no thanks just start my download Once the installer has download it open it you may be prompted for permission click yes this opens the installer we will be asked to choose the setup type we will go with custom click next now you have to select the products you want to install we will install only the MySQL server my SQL shell and the MySQL workbench expand my SQL servers by double clicking on it and choose the version you want to install and click on this Arrow now you have to do the same thing for applications expand applications and choose the MySQL workbench version you want to install and click on the arrow and we’ll do the same thing for my SQL shell we’ll choose the latest version click on the Arrow so these are the products that have to be installed in a system now we will click next I’ll click execute to download and install the server this may take a while depending on your internet speed as the download is completed click next now you see the product configuration click next now we’ll configure our SQL Server instance here we will go with the default settings and click next and under authentication select use strong password encryption for authentication which is recommended and click on next now here set the password for your root user by the way root is the default user this user will have access to everything I will set my password now I’ll click on next and here also we’ll keep the default settings and click on next now to apply configuration we will execute the process once Sol the conf ification steps are complete click finish now you will see the installation is 
complete it will launch my SQL workbench and my SQL shell after clicking on finish now the shell and workbench has started now we’ll connect by clicking on the root user it will ask for a password enter the password and it will connect successfully yeah the workbench has started now we’ll just connect the server so first we’ll open command prompt now we will reach the path where MySQL files are present you go into this PC local d c program files my SQL my SQL Server 8.0 bin and now I’ll copy this path now we’ll open the command prom and write a command CD space and paste the link and press enter now we write another command that will be my SQL minus u space root minus p and enter now it will ask for your password just enter the password and press enter now the server has started and now we’ll see some commands in my SQL workbench first we will open my SQL workbench now we’ll click on the local instance my SQL 80 and enter the password to connect to the Local Host yeah the my SQL workbench has started now we’ll see some commands the First Command we will see is show databases show databases semicolon and now we will select the whole command and click on this execute button and here we will see the result in the result grit these are the databases that are stored already in the database now there are four databases that is information schema MySQL performance schema and SS now we will select one of the database we will use uh my SQL now we have selected the mySQL database and now in this database we will see which tables are stored in this mySQL database to see that we will run a command show tables we’ll select the command and click on the execute button the these are the tables that are stored in this mySQL database that is columns _ PR component DP and much more now let me now go ahead and open my MySQL workbench so in the search bar I’ll search for MySQL workbench you can see I’m using the 8.0 version I’ll click on it and here it says welcome to my SQL workbench 
and below, under Connections, you can see I have already created a connection: the local instance, the root user, the localhost, and the port number. Let me click on it; the username is root, I'll enter my password and hit OK, and this opens the SQL editor. This is how MySQL Workbench looks; here we'll learn some basic SQL commands. First, let me show the databases that are already present. The command is SHOW DATABASES (you can hit Tab to auto-complete); I'll add a semicolon, select it, and click the execute button at the top. In the output below, it says seven rows were returned, which means there are currently seven databases, and you can see their names. Now let's say I want to see the tables present inside the database called world. I'll use the command USE world, where world is the database name, and run it; now I'm using the world database. To display its tables, I can use the SHOW command and write SHOW TABLES with a semicolon, and this time I'll hit Ctrl+Enter to run it. You can see the tables inside the world database: three in total, city, country, and countrylanguage. If you want to see the rows present in one of the tables, use the SELECT command. I'll write SELECT *, where the star means display all the columns, then FROM and the table name, city; this command displays all the rows present inside the city table. If I hit Ctrl+Enter, you can see the message: 1,000 rows were returned, which means there are a thousand records in the city table, with an ID column, a Name column, CountryCode, District, and Population.
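The Workbench steps above boil down to a few statements. Here is a small runnable sketch using Python's built-in sqlite3 module as a stand-in for MySQL (SQLite has no SHOW TABLES, so the equivalent is a query against its sqlite_master catalog; the two sample rows are illustrative, not the real world database):

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL "world" database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE city (id INTEGER, name TEXT, countrycode TEXT)")
cur.executemany("INSERT INTO city VALUES (?, ?, ?)",
                [(1, "Kabul", "AFG"), (2, "Qandahar", "AFG")])

# SQLite's equivalent of MySQL's SHOW TABLES:
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)       # ['city']

# SELECT * FROM city works the same way as in MySQL.
rows = cur.execute("SELECT * FROM city").fetchall()
print(len(rows))    # 2
```

The same SELECT * query text would run unchanged against the real MySQL world database; only the catalog query differs.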
Similarly, you can check the structure of a table using the DESCRIBE command: I'll write DESCRIBE and then the table name, city, and run it. There you go: Field shows the column names, so we have ID, Name, CountryCode, District, and Population; Type shows the data type of each column, so District is char(20) and ID and Population are integers; Null shows whether the column allows null values or not; Key represents whether you have a primary key or foreign key; and the rest is extra information. Now let's learn how to create a table in MySQL using the CREATE TABLE command, but before that let me create a database, which I'll name sql_intro. The command is CREATE DATABASE followed by the database name, sql_intro; I'll add a semicolon and hit Ctrl+Enter, and you can see we have created a new database. If I run SHOW DATABASES again and scroll down, you can see the newly created database, sql_intro. Within this database we'll create a table called employee_details, which will hold the details of some employees. I'll use the CREATE TABLE command followed by the table name, employee_details; next, the syntax is to give the column names. My first column is the name column, the employee name, followed by the data type for that column: since name is a text column, I'll use varchar with a value of 25, so it can hold at most 25 characters. Next I want the age of the employee; age is always an integer, so I'll use int. Then we have the gender of the employee, represented as F for female and M for male, so I'm using the char (character) data type and I'll give
it a size of 1. Then let's have the date of join, or doj, which is going to be of data type date. Next we'll have the city column, the city to which the employee belongs; again this is varchar(15). Finally we'll have a salary column, kept as float, since a salary can be a decimal number as well. Now I'll add a semicolon. Let me quickly run through it: first I wrote the CREATE keyword, then TABLE, which is also a keyword, followed by the table name employee_details, and then the column names, name, age, gender, date of join, city, and salary, each with its data type. Let me run it: we have successfully created our first table. You can use the DESCRIBE command to see its structure: I'll write DESCRIBE emp_details. If I run this, under Field you can see the column names, then the data types; Null represents whether the column can accept null values or not; these are empty, and we haven't set any default constraint. Moving ahead, let's learn to add data to our table using the INSERT command. On a notepad I have already written my INSERT statement, so let me copy it and explain it piece by piece. We use an INSERT INTO statement followed by the table name, emp_details, and then, using VALUES, I have passed in all the records. First we have Jimmy, the name of the employee; then 35, which represents the age; then M, the gender; then the date of join; next the city to which the employee belongs; and finally the salary of the employee. This particular set of values represents one record, or tuple. Similarly, the next employee is Shane, with his age and other information; then we have Mary,
Dwayne, Sara, and the rest. Let me go ahead and run this: it inserts the values into the table we created, and you can see we have successfully inserted six records. To display the records, let me use the SELECT statement, SELECT * FROM emp_details. If I run it, you can see the table and its values: the name column, the age column, the date of join, city, and salary. Moving ahead, let's say you want to see the unique city names present in the table; in this case you can use the DISTINCT keyword along with the column name in the SELECT statement. If you look at the table, we have Chicago, Seattle, Boston, Austin, New York, and Seattle repeated again, and I only want to print the unique values. For that I can write SELECT DISTINCT city FROM emp_details. Running it returns five rows: Chicago; Seattle, which was repeated twice but appears only once; Boston; Austin; and New York. Now let's see how to use built-in aggregate functions in SQL. Suppose you want to count the number of employees in the table: use the COUNT function in the SELECT statement. I'll write SELECT, then the function name COUNT with the name column inside the brackets, FROM employee_details. Running it returns the total number of employees in the table: six in all. In the result the column is labelled count(name), which is not very readable, so SQL lets you give an alias to the resultant output.
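Put together, the table creation, insert, and DISTINCT/COUNT queries above look like this. A minimal runnable sketch using SQLite via Python's sqlite3 module (SQLite accepts the same core syntax, though its typing is looser than MySQL's; the six records loosely follow the names used in the demo and are not the exact data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Mirrors CREATE TABLE emp_details(...) from the walkthrough.
cur.execute("""CREATE TABLE emp_details (
    name VARCHAR(25), age INT, sex CHAR(1),
    doj DATE, city VARCHAR(15), salary FLOAT)""")

# Sample records; city 'Seattle' appears twice on purpose.
cur.executemany("INSERT INTO emp_details VALUES (?, ?, ?, ?, ?, ?)", [
    ("Jimmy", 35, "M", "2005-05-30", "Chicago", 70000),
    ("Shane", 30, "M", "1999-06-25", "Seattle", 55000),
    ("Mary", 28, "F", "2009-03-10", "Boston", 62000),
    ("Dwayne", 37, "M", "2011-07-12", "Austin", 57000),
    ("Sara", 32, "F", "2017-10-27", "New York", 72000),
    ("Amy", 25, "F", "2014-01-20", "Seattle", 48000),
])

# SELECT DISTINCT city: the duplicate Seattle collapses to one row.
cities = [r[0] for r in cur.execute("SELECT DISTINCT city FROM emp_details")]
print(len(cities))     # 5

# COUNT(name): total number of employees.
(total,) = cur.execute("SELECT COUNT(name) FROM emp_details").fetchone()
print(total)           # 6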
Here I can write SELECT COUNT(name) and use an alias with AS: I'll give the alias count_name and run the statement again. There you go: in the resultant output the column is now named count_name, which was the alias. Now suppose you want the total sum of salaries: you can use another aggregate function called SUM. I'll write my SELECT statement, and this time instead of COUNT I'll write SUM, giving the salary column inside the brackets, FROM employee_details. Running it returns the total sum of salaries; it simply adds up every value in the salary column. If you want the average salary instead, use the AVG function, which gives you the average of the salary column; you can see it here as the average salary, and you can give this an alias as well. You can also select specific columns from the table by naming them in the SELECT statement. So far we have been selecting all the columns; the star means every column from the employee details table. If you want only specific columns, mention their names in the SELECT statement. Say I want just the name, age, and city columns from employee_details; running it displays only those three columns. Next, SQL has a WHERE clause to filter rows based on a particular condition; the WHERE clause comes right after the table name. Suppose you want to find the employees with age greater than 30: let me show you how. I'll write
SELECT * FROM employee_details, and after this the WHERE clause: WHERE age > 30. Running it returns only the rows where age is greater than 30, excluding everyone else; we have four such employees. Now suppose you want only the female employees: a WHERE clause works here too. I'll write SELECT with just the name, the gender (the sex column here), and city, FROM employee_details WHERE sex = 'F', since I want only the female employees, and run the statement: our employee table has three female employees. Suppose you want the details of the employees who belong to Chicago or Austin; here you can use the OR operator. The OR operator in SQL returns a record if any of the conditions separated by OR is true. Since I want the employees from Chicago or Austin, I'll write SELECT * FROM emp_details WHERE city = 'Chicago' OR city = 'Austin', add a semicolon, and run it; in the output you can see all the employees from Chicago or Austin. There is another way to write the same SQL query: you can use the IN operator to specify multiple values. Let me copy this, and instead of the OR operator, after the WHERE clause I'll write WHERE city IN ('Chicago', 'Austin'), listing the city names inside the brackets separated by commas. This query is exactly the same as the one we wrote above, and running it gives the same output.
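The three filters just shown, a WHERE comparison, two equality checks joined by OR, and the equivalent IN list, can be sketched like this (SQLite stand-in via Python's sqlite3; the four rows are illustrative sample data, not the exact demo table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_details (name TEXT, age INT, sex TEXT, city TEXT)")
cur.executemany("INSERT INTO emp_details VALUES (?, ?, ?, ?)", [
    ("Jimmy", 35, "M", "Chicago"), ("Shane", 30, "M", "Seattle"),
    ("Mary", 28, "F", "Boston"), ("Dwayne", 37, "M", "Austin"),
])

# WHERE with a comparison: only rows with age > 30 survive.
older = cur.execute("SELECT name FROM emp_details WHERE age > 30").fetchall()
print(older)             # [('Jimmy',), ('Dwayne',)]

# OR: a row matches if either equality holds...
via_or = cur.execute(
    "SELECT name FROM emp_details "
    "WHERE city = 'Chicago' OR city = 'Austin'").fetchall()

# ...and IN expresses exactly the same condition more compactly.
via_in = cur.execute(
    "SELECT name FROM emp_details "
    "WHERE city IN ('Chicago', 'Austin')").fetchall()
print(via_or == via_in)  # True
```

IN is usually preferred once the list grows beyond two or three values, since each extra city is one more comma rather than a whole new OR clause.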
There you go: Jimmy and Dwayne, who are from Chicago and Austin respectively. SQL also provides the BETWEEN operator, which selects values within a given range; the values can be numbers, text, or dates. Suppose you want to find the employees whose date of join was between 1st January 2000 and 31st December 2010: I'll write SELECT * FROM emp_details WHERE doj BETWEEN '2000-01-01' AND '2010-12-31', giving the two date values. Every employee who joined between those two dates appears in the output; running it, we have two employees, Jimmy and Mary, who joined in 2005 and 2009 respectively. In the WHERE clause you can also use the AND operator to specify multiple conditions; the AND operator returns a record only if all the conditions separated by AND are true. Let me show an example: SELECT * FROM employee_details WHERE age > 30 AND sex = 'M'. Here I have specified two conditions, and a row appears only if both are true; running it, there are two employees who are male and older than 30. Now let's talk about the GROUP BY statement in SQL. The GROUP BY statement groups rows that have the same values into summary rows, for example when you want to find the average salary of employees in each department, and it is often used with aggregate functions such as COUNT, SUM, and AVG to group the result set by one or more columns. Say we want to find the total salary of employees by gender; in this case you can use the GROUP BY clause. I'll write SELECT sex, then the total sum of salary with the alias total_salary, FROM the employee_details table.
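The BETWEEN range check and the AND combination above can be sketched as follows (SQLite via sqlite3; note that BETWEEN is inclusive on both ends, and ISO-formatted date strings compare correctly as plain text; sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_details (name TEXT, age INT, sex TEXT, doj TEXT)")
cur.executemany("INSERT INTO emp_details VALUES (?, ?, ?, ?)", [
    ("Jimmy", 35, "M", "2005-05-30"), ("Mary", 28, "F", "2009-03-10"),
    ("Sara", 32, "F", "2017-10-27"), ("Dwayne", 37, "M", "2011-07-12"),
])

# BETWEEN selects an inclusive range of dates.
joined = cur.execute(
    "SELECT name FROM emp_details "
    "WHERE doj BETWEEN '2000-01-01' AND '2010-12-31'").fetchall()
print(joined)   # [('Jimmy',), ('Mary',)]

# AND requires every condition to hold at once.
both = cur.execute(
    "SELECT name FROM emp_details WHERE age > 30 AND sex = 'M'").fetchall()
print(both)     # [('Jimmy',), ('Dwayne',)]
```

Sara (joined 2017) falls outside the BETWEEN range, and she also fails the AND query despite being over 30, because the sex condition does not hold.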
Next I'll group it: GROUP BY sex. Let me run it: we have two genders, male and female, and you can see the total salary for each. What this SQL statement did was first group all the employees by gender and then compute each group's total salary. SQL provides the ORDER BY keyword to sort the result set in ascending or descending order; ORDER BY sorts in ascending order by default, and to sort in descending order you use the DESC keyword. Say I want to sort the employee details table by salary: I'll write SELECT * FROM emp_details and use the ORDER BY clause on the salary column. This sorts all the records in ascending order of salary, the default, and you can see the salary column sorted that way. If you want the salary column in descending order, add the DESC keyword; running it, this time the salary is sorted in descending order, with the other columns alongside. Now let me show some basic operations you can do with the SELECT statement. Suppose I write SELECT with an addition, say 10 + 20, and give it an alias of addition: running it gives the sum of 10 and 20, which is 30. Similarly, you can use the subtraction operator and change the alias to, say, subtract; run it and you get -10. There are also many built-in functions in SQL; here I'll show a few. Suppose you want to find the length of a text or string: you can use the LENGTH function. I'll write SELECT, then the LENGTH function (hitting Tab to auto-complete), say the length of 'India', with an alias of total_length. Running it returns 5, because there are five letters in India.
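The GROUP BY by gender and the ORDER BY sorting described above can be combined into one short sketch (SQLite via sqlite3; the four salaries are made-up sample figures):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_details (name TEXT, sex TEXT, salary REAL)")
cur.executemany("INSERT INTO emp_details VALUES (?, ?, ?)", [
    ("Jimmy", "M", 70000), ("Shane", "M", 55000),
    ("Mary", "F", 62000), ("Sara", "F", 72000),
])

# GROUP BY collapses rows sharing the same sex into one summary row each.
totals = dict(cur.execute(
    "SELECT sex, SUM(salary) AS total_salary "
    "FROM emp_details GROUP BY sex"))
print(totals)   # {'F': 134000.0, 'M': 125000.0}

# ORDER BY ... DESC puts the highest salary first.
top = cur.execute(
    "SELECT name FROM emp_details ORDER BY salary DESC").fetchone()
print(top)      # ('Sara',)
```

Note that GROUP BY returns one row per group (two here), while ORDER BY keeps every row and only changes their order.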
There's another function called REPEAT; let me show you how it works. I'll write SELECT REPEAT('@', 10): the @ symbol goes in single quotes because it is a text character, and I want to repeat it 10 times. Close the bracket and run it, and in the output it has printed @ ten times; you can count them. Now let's say you want to convert a string to upper case or lower case. I'll write SELECT with the UPPER function, say to convert the string 'India' to uppercase, without giving an alias this time. If I run it: my input had a capital I and everything else in small letters, and in the output it has converted the input to all caps. Similarly, if you want to print something in lower case, use the LOWER function; say this time everything in the input is upper case, and running it converts 'INDIA' to lower case. Now let's explore a few date and time functions. Say you want the current date: there's a function called CURDATE, short for current date. If I run it, you get the current date, 28th January 2021. If you want to extract the day from a date value, use the DAY function: I'll use DAY on the current date, and running it gives 28, today's day of the month. Similarly, you can display the current date and time with the NOW function, which returns the current date and time; you can see the date value followed by the current time. And this brings us to the end of our demo session, so let me scroll through what we have learned. First I showed how you can see the databases present in MySQL; then we used one of the databases and checked the tables in it; then we created another database called sql_intro for our demo purpose.
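The date and time functions covered above have SQLite spellings that differ from MySQL's, which is worth knowing if you follow along with sqlite3 instead of Workbench (a hedged sketch; note SQLite's 'now' is UTC, unlike MySQL's session time zone):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite equivalents of the MySQL functions from the demo:
#   CURDATE() -> date('now')        NOW() -> datetime('now')
(today,) = cur.execute("SELECT date('now')").fetchone()
(stamp,) = cur.execute("SELECT datetime('now')").fetchone()

# DAY(...) -> strftime('%d', ...); the CAST drops the leading zero.
(day,) = cur.execute(
    "SELECT CAST(strftime('%d', 'now') AS INTEGER)").fetchone()

print(today, stamp, day)
```

In MySQL itself you would simply write SELECT CURDATE(), NOW(), DAY(CURDATE()); as shown in the walkthrough.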
We used that database and then created the table employee_details with columns such as name, age, sex, date of join, city, and salary. I showed you the structure of the table; let me run DESCRIBE again so you get an idea of what it looks like. Then we went ahead and inserted a few records, records for six employees: the employee name, the age, the gender, the date of join, the city the employee belongs to, and the salary. Then we saw how to use the SELECT statement to display all the columns present in the table; we learned how to display the unique city names; we learned how to use different aggregate functions like COUNT, AVG, and SUM; then how to display specific columns from the table; we learned how to use the WHERE clause; then we used the OR operator, learned about the IN operator and the BETWEEN operator, and used the AND operator to combine multiple conditions. Finally we learned about GROUP BY, ORDER BY, and some basic SQL operations. Now it's time to explore some string functions in MySQL, so I have given a comment, string functions. First, say you want to convert a certain string into upper case: I can write SELECT with the UPPER function, and within the function you pass in the string, say 'India'. If you want, give an alias, say uppercase; I'll add a semicolon and run it. My input was in sentence case, and using the UPPER function we have converted everything into uppercase. Similarly, let me copy this and show that to convert a string into lower case you can use the LOWER function. I'll run it, and everything in the result is lower case; of course I need to change the alias to lowercase. Instead of LOWER, MySQL provides another function called LCASE, so I'll just edit this, write LCASE, and say I write
'INDIA' in uppercase; running it returns the same result. Moving on, say you want to find the length of a string: you can use the CHARACTER_LENGTH function. I'll write SELECT CHARACTER_LENGTH, again passing the string 'India', with an alias of total_length, and this time hit Ctrl+Enter to run the SQL command. It gives the right result, 5, because India has five characters. You can also apply these functions to a table; let me show you how. Say we already have the students table and you want the length of each student name: pass s_name to the function, give the same alias, total_length, and write FROM the table name, students. Running this gives 20 rows of output, but on its own it's hard to read, so let me also display the student names so we can compare their lengths. I'll run it again, and now you can see the result: Joseph has six characters, Vipul has five, Anubhav has seven, and so on down the list for the rest of the students. Instead of CHARACTER_LENGTH you can also use the function CHAR_LENGTH; it works the same way, and you can see it gives the same result, so you can use either one. There's another very interesting function called CONCAT: the CONCAT function joins two or more expressions together. I'll write SELECT with the CONCAT function and pass in my string values, say 'India', 'is in', 'Asia'. Let's run this and see the result: it has concatenated everything, but let's make it more readable by adding spaces in between, so it reads clearly as India is in Asia, and if you want, you can give
an alias as well, say merged. There you go. You can perform the same CONCAT operation on a table; I'm going to use the same students table. Say I want to return the student ID followed by the student name, and then merge the student name, a space, and the age of the student, with an alias of, say, name_age, FROM students. Let's see how this works: the result is very clear, with the student ID, the student name, and the concatenated column we created, name_age, which holds the student name followed by a space and the age; if I scroll down, you can see the rest of the results. Moving ahead, let's see how the REVERSE function works in MySQL: the MySQL REVERSE function returns a string with the characters in reverse order. Suppose I write SELECT REVERSE('India'), using the same string again; run it and you see all the characters printed in reverse order. Again, you can perform the same operation on a table: I'll write SELECT REVERSE and pass in the column student_name FROM students. Run it, and it gives 20 students with all the names printed in reverse order. Now let's see what the REPLACE function does: the REPLACE function replaces all occurrences of a substring within a string with a new substring. Let me show you what I mean: I'll write SELECT REPLACE and pass in the input string, say 'Orange is a vegetable', which is of course incorrect; I'm writing it deliberately so that I can replace the word vegetable with fruit. What the REPLACE function does is find where the word vegetable occurs within the input string and replace it with fruit; running it, the output now reads correctly, Orange is a fruit.
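Most of the string functions above carry over to SQLite with two caveats worth flagging: CONCAT is the || operator there, and SQLite has no REVERSE builtin, so plain Python slicing stands in for it in this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# UPPER, LOWER, and LENGTH behave the same in MySQL and SQLite.
(up, low, n) = cur.execute(
    "SELECT UPPER('India'), LOWER('INDIA'), LENGTH('India')").fetchone()
print(up, low, n)    # INDIA india 5

# MySQL's CONCAT('India', ' ', 'is in Asia') is the || operator in SQLite.
(merged,) = cur.execute("SELECT 'India' || ' ' || 'is in Asia'").fetchone()
print(merged)        # India is in Asia

# REPLACE substitutes every occurrence of the substring.
(fixed,) = cur.execute(
    "SELECT REPLACE('Orange is a vegetable', 'vegetable', 'fruit')"
).fetchone()
print(fixed)         # Orange is a fruit

# SQLite lacks REVERSE(); Python slicing stands in for it here.
print("India"[::-1])  # aidnI
```

If you run the same expressions in MySQL Workbench, CONCAT and REVERSE work as shown in the transcript.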
MySQL also provides some trim functions: LTRIM, RTRIM, and plain TRIM. Let me show you how LTRIM works: LTRIM removes the leading space characters from the string passed as an argument. I'll write SELECT with the LTRIM function and deliberately give a few spaces at the beginning of the string, the word 'India', and some spaces after it, and we'll see how LTRIM behaves. If I run this it gives me India, which is fair enough, but first let's find the length of the string: I'll use the LENGTH function on the string that has India along with the leading and trailing spaces, paste it here, add a semicolon, and run it. The entire string is 17 characters long. Now if I use LTRIM on the same string, it returns India without the leading spaces, and running LENGTH over that shows the difference, that is, how many spaces were deleted from the left of the string: it said 17 before, and with LTRIM it gives 12, because it has deleted five spaces from the left; you can count them, 1 through 5, and 17 - 5 is 12, which is correct. Similarly, you can use the RTRIM function, which removes the trailing spaces from a string; trailing spaces are the ones at the end, whereas LTRIM deletes the leading ones. Let me replace LTRIM with RTRIM, which stands for right trim, and see the result: the length is 10 now, because it has deleted seven spaces from the right of the string; you can count them, 1 through 7. You can also use the TRIM function, which deletes both the leading and the trailing spaces: if I just write TRIM and run it, it gives 5, because India is five characters long and all the leading and trailing spaces have been removed.
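The 17/12/10/5 arithmetic from the trim walkthrough can be reproduced directly, since LTRIM, RTRIM, TRIM, and LENGTH all exist in SQLite with the same behavior (a small runnable check via sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Five leading and seven trailing spaces around 'India' (5 chars): 17 total.
s = " " * 5 + "India" + " " * 7

row = cur.execute(
    "SELECT LENGTH(?), LENGTH(LTRIM(?)), LENGTH(RTRIM(?)), LENGTH(TRIM(?))",
    (s, s, s, s)).fetchone()
print(row)   # (17, 12, 10, 5)
```

The four numbers match the transcript: 17 - 5 leading spaces = 12 after LTRIM, 17 - 7 trailing spaces = 10 after RTRIM, and 5 after TRIM strips both sides.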
There's also a function called POSITION in MySQL. The POSITION function returns the position of the first occurrence of a substring in a string; if the substring is not found in the original string, the function returns zero. Say I write SELECT POSITION to find where 'fruit' is in the string 'Orange is a fruit', with an alias. There's an error here, the string should be within quotes; fixing that, let's run it and see the result. It says that at the 13th position we have the word fruit in our string 'Orange is a fruit'. The final function we are going to see is ASCII: the ASCII function returns the ASCII value for a specific character. Say I write SELECT ASCII of the letter lowercase a: running it gives the ASCII value, 97. If you want the ASCII value of '4', the result is 52. All right, in this session we are going to learn two important SQL clauses that are widely used: GROUP BY and HAVING. First we'll understand the basics of GROUP BY and HAVING, and then jump into MySQL Workbench to implement these statements. So let's begin: what is GROUP BY in SQL? The GROUP BY statement, or clause, groups records into summary rows and returns one record for each group: it groups the rows with the same GROUP BY expressions and computes aggregate functions for each resulting group. A GROUP BY clause is part of a SELECT expression, and within each group no two rows have the same value for the grouping column or columns. Below you can see the syntax of GROUP BY: first the SELECT statement with the column names we want, then FROM with the table name, followed by the WHERE condition; next the GROUP BY clause with its column names; and finally ORDER BY with column names. Now here is an example of the GROUP BY clause: we want to find the average salary of employees for each department.
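The POSITION and ASCII examples map onto SQLite's INSTR and UNICODE functions, which behave the same way for these inputs (a hedged sketch via sqlite3; UNICODE matches ASCII only for one-byte characters like these):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# MySQL's POSITION('fruit' IN ...) maps to SQLite's INSTR(haystack, needle);
# both are 1-based and return 0 when the substring is absent.
(pos,) = cur.execute(
    "SELECT INSTR('Orange is a fruit', 'fruit')").fetchone()
(missing,) = cur.execute(
    "SELECT INSTR('Orange is a fruit', 'veggie')").fetchone()
print(pos, missing)       # 13 0

# MySQL's ASCII() maps to SQLite's UNICODE() for plain ASCII characters.
(a_val,) = cur.execute("SELECT UNICODE('a')").fetchone()
(four_val,) = cur.execute("SELECT UNICODE('4')").fetchone()
print(a_val, four_val)    # 97 52
```

Counting by hand confirms position 13: "Orange is a " is twelve characters, so 'fruit' starts at the thirteenth.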
Here you can see the employees table: it has the employee ID, the employee name, the age of the employee, the gender, the date on which the employee joined the company, the department each employee belongs to, the city, and the salary in dollars; we'll actually be using this employees table in MySQL Workbench as well. If you were to find the average salary of employees in each department, this is how your SQL query with a GROUP BY clause would look: we have selected department, then the aggregate function AVG on the salary column with the alias average_salary, which appears in the output, FROM employees, and we have grouped it by department. In the output you can see the department names and the average salary of the employees in each department. Now let me take you to MySQL Workbench, where we'll implement GROUP BY and solve specific problems. I'm on my MySQL Workbench, so let me make my connection first and enter the password; this opens the SQL editor. First let me check the databases I have with the query SHOW DATABASES: running it shows the list of databases. I'm going to use my sql_intro database: I'll write USE sql_intro, which takes us inside this database, and run it. Now you can check the tables present in the sql_intro database: if I write SHOW TABLES, you can see the list of tables already present. To do our demo and understand GROUP BY as well as HAVING, let me first create an employees table: I'll write CREATE TABLE employees, then give the column name employee_id, the ID for each employee, with the data type integer, and I'll make employee_id the primary key. Next I'll give employee_name, whose
data type is varchar with a size of 25. The third column is the age column; age is obviously an integer. Then the gender column, where I'll use the char data type with a size of 1. Next the date of join, of data type date. We have a department column as well, varchar with size 20; then the city column, the city to which the employee belongs; and finally the salary column, which holds the salary for all the employees. Now let me select and run this: we have successfully created the table. To check that the table was created, use the DESCRIBE command: I'll write DESCRIBE employees, and you can see the structure of the table so far. Now it's time to insert a few records into the employees table. I'll write INSERT INTO employees and copy-paste the records I've already written in a notepad: this is my emp notepad, and you can see I have already put in the information for all the employees, so let me copy it and paste it here. Let me go to the top, verify that all the records are fine, and run the INSERT query: we have inserted 20 rows of information. Now let's check the records present in the employees table: I'll write SELECT * FROM employees, and running it shows the employee ID, the employee name, age, gender, the city, the salary, and in total the 20 records we inserted. Now let me run a few SQL commands to get a feel for the data. Say I want to see the distinct cities present in the table: I'll write SELECT DISTINCT city FROM employees. Running it shows a total of eight different cities in the employees table, including Chicago, Seattle, Boston, New York, Miami, and Detroit. Now let's
say you want to know the departments that are present: you can use SELECT DISTINCT department. Running it returns seven rows with the department names: sales, marketing, product, tech, IT, finance, and HR. Now let me show another SQL command, this time with an aggregate function: I want the average age of all the employees in the table, so I can write SELECT AVG, the aggregate function for average, passing in the age column, FROM employees. Running it, the average age of all the employees in our table is 33.3. Now say you want the average age of employees in each department; for this you need the GROUP BY clause. I'll give a comment here, find the average age in each department, and write SELECT department, AVG(age) FROM employees GROUP BY department. Running it shows our seven departments on the left and, on the right, the average age of employees in each. In the output it says AVG(age), which is not very readable, so I'll give it the alias average_age. If you want, you can also round the values to a given number of decimal places by wrapping the AVG function in the ROUND function; ROUND takes two parameters, the value and the number of decimal places to round to. Running it again, there you go: the rounded average age of the employees in each department. Now suppose you want the total salary of all the employees for each department: write SELECT department, then the SUM function with the salary column, FROM employees GROUP BY department. Running the query shows the different departments and, on the right, the total salary of the employees in each department.
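The per-department averaging with ROUND can be sketched as follows (SQLite via sqlite3; the five rows are made-up sample data standing in for the 20-row demo table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, age INT)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Anna", "Sales", 31), ("Ben", "Sales", 36),
    ("Cara", "HR", 29), ("Dev", "HR", 40), ("Eli", "HR", 33),
])

# ROUND(AVG(age), 2): one summary row per department, rounded to 2 places.
rows = cur.execute(
    "SELECT department, ROUND(AVG(age), 2) AS average_age "
    "FROM employees GROUP BY department ORDER BY department").fetchall()
print(rows)   # [('HR', 34.0), ('Sales', 33.5)]
```

HR averages (29 + 40 + 33) / 3 = 34.0 and Sales (31 + 36) / 2 = 33.5, matching the shape of the Workbench output described above.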
all the employees in each of these departments now here also you can give an alas name as total underscore salary let’s run it again and you can see the output here all right now moving ahead you can also use the aut by Clause along with the group by Clause let’s say you want to find the total number of employees in each City and group it in the order of employee ID so to do this I can use my select query I’ll write select count of let’s say employee ID and I want to know the city as well from employees Group by City And next you can use the order by Clause I’ll write order by count of employee ID and I’ll write DEC which stands for descending if I run this query you can see here on the left you have the count of employees and on the right you can see the city names so in Chicago we had the highest number of employees working that was four then we had Seattle Houston Boston Austin and the remaining also had two employees so in this case we have ordered our result based on the count of employee ID in descending order so we have the highest number appearing at the top and then followed by the lowest okay now let’s explore another example suppose we want to find the number of employees that join the company each year we can use the year function on the date of joining column then we can count the employee IDs and group the result by each year so let me show you how to do it so I’ll write select I’m going to extract Year from the date of join column I’ll give an alas as year next I’ll count the employee ID from my table name that is employees and I’m going to group it by Year date of join we give a semicolon all right so let’s run this great you see here in the result we have the year that we have extracted from the date of join column and on the right you can see the total number of employees that joined the company each year so we have in 2005 there was one employee similarly we have in 2009 there were two employees if I scroll down you have information of other 
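These GROUP BY patterns can be sketched end to end in a few lines. This is a minimal, runnable version using Python's built-in sqlite3 instead of MySQL Workbench; the table and column names mirror the demo, but the rows are invented, and SQLite has no YEAR() function, so strftime('%Y', ...) plays that role:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_id INTEGER, dept TEXT, age INTEGER,"
            " city TEXT, salary REAL, date_of_join TEXT)")
cur.executemany("INSERT INTO employees VALUES (?,?,?,?,?,?)", [
    (1, "Sales", 30, "Chicago", 80000, "2005-03-01"),
    (2, "Sales", 35, "Chicago", 70000, "2009-06-15"),
    (3, "HR",    28, "Seattle", 90000, "2009-08-20"),
    (4, "Tech",  40, "Chicago", 60000, "2012-01-10"),
])

# Average age per department, rounded to 2 decimal places
rows = cur.execute("""
    SELECT dept, ROUND(AVG(age), 2) AS average_age
    FROM employees GROUP BY dept
""").fetchall()

# Employee count per city, highest first
counts = cur.execute("""
    SELECT city, COUNT(emp_id) AS emp_count
    FROM employees GROUP BY city
    ORDER BY emp_count DESC
""").fetchall()

# SQLite has no YEAR(); strftime('%Y', ...) extracts the year instead
per_year = cur.execute("""
    SELECT strftime('%Y', date_of_join) AS yr, COUNT(emp_id)
    FROM employees GROUP BY yr
""").fetchall()
```

In MySQL itself the last query would read `GROUP BY YEAR(date_of_join)`; everything else transfers unchanged.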
Okay, now you can also use GROUP BY when joining two or more tables together. To show you this operation, let me first create a sales table. I'll write CREATE TABLE sales, and the sales table will have columns such as product_id, which is going to be of integer type; then the selling price of the product, which will be a float value; then the quantity sold for each product, which will be of integer type; and finally the state in which the item was sold, which I'll make varchar with a size of 20. Let's run this to create our sales table. All right, we have successfully created it. Next we need to insert a few values into the sales table, and I have already written the records in a Notepad file, so let me show you. Here you can see my sales text file; I'll just copy this information and paste it into the query editor. Okay, now let me go ahead and run this INSERT command. All right, you can see we have successfully inserted nine rows of information, so let me quickly

run through what we have inserted. The first column is the product_id, then we have the selling price at which the product was sold, then the quantity that was sold, and the state in which it was sold, so we have California, Texas, Alaska; then we have another product ID, 123, and these are the states in which that product was sold. Let me just confirm with a SELECT statement: I'll write SELECT * FROM sales, and if I run this you can see we have successfully populated our table. Okay, now suppose you want to find the revenue for both product IDs, 121 and 123, since we have just these two product IDs here. For that you can use a SELECT query: I'll write SELECT product_id, and next I want to calculate the revenue. Revenue is nothing but selling price multiplied by quantity, so I'll use the SUM function, and inside it I'll multiply my selling price column by my quantity column. I'll give this an alias name, revenue, FROM my table, sales, and finally I'll GROUP BY product_id. Let's run it: there you go, you can see the two product IDs, 121 and 123, and the revenue that was generated from each of these products. All right, now let's say we have to find the total profit that was made from both products, 121 and 123. For that I'll create another table, which will hold the cost price of both products. Let me create the table first: I'll write CREATE TABLE, and let's say the table name is c_product, which stands for the cost price of the products. I'll give my first column as product_id, an integer, and my second column as cost_price, which will have floating-point values. Let's run this; we have successfully created our product-cost table. Now let me insert a few values into the c_product table: I'll write INSERT INTO c_product and give my values. For product 121, let's say the cost price was $270 each, and for product 123, let's say the cost price was $250. Let's insert these two rows. Okay, next we'll join our sales table and the product-cost table, which will give us the profit that was generated for each of the products. I'll write SELECT c.product_id comma SUM of s.selling_price minus c.cost_price, because subtracting the cost price from the selling price returns the profit per unit; I'll multiply this by s.quantity, close the bracket, and give an alias name, profit. Then FROM sales AS s, where s stands for the sales table, INNER JOIN c_product AS c, where c is the alias name, ON s.product_id = c.product_id. We are joining on product_id because this column is common to both tables, and finally I'll GROUP BY c.product_id. So let me tell you what I have done here: I'm selecting the product ID; I'm calculating the profit by subtracting the cost price from the selling price and multiplying by the quantity; I'm using an inner join to connect the sales and product-cost tables on the product_id column; and I have grouped by c.product_id. Let's run this. There you go: for product ID 121 we made a profit of $1,100, and for product ID 123 we made a profit of $840. So now that we have learned GROUP BY in detail, let's learn about the HAVING clause in SQL. The HAVING clause operates on grouped records and returns only those rows where the aggregate-function results match the given conditions. Now, HAVING and WHERE are kind of similar, but the WHERE clause can't be used with an aggregate function. Here you can see the syntax of the HAVING clause: you have the SELECT statement followed by the column names FROM the table name, then the WHERE conditions, next the GROUP BY, then HAVING, and at last ORDER BY column names. So we have a question at hand: we want to find the cities where there are more than two employees, using the same employees table from our GROUP BY demo. This is how your SQL query should look: we select the city column and count the employee IDs using the COUNT function FROM employees, group by city, and then use our HAVING clause with the condition HAVING COUNT of employee_id greater than two. In the output we have the city names where the count of employees was greater than two. All right, let's go to our MySQL Workbench and implement how HAVING works. Suppose you want to find those departments where the average salary is greater than $75,000; you can use the HAVING clause for this. Let me first query my employees table: if I run this, you can see we had inserted 20 rows of information, and the last column we had was salary. So the question we have is to find the departments where the average salary is greater than $75,000; let me show you how to do it.
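Here is a compact, runnable sketch of the revenue and profit calculations above, again using Python's sqlite3 rather than MySQL Workbench. The prices and quantities below are invented, so the totals differ from the demo's figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (product_id INTEGER, selling_price REAL,"
            " quantity INTEGER, state TEXT)")
cur.execute("CREATE TABLE c_product (product_id INTEGER, cost_price REAL)")
cur.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (121, 300.0, 2, "California"),
    (121, 320.0, 1, "Texas"),
    (123, 280.0, 3, "Alaska"),
])
cur.executemany("INSERT INTO c_product VALUES (?,?)", [(121, 270.0), (123, 250.0)])

# Revenue per product: SUM(selling_price * quantity) with GROUP BY
revenue = cur.execute("""
    SELECT product_id, SUM(selling_price * quantity) AS revenue
    FROM sales GROUP BY product_id
""").fetchall()

# Profit per product: join the two tables on product_id, then aggregate
profit = cur.execute("""
    SELECT c.product_id, SUM((s.selling_price - c.cost_price) * s.quantity) AS profit
    FROM sales AS s
    INNER JOIN c_product AS c ON s.product_id = c.product_id
    GROUP BY c.product_id
""").fetchall()
```

The join runs first, so each sales row is paired with its cost price before SUM collapses the rows per product.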
I'll write SELECT department comma, then the aggregate function, average of salary, with an alias name avg_salary, FROM employees. Next we'll use the GROUP BY clause to group by department, and then I'll write my HAVING clause with the condition HAVING AVG of salary greater than 75,000. Let's run it and see the output. There you go: there were three departments in the company, Sales, Finance, and HR, where the average salary is greater than $75,000. Okay, next let's say you want to find the cities where the total salary is greater than $200,000. This will again be a simple SQL query: I'll write SELECT city comma, the SUM function with my salary column, with an alias name total, FROM employees GROUP BY city, and then my HAVING clause with the condition HAVING SUM of salary greater than 200,000. All right, let's run this query. There you go: the cities are Chicago, Seattle, and Houston, where the total salary was greater than $200,000. Now suppose you want to find the departments that have more than two employees; let's see how to do it. I'll write SELECT department comma, and this time, since I want the number of employees, I'm going to use the COUNT function: I'll write COUNT(*) AS employee_count, or emp_count, which is my alias name, FROM employees. Next I'll GROUP BY department, HAVING my condition COUNT(*) greater than 2. Let's run this. Okay, so you have departments such as Sales, Product, Tech, and IT where there are more than two employees. Okay, now you can also use a WHERE clause along with the HAVING clause in an SQL statement. Suppose I want to find the cities that have more than two employees, apart from Houston. I can write my query as SELECT city comma COUNT(*) AS emp_count FROM employees WHERE my condition is city not equal to Houston, with Houston in quotes, since I don't want to see the information for Houston. I'll GROUP BY city HAVING the count of employees greater than two. If I run this query, you see we have information for Chicago and Seattle only, and we have excluded the information for Houston. Now, you may also use aggregate functions in the HAVING clause that do not appear in the SELECT clause. So if I want to find the total number of employees for each department that has an average salary greater than $75,000, I can write it like this: SELECT department comma COUNT(*) AS emp_count FROM employees GROUP BY department, and in the HAVING clause I'll provide a column that is not present in the select expression: I'll write HAVING AVG of salary greater than 75,000. This is another way to use the HAVING clause; let's run it. All right, you can see the departments Sales, Finance, and HR, along with the employee count, for the departments where the average salary was greater than $75,000. Okay, so let me run you through what we did in this demo from the beginning. First we created a table called employees, then we inserted 20 records into it. Next we explored a few SQL commands like DISTINCT, then we used AVG, and then we started with our GROUP BY clause, followed by looking at how GROUP BY can be used along with another table: we joined two tables, the sales and product-cost tables, to find the profit. Then you learned how to use the HAVING clause, and we explored several different questions along the way. In this session we will learn about joins in SQL. Joins are really important when you have to deal with data that is present in multiple tables. I'll help you understand the basics of joins and the different types of joins, with hands-on demonstrations in MySQL Workbench. So let's get started with what joins are in SQL. An SQL join statement, or command, is often used to fetch data that is present in multiple tables.
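Before moving on to joins, the HAVING examples above can be sketched with sqlite3. The key point is ordering: the WHERE filter runs before grouping, while HAVING filters the groups afterwards. The data below is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_id INTEGER, dept TEXT, city TEXT, salary REAL)")
cur.executemany("INSERT INTO employees VALUES (?,?,?,?)", [
    (1, "Sales",   "Chicago", 90000),
    (2, "Sales",   "Chicago", 80000),
    (3, "Sales",   "Houston", 70000),
    (4, "Finance", "Seattle", 95000),
    (5, "Tech",    "Houston", 50000),
])

# Departments whose average salary exceeds 75,000 (HAVING filters the groups)
high_avg = cur.execute("""
    SELECT dept, AVG(salary) AS avg_salary
    FROM employees GROUP BY dept
    HAVING AVG(salary) > 75000
""").fetchall()

# WHERE + HAVING together: cities other than Houston with 2 or more employees
busy = cur.execute("""
    SELECT city, COUNT(*) AS emp_count
    FROM employees
    WHERE city <> 'Houston'
    GROUP BY city
    HAVING COUNT(*) >= 2
""").fetchall()
```

Note that the Houston rows are dropped by WHERE before any grouping happens, so they never contribute to a city's count.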
Joins are used to combine rows of data from two or more tables based on a common field or column between them. Consider this example where we have two tables, an orders table and a customers table. The orders table has information about the order ID, which is unique here; the order date, that is, when the order was placed; the shipped date, the date on which the order was shipped; the product name, which is basically the names of different products; the status of delivery, whether the product was delivered or cancelled; the quantity, meaning the number of products that were ordered; and finally the price of each product. Similarly, we have another table called customers, and this table has the order ID, which is the foreign key here; the customer ID, which is the primary key for this table; and the phone number, customer name, and address of the customers. Now suppose you want to find the phone numbers of customers who have ordered a laptop. To solve this problem we need to join both tables, the reason being that the phone numbers are present in the customers table, as you can see here, while laptop, which is a product name, is present in the orders table. So using a join statement you can find the phone numbers of customers who have ordered a laptop. Now let's see another problem, where you need to find the names of customers who have ordered a product in the last 30 days. In this case we want the customer name, present in the customers table, and the last 30 days of order information, which you can get from the order date column in the orders table. Okay, now let's discuss the different types of joins one by one. First we have the inner join: the SQL inner join statement returns rows from multiple tables as long as the join condition is met. From the diagram, you can see that
there are two tables, A and B: A is the left table and B is the right table, and the orange portion represents the output of an inner join, which means an inner join returns the common records from both tables. You can see the syntax here: we have the SELECT command, then the list of columns from table A, which is the left table, followed by the INNER JOIN keyword and then the name of table B, ON a common key column from both tables A and B. Now let me take you to MySQL Workbench and show you how an inner join works in practice. Here I'll type MySQL; you can see I have MySQL Workbench version 8.0 installed. I'll click on it, it will take some time to open, then I'll click on this local instance and enter my password. Okay, so this is how the SQL editor in MySQL Workbench looks. First of all, let me go ahead and create a new database: I'll write CREATE DATABASE, which is going to be my command, followed by the name of the database, sql_joins; I give a semicolon and hit Ctrl+Enter. This will create a new database; you can see here "one row affected". Now you can check whether the database was created or not using the SHOW DATABASES command; if I run it, you can see the sql_joins database has been created. Now I'll use this database, so I'll write USE sql_joins. Okay, now to understand inner joins, consider that there is a college, and in every college you have teams for different sports, such as cricket, football, basketball, and others. So let's create two tables, cricket and football. I'll write CREATE TABLE, and my table name is going to be cricket. Next, I'm going to create two columns in this table. The first column is going to be cricket_id; I'll give the data type as INT and use the AUTO_INCREMENT attribute. I'm using AUTO_INCREMENT because cricket_id is going to be my primary key. Then I'm going to give the names of the students who are part of the cricket team, and for this I'll use the varchar data type and give the length as 30. I'll give another comma and assign cricket_id as the primary key, giving cricket_id within brackets. The cricket ID is nothing but a unique identifier for each of the players, like the roll numbers you have in college. Okay, let me just run it. All right, we have successfully created our cricket table. Similarly, let me just copy this and paste it here to create another table called football, which will have the information of all the students who are part of the football team. Instead of cricket I'll name the ID column football_id, the name column will have the names of the students, and I'll change my primary key to football_id. All right, let me run this. Okay, so now we have also created our football table. The next step is to insert a few player names into both tables, so I'll write my INSERT INTO command. First let's load some data into our cricket table: I'll write cricket, give my name column, followed by VALUES, and here I'll give some names, such as, let's say, Stuart; another comma; the next player I'll choose is, let's say, Michael; similarly I'll add a few more, let's say we have Johnson; the fourth player I'll take is, let's say, Hayden; and finally we have, let's say, Fleming. Okay, now I'll give a semicolon and run this. Let me just check whether all the values were inserted properly; for this I'll use SELECT * FROM cricket. If I run it, you can see we have created the table and successfully inserted five rows of information. Now, similarly, let's insert a few student names into our football table, so I'll change this to football. Obviously there will be students who are part of both the cricket and football teams, so I'll keep a few repeated names: let's say Stuart, Johnson, and Hayden are part of both the cricket and football teams; then we have, let's say, Langer, and, let's say, one more player in the football team, Astral. I'll just run it, and okay, you can see there are no errors.
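The same two-table setup can be sketched with sqlite3. SQLite spells auto-increment differently from MySQL, so the id columns below are declared INTEGER PRIMARY KEY, which auto-assigns ids the same way; the join at the end previews the "students on both teams" question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# INTEGER PRIMARY KEY auto-assigns ids in SQLite, much like AUTO_INCREMENT in MySQL
cur.execute("CREATE TABLE cricket (cricket_id INTEGER PRIMARY KEY, name VARCHAR(30))")
cur.execute("CREATE TABLE football (football_id INTEGER PRIMARY KEY, name VARCHAR(30))")
cur.executemany("INSERT INTO cricket (name) VALUES (?)",
                [("Stuart",), ("Michael",), ("Johnson",), ("Hayden",), ("Fleming",)])
cur.executemany("INSERT INTO football (name) VALUES (?)",
                [("Stuart",), ("Johnson",), ("Hayden",), ("Langer",), ("Astral",)])

# Students on both teams: inner join on the common name column
both = cur.execute("""
    SELECT c.name FROM cricket AS c
    INNER JOIN football AS f ON c.name = f.name
""").fetchall()
```

Only the three names that appear in both tables survive the inner join.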
So we have successfully inserted values into our football team as well; let me just recheck with SELECT * FROM football. All right, we have five players in the football team as well. Okay, now the question is: suppose you want to find the students that are part of both the cricket and football teams. In this case you can use an inner join, so let me show you how to do it. I'll write SELECT * FROM cricket AS c, using the alias name c, which stands for cricket; then I'm going to write INNER JOIN, and my next table is going to be football AS f, where f is an alias name for the football table. Then I'm going to use the ON operator and give the common key, which is name here, so c.name = f.name. Based on this name column from both tables, my inner join operation will be performed. So let's just run it: there you go, Stuart, Johnson, and Hayden are the only three students who are part of both teams. All right, now you can also individually select each of the columns from both tables: let's say I write SELECT c.cricket_id comma c.name comma f.football_id comma f.name FROM cricket AS c INNER JOIN football AS f ON c.name = f.name. Now if I run this, you see we get the same output here as well. All right, now let's explore another example to learn more about inner joins. We have a database called classicmodels; let me first run USE classicmodels. Okay, now let me just show the different tables that are part of classicmodels. All right, so here you can see there are tables like customers, employees, offices, orderdetails, orders, payments, products, and productlines as well. So let me use my SELECT statement to show the columns present in the products table. Okay, this products table has information about the different product names; you have the product code, and this product code is unique here. We also have the product vendor, a little description about the product, then the quantity in stock, the buying price, and the MSRP. Let's see what we have in productlines: if I run it, you see here we have the product line, which is the primary key for this table, and then a textual description for each product line, which is basically some sort of advertisement. All right, now suppose you want to find the product code, the product name, and the text description for each of the products; you can join the products and productlines tables. So let me show you how to do it. I'll write my SELECT statement and choose my columns as product code, then product name, and, since I want the text description, I'll write that column name as well. Then I'll use FROM with my first table, products, INNER JOIN productlines, USING the common key column, product line; close the bracket and give a semicolon. If I run it, there you go: you can see the different product codes, the different product names, and the textual description for each of the products. This we did by joining the products table and the productlines table.
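The products/productlines join can be sketched the same way. When the key column has the same name in both tables, USING(col) is shorthand for ON a.col = b.col; sqlite3 supports it too. The toy rows below imitate the classicmodels schema but are not the real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE products (product_code TEXT, product_name TEXT, product_line TEXT)")
cur.execute("CREATE TABLE productlines (product_line TEXT, text_description TEXT)")
cur.execute("INSERT INTO products VALUES "
            "('S10_1678', '1969 Harley Davidson', 'Motorcycles')")
cur.execute("INSERT INTO productlines VALUES "
            "('Motorcycles', 'Our motorcycles are state of the art.')")

# USING(product_line) joins on the identically named column in both tables
rows = cur.execute("""
    SELECT product_code, product_name, text_description
    FROM products
    INNER JOIN productlines USING (product_line)
""").fetchall()
```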
Now suppose you want to find the revenue generated from each product order, along with the status of the order. To do this task we need to join three tables: orders, orderdetails, and products. So first let me show you the columns in these three tables. You have obviously seen the products table, so let me show you the orders and orderdetails tables. I'll write SELECT * FROM orders: if I run it, you can see it has information about the order number, the date on which the order was placed, the shipment date, the status column, which tells whether the order was shipped or cancelled, a comments column, and the customer number of whoever ordered the product. Similarly, let's check what we have under orderdetails: I'll write SELECT * FROM orderdetails, and if I run it you can see it has the order number, the product code, the quantity of each product, the price of each product, and the order line number. Okay, so using the products, orders, and orderdetails tables, let's perform an inner join. I'll write SELECT o.orderNumber comma o.status comma, then the product name, which I'll take from the products table, so p.productName, where o, p, and od are all alias names for the tables orders, products, and orderdetails. Since we want to find the revenue, we actually need the product of the quantity ordered and the price of each item, so I'll use a SUM function, and inside the SUM function I'll give quantityOrdered multiplied by priceEach, with an alias, revenue. Then I'll use my FROM clause: FROM orders AS o INNER JOIN orderdetails AS od ON o.orderNumber = od.orderNumber. I'll use another inner join, and this time we'll join the products table: INNER JOIN products AS p ON p.productCode = od.productCode. And finally I'll use the GROUP BY clause and group by order number. All right, let me run this. Okay, there's a mistake here we need to debug; it says "you have an error in your SQL syntax, check the manual". All right, I think the name of the table is actually orders, not order. Now let's run it. Okay, there's still an error: it says classicmodels.product doesn't exist, so again the table name is products, not product. Let's run it again. All right, there you go: we have the order number, the status, the product name, and the revenue, which we got using inner joins across three different tables. Now, talking about left joins: the SQL left join statement returns all the rows from the left table and the matching rows from the right table. So if you see this diagram, you can see we have all the rows from the left table, A, and only the matching rows from the right table, B, in the overlapped region. The syntax for an SQL left join is something like this: you have the SELECT statement, then the list of columns from table A, which is your left table, then the LEFT JOIN keyword followed by the next table, B, ON the common key column, written as A.key = B.key. Okay, now in our classicmodels database we have two tables, customers and orders; if you want to find the customer name and their order IDs, you can use these two tables. So first let me show you the columns present in customers; orders we have already seen. Okay, in the customers table you can see the customer number, the name of the customer, the contact last name, the contact first name, the phone number, an address column (there are two address columns, actually), the city name, the state, and other information as well. And similarly we have our orders table: I'll write SELECT * FROM orders, and if I run this you can see the information available in the orders table.
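A runnable miniature of this three-table revenue query, in sqlite3. The table and column names follow the classicmodels schema, but the rows are invented, and the product name is left out of the SELECT so the per-order grouping stays unambiguous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (orderNumber INTEGER, status TEXT)")
cur.execute("CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,"
            " quantityOrdered INTEGER, priceEach REAL)")
cur.execute("CREATE TABLE products (productCode TEXT, productName TEXT)")
cur.execute("INSERT INTO orders VALUES (10100, 'Shipped')")
cur.executemany("INSERT INTO orderdetails VALUES (?,?,?,?)", [
    (10100, "S18_1749", 30, 100.0),
    (10100, "S18_2248", 50, 55.0),
])
cur.executemany("INSERT INTO products VALUES (?,?)", [
    ("S18_1749", "1917 Grand Touring Sedan"),
    ("S18_2248", "1911 Ford Town Car"),
])

# Chain two inner joins, then aggregate revenue per order
rows = cur.execute("""
    SELECT o.orderNumber, o.status,
           SUM(od.quantityOrdered * od.priceEach) AS revenue
    FROM orders AS o
    INNER JOIN orderdetails AS od ON o.orderNumber = od.orderNumber
    INNER JOIN products AS p ON p.productCode = od.productCode
    GROUP BY o.orderNumber
""").fetchall()
```

Each orderdetails row matches exactly one product, so the second join changes no row counts; the SUM then collapses the two line items into one revenue figure per order.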
Okay, so let's perform a left join where we want to find the customer name and their order IDs. I'll write SELECT, and first we'll choose the customer number, comma, then I want the customer name, so c.customerName, then the order number column, which is present in the orders table, and let's say I also want to see the status. Then I'll give my left table, customers AS c, LEFT JOIN orders AS o ON c.customerNumber = o.customerNumber. Let's run it. Okay, again there is some problem: the table name is customers. Let's run it; there's another mistake here, this should be customerNumber, a letter was missing. Cool, let me run it. All right, so here you can see we have the information regarding the customer numbers, the respective customer names, the order numbers, and the status of the shipment. If I scroll down you'll notice one thing: there are a few rows with null values, which means that for customer number 125, and for this particular customer name, there were no orders. Similarly, if I scroll down you will find a few more null values; you can see two of them here, for customer numbers 168 and 169, where there were no orders available. All right, now to find those customers who haven't placed any orders, you can use the NULL check: I'll just continue with this query, use a WHERE clause, and write WHERE orderNumber IS NULL. Now let me run this. Okay, so here you can see there are 24 customers in the table who don't have any orders in their names. Okay, now talking about right joins: the SQL right join statement returns all the rows from the right table and only the matching rows from the left table. So here you can see we have our left table, A, and the right table, B; the right join returns all the rows from the right table and only the matching rows from the left table.
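The left join and the IS NULL trick for finding customers without orders can be sketched like this (invented customers and orders; the customer with no order gets NULL in its order columns, which the WHERE clause then isolates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customerNumber INTEGER, customerName TEXT)")
cur.execute("CREATE TABLE orders (orderNumber INTEGER, customerNumber INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?,?)",
                [(101, "Atelier graphique"), (103, "Signal Gift Stores")])
cur.execute("INSERT INTO orders VALUES (10100, 101)")

# LEFT JOIN keeps every customer; unmatched rows get NULL order columns
rows = cur.execute("""
    SELECT c.customerNumber, c.customerName, o.orderNumber
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customerNumber = o.customerNumber
""").fetchall()

# Filtering on IS NULL isolates the customers with no orders at all
no_orders = cur.execute("""
    SELECT c.customerName
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customerNumber = o.customerNumber
    WHERE o.orderNumber IS NULL
""").fetchall()
```

sqlite3 surfaces SQL NULL as Python's None, which is why the padded column shows up as None in the result tuples.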
Now, talking about the right join syntax: you have the SELECT statement, then the list of columns you want to choose from table A, then RIGHT JOIN table B ON the common key column from both tables. All right, now to show how a right join works, I'll be using two tables, customers and employees. So let's see the rows of data present in the customers table first: I'll write SELECT * FROM customers and run it. So here you have the customer number, the customer name, the phone number, the address of the customers; you also have the country the customer belongs to, the postal code, and the credit limit as well. Similarly, let's see the employees table: here I'll change customers to employees and run it. Okay, so we have the employee number, the last name, the first name; you have the extension, the email ID, the job title, and also reportsTo, which here means the manager. Okay, so based on these two tables we'll find the customer name, the phone number of the customer, and the email address of the employee, joining both tables, customers and employees. So let me show you the command: I'll write SELECT c.customerName comma c.phone, and next I want the employee number from the employees table, so e.employeeNumber comma e.email, FROM customers AS c RIGHT JOIN employees AS e ON — my common key column is the employee number here, so I'll write e.employeeNumber = c.salesRepEmployeeNumber — and I'm also going to ORDER BY the employee number column. Okay, so you can see I have my customer name selected from the customers table, the phone number of the customer, then the employee number and the email address. So let me run it. Okay, there's some problem: the table name is customers, actually. Let's run it once again; there you go. You can see here we have all the rows selected from our right table, which is the employees table (you can see RIGHT JOIN employees, which means your employees table is on the right), and then we have the customer names and phone numbers of the customers from the customers table, which is actually your left table. So you have a few employee numbers, such as 1002 and 1056, which don't have any customer name or phone number. Okay, so there's another popular join, very widely used in SQL, known as the self join. Self joins are used to join a table to itself. So in our database we have a table called employees; let me show you the table first. All right, here you can see we have the employee number, the last name, the first name of the employee; you have the email ID, and here, if you see, we have a column called reportsTo, which you can think of as the manager column. The way to read it is, for example: for employee number 1056, the manager is 1002, and if you check 1002 we have Diane Murphy. Then, if I scroll down, for employee number 1102 the manager is 1056, and at 1056 you have Mary Patterson. Similarly, if I scroll down, for employee number 1188 we have the manager as 1143, and if I check the table at 1143 we have Anthony Bow, so the employee Julie Firrelli reports to Anthony Bow. All right, now suppose you want to know who the reporting manager is for each employee; for that you can use a self join. So let me show you how to join this employees table to itself. I'll write SELECT and then use a function called CONCAT: within brackets I'll start with my alias name, m., then lastName; I'm going to concatenate the last name, followed by a comma, and then the first name; I'll close this bracket and give the alias name, let's say, manager. Next, comma, I'm going to concatenate the same last name and first name, and this time I'm going to use a separate alias, let's say e, which stands for employee: so I'll write e.lastName, comma, and within single quotes I'll give a comma, and then I'll write e.firstName; I close this bracket and give the alias, let's say, employee. Then FROM employees AS e INNER JOIN employees AS m ON — I'll use my common key column — m.employeeNumber = e.reportsTo, and then I'll order it by, let's say, manager. Okay, now let's run this. There you go: you have your two columns, manager and employee. So for employee Loui Bondur the manager is Gerard Bondur; similarly, if I scroll down, you can see there are multiple employees reporting to this particular manager; similarly we have our manager Anthony Bow, and different employees reporting to this particular manager, and so on. All right, now moving ahead, let's see what a full join is. The SQL full outer join statement returns all the rows where there is a match in either the left or the right table. Now, you must remember that MySQL Workbench does not support FULL OUTER JOIN by default, but there's a way to do it. By default, this is how the syntax of a full outer join looks; this statement will work on other SQL databases, like Microsoft SQL Server, but it won't work in MySQL Workbench. I'll show you the right way of emulating a full outer join in MySQL Workbench: to show a full outer join, I'm going to first use a left join, then we'll also use a right join, and finally we'll use a UNION operator. The UNION operator is used to combine the result sets of two or more SELECT statements. So first of all, let me write c.customerName — for this example I'm using the customers table and the orders table — comma o.orderNumber, since I just want to know the customer name and the order number related to the customer, FROM customers AS c LEFT JOIN orders AS o ON c.customerNumber = o.customerNumber. Let me just copy this, and after this I'm going to use my UNION operator, which is used to merge results from two or more queries; so basically, this
performs a vertical join and next I am going to use my right join operation so here instead of left join I’ll write right rest all looks fine let me just run it there you go so we have successfully run our full outer join operation you can see we have the different customer names and the order that each customer had placed all right so that brings us to the end of our demo session so let me just run through whatever we did in this session so first we created a database called SQL joins then we created two tables like cricket and football then we inserted a few rows into each of these tables then we used these tables to learn about inner join next we used a database called classic models it had multiple tables so we explored all of these tables like products product lines orders customers and employees and learned how to use inner join left join self join right join as well as full outer join in this video we will learn what is a subquery and look at the different types of subqueries then we learn subqueries with the select statement followed by subqueries with the insert statement moving further we will learn subqueries with the update statement and finally we look at subqueries with the delete statement all these we will be doing on our MySQL Workbench so let’s start with what is a subquery so a subquery is a select query that is enclosed inside another query so if I show you this is how the basic structure of a subquery looks like so here whatever is present inside the brackets is called the inner query and whatever is present outside is called the outer query so first the inner query gets executed and the result is returned to the outer query and then the outer query operation is performed all right now let’s see an example so we have a question at hand which is to write a SQL query to display the department with maximum salary from the employees table so
this is how our employees table looks like it has the employee ID the employee name age gender we have the date of join department city and salary now to solve this query my subquery would look like this so I’ll first select the department from my table that is employees where I’ll use the condition salary equal to and then I’ll pass in my inner query which is select max of salary from employees so what this does is it will first return the maximum salary of the employees in the table then our outer query will get executed based on the salary returned from the inner query so here the output is department sales has the maximum salary so one of the employees from the sales department earns the maximum salary if you see in our table the employee is Joseph who earns $115,000 all right and Joseph is from the sales department now let’s see how this query works so here we have another question which is to find the name of the employee with maximum salary in the employees table so this is our previous employees table that we saw and to find the employee who has the maximum salary my subquery would look something like this so I’m selecting the employee name from my table that is employees where I’m using the condition salary equal to and then I’m passing in my subquery or the inner query so first I’m selecting the maximum salary this will return a particular value that is the highest salary from the table and if you see our table the highest salary is $115,000 so our query becomes select employee name from employees where salary equal to $115,000 so the employee name is Joseph here and that’s the output now if you want to break it down here you can see first the inner query gets executed so our SQL query will first execute the inner query that is present inside brackets select maximum salary from employees the result is $115,000 and then based on the returned result our outer query gets executed so the query becomes select employee name from employees
where salary equal to $115,000 and that employee is Joseph all right now we’ll learn the different types of subqueries so you can write subqueries using the select statement update statement delete and insert statement we’ll explore each of these with the help of examples on my MySQL Workbench so let’s learn subqueries with the select statement so subqueries are majorly used with the select statement and this is how the syntax looks like you select the column name from the table name then you have the where condition followed by the columns that you want to pass the operator and inside that you have the subquery so here is an example that we will perform on our MySQL Workbench so in this example we want to select all the employees who have a salary less than the average salary for all the employees this is the output so let’s do this on my MySQL Workbench all right so let me log into my local instance I’ll give my password okay so you can see I’m on my MySQL Workbench so let’s start by writing our subquery using the select statement okay so for this demo session we’ll be using a database that is subqueries you can see it here I have a database called subqueries so I’ll use this subqueries database and we’ll create a few tables as well okay if I run it now we are inside the subqueries database so let me just show you the tables that are present inside this database I’ll write show tables if I run it okay there are two tables employees and employees_b we’ll use these tables throughout our demonstration all right now for our select subquery we want to fetch the employee name the department and the salary whose salary is less than the average salary so we will be using the employees table so let me first show you the records and the columns we have in the employees table so I’ll write select star from employees and run it okay you can see here we have 20 rows of information we have the employee name the employee ID age gender date of join department city and salary
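the select-with-subquery being described here can be sketched in one statement as below; the table name employees matches the demo, but the exact column names (emp_name, department, salary) are assumptions since the transcript only names them verbally:

```sql
-- minimal sketch of the "salary below average" subquery from the demo;
-- column names emp_name, department, salary are assumed from the narration
SELECT emp_name, department, salary
FROM employees
WHERE salary < (
    -- the inner query runs first and returns a single scalar value
    SELECT AVG(salary) FROM employees
);
```

because the inner query returns exactly one scalar, a plain comparison operator such as < works here; the IN operator shown later in the session is only needed when the subquery can return multiple rows.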
so this is the same table that we saw in our slides okay now for our subquery I’ll write select I want to choose the employee name the department and the salary there should be a comma here instead of a period next I’ll give my table name that is employees where my salary is less than and after this I’ll start my inner query or the subquery I’ll write select average salary so I’m using the AVG function to find the average salary of all the employees from my table that is employees if I give a semicolon and run this you’ll see the output so we have total 12 employees in the table whose salary is less than the average salary now if you want you can check the average salary so the average salary is $75,300 now the employees who have a salary less than the average salary so these are the people all right now moving back to our slides okay now let’s see how you can use subqueries with the insert statement now the insert statement uses the data returned from the subquery to insert into another table so this is how the syntax looks like so you write insert into table name followed by select individual columns from the table use the where clause and then you give the operator followed by the inner query or the subquery so here we will explore a table called products table we are going to fetch a few records from the products table based on a condition that is the selling price of the product should be greater than $1,000 so only those records will be fetched and put in our orders table all right so we are going to write this query on my MySQL Workbench so let’s do it I’ll give my comment as update subquery all right so first of all let’s create a table that is products so I’ll write create table products then we’ll give our column names the first column would be the product ID of type integer then we have the column as item or the product which is of type VARCHAR(30) next we have the selling price of the product the selling price will be of type float and finally we have
another column which is called the product type and again product type is of the data type VARCHAR I’ll give the size as 30 close the bracket and give a semicolon now let’s just run it okay so we have successfully created our products table now let’s insert a few records into our products table so I’ll write insert into products followed by values I’ll give four records the first product ID is 101 the product is let’s say jewelry then the selling price is let’s say $1,800 and the product type is it’s a luxury product next let’s insert one more product detail the product ID is 102 the product is let’s say t-shirt the price is let’s say $100 and the product type is non-luxury next I’ll just copy this to reduce our task we’ll edit this the third product’s ID is 103 the product is laptop and let’s say the price is $1,300 and it’s a luxury product I’ll paste again and finally I’ll enter my fourth product which is let’s say table and the price is $400 and it’s a non-luxury product I’ll give a semicolon and we’ll insert these four records into our products table you can see we have inserted four records let’s just print it now so I’ll write select star from products if I run it you can see we have our four products ready now we need to create another table where we are going to put some records from our products table so that new table is going to be the orders table so I’ll write create table orders now it will have three columns the order ID order ID will be of type integer then we have product underscore sold this will be of type varying character of size 30 and finally we have the selling price column this will be of type float let’s create our orders table the table name should be orders and there is some mistake here okay we should close the brackets okay let me run it so we have our orders table ready now let’s write our insert subquery so I’m going to insert into my table that is orders and I’ll select the product ID comma the item and the selling price or the
selling price from my table that is products where I’ll write product ID in I’ll write my inner query select prod ID or the product ID from products next I’ll give a where clause where the selling price is greater than $1,000 so let me tell you what I’m going to do here I’m going to insert into my orders table the product ID the item name and the selling price from my products table where the product ID has this condition so let me first run this condition for you which is select prod ID from products where the selling price is greater than 1,000 if I run this okay there is some issue here the column name is actually prod ID now let’s run it again so that we can see the product IDs of the products which have a selling price greater than 1,000 so it is 101 and 103 now let’s run the entire query there is another mistake here let’s debug the mistake now this should be product ID instead of product_in let’s insert again all right so we have successfully inserted two records into our table that is orders now let’s see the orders table I’ll write select star from orders if I run it there you go so there were two products from our products table that were jewelry and laptop which have a selling price greater than $1,000 so the selling price for jewelry was $1,800 and for laptop it was $1,300 so this is how you can use a subquery using the insert statement all right now going back to our slides again all right now let’s see how you can use subqueries with the update statement now subqueries can be used in conjunction with the update statement so either single or multiple columns in a table can be updated when using a subquery with the update statement so this is how the basic syntax of an update subquery looks like so you write update table followed by the table name you set the column name you give the where clause and operator and then you write your inner subquery so we are going to see an example where we’ll use this employees table and using this employees table we will
update the records of the salaries of the employees by multiplying it with a factor of 1.35 only for those employees who have age greater than or equal to 27 so we are going to use a new table called employees_b for this as well so let’s see how to do it so I’ll give my comment as update subquery before we see the subquery let’s see what we have in the table employees_b this is basically a replica of the employees table there you go it has the same records that our employees table has we are going to use both the employees table and the employees_b table to update our records so I’ll write update employees set salary equal to let me bring this to the next line I’ll write set salary equal to salary multiplied by 1.35 where age in then I’ll write select age from my other table that is employees_b where age is greater than or equal to let’s say 27 all right so let me run through this query and tell you what we are going to do so I’m going to update the records of the employees table specifically for the salary column so I’m checking if the age is greater than or equal to 27 then we’ll multiply the salaries of the employees by a factor of 1.35 in the employees table let me just run this then we’ll see our output okay so it says 18 rows affected which means there are a total of 18 employees in the table out of the 20 employees whose age is greater than or equal to 27 now if you see I’ll write select star from employees you can see the difference in the salaries if I scroll to the right you can see these are the updated salaries okay now if you check for employees who have an age less than 27 for example Marcus whose age is 25 his salary is the same we haven’t updated his salary then if you see okay there is one more employee Maya we haven’t updated the salary of Maya because the age is less than 27 all right now let’s go back to our slides again as you can see we got the same output on our MySQL Workbench now let’s explore how you can write subqueries with the delete statement now subqueries can again be used in conjunction with the delete statement so this is how the basic syntax of a delete query using a subquery would look like you write delete from the table name where clause the operator value followed by the inner query within brackets so here we are going to use the employees table and what we are going to do is we’ll delete the employees whose age is less than or equal to 32 so let’s see how you can do it all right so I’ll give my comment as delete subquery so we’ll follow the syntax that we saw I’ll write delete from my table name that is employees I’ll write where age in and then I’ll start my inner query or the subquery I’ll write select age from employees_b where age is let’s say greater than or equal to 32 or let’s say the age is less than or equal to 32 close the bracket and I’ll give my semicolon let me first run the inner query for you so that you get an idea of the employees who are less than 32 years of age so there are nine employees in the table who have an age less than or equal to 32 so we are going to delete those records if I run this okay it says nine records deleted
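the delete-with-subquery just executed can be summarized in a short sketch; it assumes the two tables employees and employees_b from the demo, with employees_b (a replica of employees) serving as the lookup table in the inner query:

```sql
-- minimal sketch of the delete subquery from the demo;
-- employees_b is a replica of employees, used in the inner query because
-- MySQL disallows referencing the delete target table in that subquery
DELETE FROM employees
WHERE age IN (
    -- the inner query runs first and returns the set of matching ages
    SELECT age FROM employees_b
    WHERE age <= 32
);
```

keeping a replica table like employees_b for the inner query is what lets this run on MySQL, which would otherwise reject a subquery that selects from the same table the outer DELETE is modifying.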
now let’s print or display what we have in the employees table if I run this there you go so if you see the table we have a total of 11 employees now and all their ages are greater than 32 because we have deleted all those employees who had an age less than or equal to 32 okay so let me show you from the beginning what we did so first we used our subqueries database then we used our employees table so we started by looking at how you can use the subquery with a select statement this should be insert instead of update so we learned how to write an insert subquery we used two tables products and orders moving ahead we saw how to write subqueries using the update command so we updated the salaries of the employees by a factor of 1.35 for those who had an age greater than or equal to 27 and finally we saw how to use the subquery using the delete statement so we deleted all those records for the employees whose age was less than or equal to 32 so let’s start with what is normalization normalization in DBMS is a method used to organize data within a database to reduce repetition by breaking down large data sets into smaller more manageable tables and ensuring these tables are properly related normalization helps prevent issues like data redundancy data redundancy means the unnecessary repetition or duplication of data within a database for example when the same piece of data is stored in multiple places it can lead to inconsistencies and take up more storage space than needed for example data redundancy before normalization you can see the table mentioned above where we have order ID customer ID customer name customer address product and quantity you might see some of the data which is being repeated again and again in the above table the customer address for John Doe is repeated three times let’s suppose if John Doe moves to a new address every occurrence of his address in the table must be updated if any instance is missed during the update it leads to inconsistencies and errors can occur in the
database the solution is reducing the redundancy through normalization let’s check out how so you can see this is the normalized table we have created first is the normalized customer table and then we have the order table so what are the benefits of normalization the address for John Doe is stored only once in the customer table if John Doe’s address changes it needs to be updated in one place ensuring consistency throughout the database this reduces the risk of errors and maintains data integrity the process involves multiple steps that transform data into a tabular format removing duplicates and establishing clear connections between different tables making the database more efficient and reducing problems like errors during data insertion updates or deletion let’s now discuss the types of DBMS normal forms normalization rules are categorized into different normal forms the first one is 1NF for a table to be in first normal form it must satisfy four rules single valued atomic attributes each column should contain only one value per row this means that there should be no repeating groups or arrays within a single column same domain values all values stored in a specific column should be of the same data type or domain for example if a column is meant to store dates all values in that column should be dates then we have unique column names each column in the table should have a unique name this ensures clarity and avoids confusion when referring to a specific column then we have order of data which doesn’t matter the order in which rows are stored in the table should not affect the data or its integrity let’s check the example of the first normal form consider the following unnormalized table customer ID customer name and the phone numbers as you can see the phone numbers are repeated twice the problem with the original table is the nonatomic values the phone numbers column contains multiple phone numbers separated by commas which violates the atomicity rule of 1NF converting to first normal form to bring this table into 1NF we must ensure that each column contains only atomic values this involves splitting the rows where there are multiple phone numbers as you can see we have split the data each row now has a single phone number ensuring that the phone number column contains atomic values same domain values all the values in the phone number column are consistent in format and type all are phone numbers then we can see the unique column names the columns customer ID customer name phone number have unique names satisfying the requirement order of data the order in which the rows appear does not matter as the data’s meaning and integrity are preserved by applying these rules the table now conforms to the first normal form eliminating any redundancy related to the phone numbers and ensuring data is stored in a more organized and efficient manner let’s go through each of these database normal forms step by step with simple examples to help you grasp the concepts more easily let’s talk about the second normal form for a table to be in second normal form it must satisfy the following conditions number one it must be in 1NF number two no partial dependency every non-key attribute should be fully dependent on the entire primary key not just part of it this rule applies primarily to tables with composite primary keys for an example of second normal form consider the following table that is in 1NF the order ID product ID product name quantity and the supplier name the problem with this table is the partial dependency the product name and the supplier name depend only on product ID not the entire primary key which is order ID and product ID this violates 2NF converting to second normal form to bring the table into 2NF we separate the data into two tables to remove partial dependencies the order table and the product table no partial dependency in the order table quantity is fully dependent on both order ID and
product ID in the product table product name and supplier name are dependent only on the product ID this ensures that each non-key attribute is fully dependent on the primary key bringing the tables into 2NF let’s now talk about the third normal form 3NF for a table to be in third normal form it must satisfy the following conditions number one it must be in 2NF number two there should be no transitive dependency where non-key attributes depend on other non-key attributes rather than the primary key let’s check out the example of a third normal form consider the following table that is in 2NF the problem with the 2NF table is the transitive dependency the instructor name is dependent on the course name and not directly on student ID or course ID and this violates 3NF so how do we convert this into 3NF to achieve 3NF we split the table to remove the transitive dependency into a student course table and a course table no transitive dependency now in the student course table there are no non-key attributes depending on other non-key attributes the course table stores the course and instructor information separately this structure eliminates transitive dependency ensuring the tables conform to 3NF let’s now talk about the Boyce-Codd normal form which is BCNF BCNF is an extension of the third normal form 3NF a table is in BCNF if it is in 3NF and for every functional dependency A implies B A should be a super key let’s check out the example of Boyce-Codd normal form BCNF so you can see this table here consisting of employee ID department and the manager the problem with this table is the BCNF violation in this table department determines manager but department is not a super key since employee ID is the primary key this violates BCNF so how do we convert this to BCNF to achieve BCNF we split the table to ensure that every determinant is a super key as you can see the employee table and the department table the super key requirement in the
employee table employee ID is the primary key and in the department table department is now the primary key the decomposition ensures that every functional dependency is satisfied by a super key meeting the requirements of BCNF let’s now talk about the fourth normal form which is 4NF a table is said to be in 4NF if it is in BCNF and has no multivalued dependencies so let’s consider an example of a fourth normal form consider a table where an employee can have multiple skills and work on multiple projects as you can see the employee ID skill and the project the problem with this table is the multivalued dependency an employee’s skill is independent of the project but both are stored in the same table this leads to a multivalued dependency violating 4NF so in order to achieve 4NF we separate the skills and the projects into different tables the employee skill table and the employee projects table and now you can see that there is no multivalued dependency by separating the skills and the projects we eliminate multivalued dependencies ensuring the tables conform to 4NF now let’s talk about the fifth normal form which is 5NF a table is said to be in fifth normal form if it is in 4NF and cannot be decomposed into any smaller tables without losing information also known as join dependency let’s consider an example of a fifth normal form this is a table here that records the relationship between suppliers parts and projects the problem with this table is the join dependency the table has a complex relationship between suppliers parts and projects that can be decomposed further so how do we convert this into fifth normal form in order to achieve 5NF we break the table into smaller related
tables the supplier parts table and the supplier projects table also a parts projects table eliminating join dependency by decomposing the table into three smaller tables we remove the complex relationship and eliminate the join dependency ensuring the tables conform to 5NF so currently I am on my MySQL Workbench let me connect to the local instance so I’ll give my password I’ll click on okay all right so this is my MySQL Workbench query editor so first we are going to learn subqueries let me give a comment and write subqueries all right so first of all let’s understand what a subquery is so a subquery is a query within another SQL query that is embedded within the where clause from clause or having clause so we’ll explore a few scenarios where we can use subqueries so for that I’ll be using my database that is SQL_intro so I’ll write my command use SQL_intro now this database has a lot of tables I’ll be using the employees table that is present inside SQL_intro let me just expand this and you can see here we have an employees table so let me first show you the contents within this table I’ll write select star from employees let me execute it okay you can see here we have the employee ID employee name age gender date of join department city and salary and we have information for 20 employees if I scroll down you can see there are 20 employees present in our table so let’s say you want to find the employees whose salary is greater than the average salary in such a scenario you can use a subquery so let me show you how to write a subquery I’ll write the select statement in the select statement I’ll pass my column names that I want to display so the column names I want are the employee name then I want the department of the employee and the salary of the employee from my table name that is employees next I’ll use a where condition where my salary should be greater than the average salary of all the employees so I’ll write salary
greater than after this I’m going to write my subquery so I’ll give select average of salary from my table name that is employees and I’ll close the bracket and give a semicolon so what it does is first it is going to find the average salary of all the employees that are present in our table once we get the average salary number we’ll use this where condition where salary is greater than the average salary number so the inside subquery let me run it first if I run this this gives you the average salary of all the employees which is $75,300 now I want to display all the employees who have a salary greater than $75,300 so let’s run our subquery there you go so there are eight employees in our table who have a salary greater than the average salary of all the employees all right next let’s see another example suppose this time you want to find the employees whose salary is greater than John’s salary so we have one employee whose name is John let me run the table once again okay if I scroll down you see we have an employee as John you see this our employee ID 116 is John and his salary is $67,000 I want to display all the employees whose salary is greater than John’s salary so basically all the employees who are earning more than $67,000 I want to print them so let’s see how to do it I’ll write select I want the employee name comma the gender of the employee I also want the department and salary from my table name that is employees I’ll write where salary is greater than I’ll start my opening bracket inside the bracket I’m going to give my inner query that is select salary from employees where the employee name is John so within single quotations I’ll give John as my employee I’ll end with a semicolon so let me first run my inner query so this will give us the salary that John has which is $67,000 now I want the employees who are earning more than $67,000 so let’s run our subquery okay so you can see 12 rows returned which means there are 12 employees in our table who
are earning more than $67,000 you see here all these employees have a salary greater than $67,000 okay now you can also use subqueries with two different tables so suppose you want to display some information that is present in two different tables you can use subqueries to do that so for this example we’ll use a database that is called classic models you can see the first database so let me use this database called classic models I’ll write use classic models now this database was actually downloaded from the internet there’s a very nice website I’ll just show you the website name so this is the website that is mysqltutorial.org you can see here they have very nice articles and blogs from where you can learn MySQL in detail so we have downloaded the database that is classic models from this website you see here they have a MySQL sample database if you click on this it will take you to the link where you can download the database so they have this download link which says download MySQL sample database and the name of the database is classic models all right so we are going to use this classic models database throughout our demo session if I expand the tables section you can see there are a lot of tables that are present inside this classic models database we have customers there’s employees offices there’s orders order details and many more so for our subquery we’ll be using two tables that is order details and products table first let me show you the content that is present inside the products table first if I run this you see here it says 110 rows returned which means there are 110 different products that are present in our table which has the product code the product name product line we have the product vendor description quantity in stock buy price MSRP the other table we are going to use is order details which has the details of all the orders let me show you the records the order details table has okay so there are a thousand records present in
this table you have the order number the product code quantity ordered price of each item you have the order line number as well okay now we want to know the product code the product name and the MSRP of the products whose price of each product is less than $100 for this scenario we are going to use two different tables and we are going to write a subquery okay so if you see here in the order details table we have a column called price each I want to display the product code the product name and the MSRP of the products which have a price of each product less than $100 so the way I’m going to do it is I’ll write select product code comma product name now one thing to remember is that this product name is actually present inside our products table and product code is present in both the tables that is products and order details here you can see this is the product code column comma MSRP which is present inside the products table again from my table that is products where I’ll write product code I’m going to use the in operator next I’ll write my inner query that is select product code from my table order details where my price of each product is less than $100 let me run this okay so you can see there are a total of 83 products in our table which have a price less than $100 you can see the price here okay now we learn another advanced concept in SQL which is known as stored procedures I’ll just give a comment saying stored procedure okay so first let’s understand what is a stored procedure a stored procedure is SQL code that you can save so that the code can be reused over and over again so if

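The two-table subquery just described can be sketched end to end. Here is a minimal, runnable stand-in using Python's sqlite3 module (the transcript uses MySQL Workbench); the column names follow the classicmodels schema described above, but the sample rows are invented:

```python
import sqlite3

# Two tables joined via an IN subquery, mirroring the transcript's query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (productCode TEXT, productName TEXT, MSRP REAL);
CREATE TABLE orderdetails (orderNumber INT, productCode TEXT, priceEach REAL);
INSERT INTO products VALUES ('S10_1678', '1969 Harley Davidson', 95.70),
                            ('S10_1949', '1952 Alpine Renault', 214.30);
INSERT INTO orderdetails VALUES (10100, 'S10_1678', 81.35),
                                (10101, 'S10_1949', 205.72);
""")
rows = conn.execute("""
    SELECT productCode, productName, MSRP
    FROM products
    WHERE productCode IN (SELECT productCode
                          FROM orderdetails
                          WHERE priceEach < 100)
""").fetchall()
print(rows)  # only the product ordered at a per-item price under $100
```

The inner SELECT produces the set of matching product codes, and the outer query keeps only products whose code appears in that set.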
If you find yourself writing the same query over and over again, save it as a stored procedure and then call it to execute it. In this example I want to create a stored procedure that returns the list of players who have scored more than six goals in a tournament. I have a database called sql_iq — one of a few databases I've already created — and it has a table called players. If I expand the tables option you can see it, with columns for the player ID, the player's name, the country the player belongs to, and the number of goals each player scored in a particular tournament.

First, let me begin by using the sql_iq database, then SELECT * FROM players to show its values: there are six players in our table, with their IDs, names, countries, and goals. Now for the stored procedure. The syntax starts with a DELIMITER statement, and I'll set the delimiter to && (two ampersands). Next I write CREATE PROCEDURE followed by the procedure name — let's name it top_players — then BEGIN, and after BEGIN the SELECT statement: I select the player's name, country, and goals from the players table WHERE goals > 6, with a semicolon. I end the procedure with END and the && delimiter, then write DELIMITER ; to restore the default (note there must be a space before the semicolon). Let's run it — there you go, the stored procedure is created successfully.

Now, the way to run a stored procedure is to use CALL with the procedure name — top_players in our case — with parentheses and a semicolon. Executing it gives an error at first: we made a mistake while creating the procedure, because the column is named goals, not goal. Recreating it then reports that the procedure already exists, so we edit the procedure name and create it again. Now CALL top_players(); returns the two players in our table who scored more than six goals — the top players of that tournament.

There are other options you can use when creating a stored procedure. One of them is the IN parameter: when you define an IN parameter inside a stored procedure, the calling program has to pass an argument to the stored procedure. I'll add the comment "stored procedure using IN parameter". For this example I'll create a procedure that fetches the top employee records based on their salaries. Our sql_iq database has a table called employee_details, with the employee's name, age, sex, date of joining, city, and salary. Using this table, I'll write DELIMITER / this time, then CREATE PROCEDURE followed by the name sp_sort_by_salary (SP for stored procedure), and inside the parentheses I declare an IN parameter: a variable v with data type INT. Then I write BEGIN, followed by my SELECT statement.
In the SELECT, I take the name, age, and salary from my table emp_details (employee details), ORDER BY salary DESC, and since I want to display only a limited number of records I use the LIMIT keyword with the variable v that I created above. I end the SELECT with a semicolon, end the stored procedure with the / delimiter, and go back to the default semicolon delimiter. Running this (after fixing a missing space) creates our second stored procedure, sp_sort_by_salary. You can also check whether it was created: there is a panel showing the stored procedures, and after a refresh it lists the three procedures created so far — sp_sort_by_salary plus the earlier top_player and top_players.

Now let's call it: CALL sp_sort_by_salary(3); — the argument is the value of v that we used in LIMIT, so we're asking for only the top three records, the employees with the three highest salaries. Running it shows that Ammy, Sara, and Jimmy are the top three earners. So that's how you use the IN parameter in a stored procedure: we created a variable, used it in our SELECT statement, and passed in its value when calling the procedure.

Instead of a SELECT statement, a stored procedure can also contain other statements — say, UPDATE. I'll create a stored procedure to update the salary of a particular employee, and in this example we'll use the IN parameter twice. I write my delimiter first, which is again /, then CREATE PROCEDURE with the name update_salary.
Inside the parentheses of update_salary I write IN temp_name — a temporary name variable of type VARCHAR(20) — and then a second IN parameter, new_salary, with data type FLOAT. I write BEGIN and then the UPDATE statement: UPDATE employee_details SET salary = new_salary WHERE name = temp_name. That is my whole update command, and I end with the delimiter. Let's run it — the stored procedure is created, and refreshing the panel shows update_salary.

First let me display the records in the employee_details table: six rows of information. Say we want to update the salary of the employee Mary from 70,000 to 80,000. I'll call the stored procedure, and this time pass in two arguments: the employee name first, then, after a comma, the new salary — CALL update_salary('Mary', 80000); — with a semicolon, and run it. It says one row affected. Checking the table again, the record for Mary now shows the salary successfully updated to $80,000.

Moving ahead, we'll learn to create a stored procedure using the OUT parameter; I'll add the comment "stored procedure using OUT parameter". Suppose we want to get the count of all female employees. We'll create total_emps (total employees) as an output parameter with data type INT, and the count of the female employees is assigned to that output variable using the INTO keyword.
Let me show you how to write it. First I set my delimiter to /, then write CREATE PROCEDURE sp_count_employees, and inside the parentheses I give my OUT parameter: the variable total_emps, with data type INT. Next I write BEGIN, followed by the SELECT statement: I take the COUNT of the employees and put the output INTO my new variable total_emps, FROM my table emp_details WHERE sex = 'F', which means female, ending with a semicolon. I end with the / delimiter and change back to the default semicolon delimiter.

Let me recap what's happening here: I'm creating a new stored procedure, sp_count_employees, which counts the total number of female employees present in emp_details. I've used the OUT parameter to create the variable total_emps of type INT; in the SELECT statement I count the employee names and store the result in total_emps via the INTO keyword, with a WHERE condition restricting the count to rows where the sex is female. Running this creates the stored procedure, and a refresh shows sp_count_employees in the panel. To call it, I write CALL sp_count_employees(@f_emp); passing in the session variable @f_emp, followed by SELECT @f_emp AS female_employees; (AS gives it an alias). Running these one by one — first the CALL, then the SELECT — shows that our table has three female employees. With this understanding, let's move on to the next topic in this tutorial on advanced SQL.
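Pulling the dictated pieces together, the stored procedures from this section look like this in MySQL syntax — a consolidated sketch, assuming the table and column names mentioned above (players(name, country, goals) and employee_details(name, age, sex, city, salary)):

```sql
-- Basic stored procedure
DELIMITER &&
CREATE PROCEDURE top_players()
BEGIN
    SELECT name, country, goals FROM players WHERE goals > 6;
END &&
DELIMITER ;

DELIMITER /
-- IN parameter: caller supplies how many rows to return
CREATE PROCEDURE sp_sort_by_salary(IN v INT)
BEGIN
    SELECT name, age, salary FROM employee_details
    ORDER BY salary DESC LIMIT v;
END /

-- Two IN parameters driving an UPDATE
CREATE PROCEDURE update_salary(IN temp_name VARCHAR(20), IN new_salary FLOAT)
BEGIN
    UPDATE employee_details SET salary = new_salary WHERE name = temp_name;
END /

-- OUT parameter: result is written into the caller's variable via INTO
CREATE PROCEDURE sp_count_employees(OUT total_emps INT)
BEGIN
    SELECT COUNT(name) INTO total_emps FROM employee_details WHERE sex = 'F';
END /
DELIMITER ;

CALL top_players();
CALL sp_sort_by_salary(3);
CALL update_salary('Mary', 80000);
CALL sp_count_employees(@f_emp);
SELECT @f_emp AS female_employees;
```

This is MySQL-only syntax (stored procedures and session variables are not portable to, say, SQLite), which is why the DELIMITER dance is needed: it keeps the client from ending the procedure body at the first internal semicolon.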
Now we'll learn about triggers in SQL; I'll add the comment "triggers in SQL". First, let's understand what a trigger is: a trigger is a special type of stored procedure that runs automatically when an event occurs in the database server. There are mainly three kinds of triggers in SQL: data-manipulation triggers, data-definition triggers, and logon triggers. In this example we'll learn how to use a BEFORE INSERT trigger. We'll create a simple student table holding the student's roll number, age, name, and marks; before inserting a record into the table, we'll check whether the marks are less than zero, and if they are, our trigger will automatically set the marks to a placeholder value — say, 50.

Let's go ahead and create the table: CREATE TABLE student, with the roll number of data type INT, the student's age, again INT, the name as the third column with type VARCHAR(30), and finally the marks as a FLOAT. With the table created, I write my trigger command. Like our stored procedures, it starts with a DELIMITER statement; then CREATE TRIGGER followed by the trigger name, marks_verify. Since I'm using a BEFORE INSERT trigger, I write BEFORE INSERT ON my table student, then FOR EACH ROW: IF NEW.marks < 0 THEN SET NEW.marks = 50; END IF; and I close the delimiter. The condition says: before inserting, if any student's marks are below zero, assign that student the value 50 — because marks are never less than zero in any exam.

Running it says the trigger already exists, so I rename it to marks_verify_std and run again. That gives another error, because the column name in the trigger didn't match the column in the table, so we correct the column name and run it once more — now the trigger is created. Let me insert a few records: INSERT INTO student VALUES (501, 10, 'Ruth', 75.0), then a second record (502, 12, 'Mic', -20.5), where I'm purposely giving a negative value; a third record (503, 13, 'Dave', 90); and a final record (504, 10, 'Jacobs', -12.5), again purposely negative. Four rows are inserted. Now run SELECT * FROM student and see the difference: we originally inserted -20.5 for roll number 502 and -12.5 for Jacobs (504), and our trigger automatically converted the negative marks to 50, because that's what we set when creating it. This is how a trigger works.

You can also drop (delete) a trigger: just write DROP TRIGGER followed by the trigger name — in our case marks_verify_std — and running it deletes the trigger; I'll leave that line as a comment. Okay, now moving on.
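The trigger above can be exercised in a quick, runnable sketch. sqlite3 is used as a stand-in here, and note one assumption-driven difference: SQLite does not allow assigning to NEW.marks inside a BEFORE INSERT trigger, so this sketch uses AFTER INSERT plus an UPDATE to get the same effect (in MySQL you would write BEFORE INSERT ... SET NEW.marks = 50 as dictated in the transcript):

```python
import sqlite3

# Stand-in for the MySQL BEFORE INSERT trigger: negative marks become 50.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (roll INT, age INT, name VARCHAR(30), marks FLOAT);

CREATE TRIGGER marks_verify_std AFTER INSERT ON student
WHEN NEW.marks < 0
BEGIN
    UPDATE student SET marks = 50 WHERE rowid = NEW.rowid;
END;

INSERT INTO student VALUES (501, 10, 'Ruth', 75.0),
                           (502, 12, 'Mic', -20.5),
                           (503, 13, 'Dave', 90.0),
                           (504, 10, 'Jacobs', -12.5);
""")
rows = conn.execute("SELECT name, marks FROM student ORDER BY roll").fetchall()
print(rows)  # the two negative marks have been replaced with 50
```

The WHEN clause plays the role of the IF inside the MySQL trigger body: the trigger fires for every inserted row but only acts when the new row's marks are negative.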
Next we come to another crucial concept in SQL, one that is very widely used: views. Views are virtual tables that do not store any data of their own; they display data stored in other tables, and they are created from one or more tables. I'll add the comment "views in SQL". To learn views I'm going to use a table inside the classicmodels database — the one I mentioned we downloaded from the internet. First I write USE classicmodels to switch databases. Now that we're inside classicmodels, let me show one of its tables, customers: SELECT * FROM customers (I missed the trailing s on the first try). This table holds the customer number, customer name, contact last name, contact first name, address, state, country, and other information.

Now I'll write a basic view over this table. The way to write it is CREATE VIEW followed by the view name, cust_details, then AS SELECT with a few columns from the original customers table: the customer name, the phone number, and the city — you can see that information in the table — FROM customers. Running this (after fixing an error: the table is named customers, not customer) creates the view cust_details. To display what's inside the view, I write SELECT * FROM cust_details — there you go, the customer name, phone number, and city of every customer in our table.
Now let's learn how to create views using joins — we'll join two different tables and create a view from them. For that I'll use the products table and the productlines table, both inside the classicmodels database. Before I start, let me display the records in products — these are the different products we saw earlier — and then productlines, which has the product line, a text description, and some HTML description and image columns. I'll create a view by joining these two tables and fetching specific columns present in each.

I start with CREATE VIEW product_description AS SELECT productName, quantityInStock, MSRP — these three columns live in the products table — and, from the productlines table, the textDescription of the products. So I write FROM products with the alias p, then INNER JOIN productlines AS pl ON the common column, productLine: p.productLine = pl.productLine. We've used an inner join to fetch specific columns from both tables, and our view's name is product_description. Running it creates the view; then SELECT * FROM product_description shows the product name, quantity in stock, MSRP, and textual description of each product in the table.

There are a few other operations you can perform — say you want to rename a view, giving product_description some other name; I'll add the comment "rename view".
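The join-backed view can be reproduced in a compact sketch — again with sqlite3 standing in for MySQL, and with made-up sample rows shaped like the classicmodels columns named above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (productName TEXT, quantityInStock INT,
                       MSRP REAL, productLine TEXT);
CREATE TABLE productlines (productLine TEXT, textDescription TEXT);
INSERT INTO products VALUES
    ('1969 Harley Davidson', 7933, 95.70, 'Motorcycles');
INSERT INTO productlines VALUES
    ('Motorcycles', 'Our motorcycles are state of the art.');

-- A view built with an INNER JOIN, mirroring product_description above.
CREATE VIEW product_description AS
SELECT p.productName, p.quantityInStock, p.MSRP, pl.textDescription
FROM products AS p
INNER JOIN productlines AS pl ON p.productLine = pl.productLine;
""")
rows = conn.execute("SELECT * FROM product_description").fetchall()
print(rows)

# SQLite has no SHOW FULL TABLES; its catalog table serves the same purpose.
views = [r[0] for r in
         conn.execute("SELECT name FROM sqlite_master WHERE type = 'view'")]
print(views)
conn.execute("DROP VIEW product_description")  # deleting a view
```

Querying the view behaves exactly like querying the underlying join; the view itself stores no rows.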
To rename a view, you can use the RENAME statement: RENAME TABLE product_description (the old name) TO vehicle_description — fitting, since all our products relate to vehicles of one kind or another. Running it renames the view; if I refresh and expand the panel, it now shows the cust_details view and the vehicle_description view. You can either inspect all views from this panel or use a command (comment: "display views"): SHOW FULL TABLES WHERE table_type = 'VIEW'; — this displays every view in the database. There's a small error to debug here: it should be table_type, not table_types; with that fixed, it lists our two views, cust_details and vehicle_description. You can also go ahead and delete a view using the DROP command: I'll write DROP VIEW cust_details; and after running it, the cust_details view is gone.

Now we move to the final section of this demo, where we'll learn about window functions. Window functions were incorporated into MySQL in version 8.0, and they are useful for solving analytical problems. Using the employees table inside the sql_intro database, we'll find the total combined salary of the employees for each department. First let me switch databases with USE sql_intro and display the employees table: there are 20 employees. Using this table, we'll compute the combined salary of the employees in each department.
We'll partition the table by department and print the total salary, using window functions in MySQL. I write SELECT with the employee name, the age, and the department, then SUM(salary) OVER — and I want to partition by department, so I write PARTITION BY dept — with the alias total_salary, so a new column with that name is created, FROM employees. The output looks a little different this time: the result gains a total_salary column, and every employee row shows the combined salary of that employee's department. For Finance the combined salary was $155,000; scrolling down, we see the corresponding totals for HR, and further on for IT, marketing, product, sales, and the tech team.

Next we'll explore a function called ROW_NUMBER, which gives a sequential integer to every row within its partition. Let me show you how to use it: I write SELECT ROW_NUMBER() OVER (ORDER BY salary) AS row_num, then the employee name and the salary, FROM employees, ordered by salary. Running it shows the row_num column successfully assigning a sequential number to each record, starting from 1 and going up to 20.

This ROW_NUMBER function can also be used to find duplicate values in a table. To show that, I'll first create a small table with a throwaway name, demo, with two columns.
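Both window queries described above — the per-department total and the sequential ROW_NUMBER — can be sketched like this (sqlite3 stand-in; the three employee rows are invented, not the 20-row table from the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_name TEXT, age INT, dept TEXT, salary INT);
INSERT INTO employees VALUES ('Jack', 32, 'Finance', 55000),
                             ('Anna', 28, 'Finance', 50000),
                             ('Marcus', 41, 'HR', 60000);
""")
# SUM(...) OVER (PARTITION BY ...) adds a per-group total without
# collapsing rows the way GROUP BY would.
totals = conn.execute("""
    SELECT emp_name, dept,
           SUM(salary) OVER (PARTITION BY dept) AS total_salary
    FROM employees
""").fetchall()
print(totals)

# ROW_NUMBER() hands out sequential integers in salary order.
numbered = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY salary) AS row_num, emp_name, salary
    FROM employees
    ORDER BY salary
""").fetchall()
print(numbered)
```

Note how every Finance row carries the same total_salary — the window function annotates each row rather than aggregating the group down to one row.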
The columns are student_id of type INT and student_name of type VARCHAR(20). After creating this small table, I insert a few records: INSERT INTO demo VALUES (101, 'Shane'), then a second student (102, 'Bradley'), then — deliberately — two identical records for 103 with the name 'Her', then (104, 'Nathan'), and finally two identical records for the fifth student, (105, 'Kevin'). With the records inserted, SELECT * FROM demo confirms that the rows for student IDs 103 and 105 are duplicated.

Now I'll use the ROW_NUMBER function to find those duplicates: SELECT student_id, student_name, ROW_NUMBER() OVER (PARTITION BY student_id, student_name ORDER BY student_id) AS rnum FROM demo. Running it, there is just one row numbered 1 for Shane and one for Bradley, but the second record for Her carries the number 2 — meaning Her has two records — and, scrolling down past Nathan's single row, Kevin's second record is also numbered 2, so Kevin is repeated as well.

Next we'll see another window function called RANK. The RANK function in MySQL assigns a rank over a particular column, and there are gaps in the sequence of ranked values when two or more rows share the same rank.
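The duplicate-finding query above can be run as a self-contained sketch (sqlite3 stand-in, same data as in the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demo (student_id INT, student_name VARCHAR(20));
INSERT INTO demo VALUES (101, 'Shane'), (102, 'Bradley'),
                        (103, 'Her'), (103, 'Her'),
                        (104, 'Nathan'), (105, 'Kevin'), (105, 'Kevin');
""")
rows = conn.execute("""
    SELECT student_id, student_name,
           ROW_NUMBER() OVER (PARTITION BY student_id, student_name
                              ORDER BY student_id) AS rnum
    FROM demo
""").fetchall()
# Any row with rnum > 1 is a repeat of an earlier (student_id, student_name).
duplicates = sorted((sid, name) for sid, name, rnum in rows if rnum > 1)
print(duplicates)
```

Partitioning by every column that defines "sameness" restarts the numbering for each distinct pair, so the repeats are exactly the rows numbered 2 and above.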
First, let me create a table with a throwaway name, demo1, holding a single column v_a of type INT. Then I insert a few records: 101, 102, then 103 twice — I'm doing this purposely so the output clearly shows what the RANK function does — then 104, 105, 106 (also repeated), and finally 107. With the values inserted into demo1, I write SELECT v_a, RANK() OVER (ORDER BY v_a) AS test_rank FROM demo1. Let me execute this and show how RANK works: for 101 the test_rank is 1 and for 102 it is 2, but the repeated value 103 gets rank 3 twice and rank 4 is skipped, so 104 has rank 5 and 105 has rank 6; 106, again repeated twice, takes the same rank 7 for both rows with rank 8 skipped, and the last value, 107, gets rank 9.

Now our final window function, FIRST_VALUE — another important function in MySQL. It returns the value of the specified expression with respect to the first row in the window frame. What I'm going to do is select the employee name, age, and salary, then write FIRST_VALUE, passing in the employee name, OVER (ORDER BY salary DESC), with the alias highest_salary, FROM employees.
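The gap-leaving behaviour of RANK is easy to verify in a sketch (sqlite3 stand-in, same values as demo1 above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demo1 (v_a INT);
INSERT INTO demo1 VALUES (101), (102), (103), (103),
                         (104), (105), (106), (106), (107);
""")
rows = conn.execute("""
    SELECT v_a, RANK() OVER (ORDER BY v_a) AS test_rank FROM demo1
""").fetchall()
ranks = [r for _v, r in rows]
print(ranks)  # ties share a rank and the next rank number is skipped
```

If you wanted ties to share a rank *without* gaps (1, 2, 3, 3, 4, ...), DENSE_RANK() is the variant to reach for.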
Let me run it and see how FIRST_VALUE works. In our table, Joseph was the employee with the highest salary, $115,000, so the FIRST_VALUE function populated Joseph's name down the entire column. You can also use FIRST_VALUE over a partition — say you want to display the name of the employee with the highest salary in each department. For that I write SELECT emp_name, then the department and the salary, then FIRST_VALUE with the employee name as its argument, OVER — here I use PARTITION BY department, since I want the top earner per department — ORDER BY salary DESC, again aliased as highest_salary, FROM employees. Running this shows the difference in output: we get the highest-paid employee from each department — Jack in finance, Marcus in HR, William in IT, and, scrolling down, John in marketing, Alice in product, Joseph in sales, and Angela in tech. That's how you use FIRST_VALUE with a partition.

And that brings us to the end of this demo session. Let me scroll back through what we did from the beginning. First we learned about subqueries in SQL: we wrote a simple subquery, then used the classicmodels database — downloaded from the link I showed you — to perform a subquery across two different tables. Next, we learned how to create stored procedures.
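The partitioned FIRST_VALUE query can be sketched as follows (sqlite3 stand-in; the four rows and the second HR name are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_name TEXT, dept TEXT, salary INT);
INSERT INTO employees VALUES ('Jack', 'Finance', 55000),
                             ('Anna', 'Finance', 50000),
                             ('Marcus', 'HR', 60000),
                             ('Priya', 'HR', 48000);
""")
rows = conn.execute("""
    SELECT emp_name, dept, salary,
           FIRST_VALUE(emp_name) OVER (PARTITION BY dept
                                       ORDER BY salary DESC) AS highest_paid
    FROM employees
""").fetchall()
print(rows)  # every row carries its own department's top earner
```

Because the window is ordered by salary descending within each department, the "first value" seen in every partition is the name of that department's highest-paid employee.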
We saw how to use the IN parameter as well as the OUT parameter in stored procedures. After stored procedures we learned another crucial concept, triggers — themselves a special kind of stored procedure — where we wrote a BEFORE INSERT trigger and then saw how to delete it. We also worked with views in SQL: views are basically virtual tables created from existing tables, and we built one from two tables with an inner join, then learned how to display views, rename them, and delete them. Finally, we explored a few window functions.

In this next tutorial we'll learn how to work with databases and tables using SQL with Python. For this demo we'll be using a Jupyter notebook and MySQL Workbench, writing our SQL queries in the notebook with Python-like syntax. If you don't have MySQL or Jupyter notebook installed, please go ahead and install them first. While installing MySQL Workbench you'll be asked for a username and password — once you connect, it asks for both. I've set my username to root, and the password is the one you choose during installation; we'll use that same username and password to make our connection.

Let's get started with the hands-on demonstration. First and foremost, the necessary libraries (comment: "import libraries"): import mysql.connector; then from mysql.connector import Error — there was a small error here, it should be a capital E in Error — and import pandas as pd. With the libraries imported, I'm going to create a function that will help us create a server connection, written as a user-defined function with the def keyword.
The function name is create_server_connection, and it takes three parameters: first the host name, next the username, and then the user password. After the colon, on the next line, I define a variable connection and assign it the value None. We'll use exception-handling techniques to connect to our MySQL server: the try block lets you test a block of code for errors, and the except block handles those errors. So I write try with a colon, and inside it I reassign the connection variable to mysql.connector.connect(...) — this method sets up a connection, establishing a session with the MySQL server (if no arguments are passed, it uses the already-configured default values). Here we pass three arguments: host=host_name, then user=user_name, and next the password, passwd=user_password. Then I use a print statement with the message "MySQL Database connection successful". After this comes the except block: except Error as err: — and inside it I print the error using f-string formatting, print(f"Error: '{err}'"). Finally, the function returns connection.

Now a comment: we need to supply our MySQL terminal password — the one assigned while installing MySQL Workbench. I write pw and assign it my password. Then I give my database name: I write db equal to the database I want to create.
That database is going to be mysql_python. Now I make the connection: connection = create_server_connection(...), passing my user-defined function the host name localhost, my username root, and my password pw. There were two small errors to fix — a stray double quotation mark to remove, and the username needed to be root — and then running it prints "MySQL Database connection successful".

Next we'll create the mysql_python database itself; I'll add the comment "create mysql_python database". Again I use a user-defined function with the def keyword: create_database, taking the parameters connection and query. On the next line I write cursor = connection.cursor() — the cursor class of MySQL Connector/Python is used to execute statements that communicate with the MySQL database; it instantiates objects that can run operations such as SQL statements. Then the try and except blocks again: in try, I write cursor.execute(query), and then a print statement with the message "Database created successfully"; in the except block, except Error as err:, I print the error with print(f"Error: '{err}'"). Next, in a variable named create_database_query, I'm going to write my SQL query to create the database.
the database so I’ll write create database and followed by that I’ll give my database name which is going to be MySQL python okay after this I’ll call my function which is create database and I’ll pass in the parameters the first one is connection and next the query qu is create _ database _ query let me just copy it and I’m going to paste it here all right so what I’m doing here is I am creating a new function that is to create a new database with the name MySQL undor python which you can see it here now this function takes in two parameters connection and query I’m using the connection. cursor function which is often used to execute SQL statements using Python language and then I have created my try and exer blocks so this Tri block statements will try to create my new database which is MySQL python in case it fails to create the new database the exer block will work so here I’m writing my SQL query to create a new database which is create database followed by the database name and I’m assigning it to a variable which is create data datase query and then I’m calling my function create database and passing in the two parameters connection and the query all right so let’s just run it all right you can see here it has created my database successfully now you can verify this by checking the MySQL workbench or the MySQL shell you can see on the MySQL workbench here on the left panel under schemas there is a database called MySQL python let me just expand it now we haven’t created any table so it’s not showing it now the next step we are going to connect to this database so let’s go ahead and connect to our database that we have just created I’ll write the comment as connect to database now to connect to a database I’m again going to create a userdefined function using the DF keyword I’ll write create underscore DB which is for database _ connection and the parameters it will take is the host name followed by the username then we have the user password and finally we 
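As an aside, the connect-and-report pattern described here can be sketched compactly. This sketch uses sqlite3's in-memory database as a stand-in (an assumption, so it runs without a MySQL server); with MySQL installed you would swap the connect call for mysql.connector.connect(host=..., user=..., passwd=...) and catch mysql.connector.Error instead:

```python
import sqlite3

def create_connection(db_path):
    """Open a database connection, reporting failure instead of raising.

    NOTE: sqlite3 stands in for mysql.connector here so the sketch runs
    without a MySQL server; the try/except shape is the tutorial's pattern.
    """
    connection = None
    try:
        connection = sqlite3.connect(db_path)
        print("Database connection successful")
    except sqlite3.Error as err:
        print(f"Error: '{err}'")
    return connection

conn = create_connection(":memory:")  # ':memory:' = throwaway in-process database
```

Returning the connection (rather than printing and discarding it) is what lets every later cell reuse one open handle.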
have the database name. I'll give a colon, and on the next line I'll create my variable, connection, and assign it the value None. After this I'll use my exception-handling technique: in the try block I reassign the connection variable using the connector method, mysql.connector.connect(). This method takes the parameters: first the host name, host=host_name, a comma; next the username, user=user_name, another comma; next the user password, passwd=user_password, another comma; and this time also the database name, database=db_name. Then a print statement with the message "MySQL Database connection successful." Finally the except block: except Error as err, a colon, then print(f"Error: '{err}'"). The function returns the connection value. Let's run it — there you go, it has run successfully, so we can connect to our database.

Now it's time for us to execute SQL queries. I'll give another comment, "execute SQL queries." To execute our queries I'll use another user-defined function, execute_query, passing in the parameters connection and query, followed by a colon. I'll write cursor = connection.cursor(), which is used to run SQL statements over the established connection. Next the try/except block: try, then cursor.execute(query), and then connection.commit(), another method, which saves the changes. Now a print statement — let's say the message would
be "Query was successful." Then we write the except block: if the try block throws an error, print it with print(f"Error: '{err}'"). Let's run it. So we have successfully created the functions we need: one to create a database, one to establish a connection, and one to execute our queries.

Now it's time to create our first table inside the mysql_python database. First we assign our SQL command to a Python variable, using triple quotes to create a multi-line string. I'll name the variable create_orders_table — it is always recommended to use relevant variable names for readability. Inside the triple quotes I'll write my CREATE TABLE command: CREATE TABLE orders, and then the column definitions. The first column is order_id, of type INT, and I'll make it the primary key, followed by a comma. The second column is customer_name, of type varying character, so VARCHAR(30), and NOT NULL. The third column is product_name, VARCHAR(20), also NOT NULL. The fourth column holds the date on which the product was ordered: date_ordered, of type DATE. Next I'll create a quantity column, of type INT, to track the number of units ordered. The next column is unit_price, which holds the price of each unit of the product; it can be of type FLOAT. Finally, the customer's phone_number, kept as VARCHAR(20). We give a semicolon and close the triple quotes — that is how the syntax looks.

To run this, we first call our create_db_connection function. I'll give a comment, "connect to the database," and write connection = create_db_connection with my parameters: localhost, root, my password, and then my database name, mysql_python. Finally we execute the query using the execute_query function we created earlier, which takes two parameters: connection and create_orders_table. Let's run it. There's an error — we have put four quotes here where there should be three. Run it again — another error; it says the name cursor is not defined. Scrolling up to the execute_query function, I had misspelled cursor: the r was missing. Let's rerun that cell and run this again. There you go: "MySQL Database connection successful," and our query was successful too.

If you want to recheck whether the orders table was created, you can check it in MySQL Workbench: under the mysql_python database there is Tables; right-click, select Refresh All, click the arrow, and you can see a table called orders. Now you can
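The CREATE TABLE step can be sketched end to end. Here sqlite3 is an assumed stand-in (no MySQL server needed); it accepts the same VARCHAR/FLOAT type names, mapping them to its own text/real column affinities:

```python
import sqlite3

# The tutorial's multi-line CREATE TABLE string, assigned to a readable variable.
create_orders_table = """
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(30) NOT NULL,
    product_name VARCHAR(20) NOT NULL,
    date_ordered DATE,
    quantity INT,
    unit_price FLOAT,
    phone_number VARCHAR(20)
);
"""

connection = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
connection.execute(create_orders_table)
connection.commit()

# Confirm the table exists by asking the catalog -- the scripted equivalent
# of refreshing the Tables node in Workbench.
tables = connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)
```

The catalog query at the end is sqlite-specific; in MySQL the equivalent check would be SHOW TABLES.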
check the columns as well: order_id, customer_name, product_name, date_ordered, quantity, unit_price, and phone_number.

Now it's time to insert a few records into the orders table. I'll give a comment, "insert data," and start with a variable name — let's say data_orders — assigned a triple-quoted string. Inside it I'll write my INSERT INTO command: INSERT INTO orders VALUES, and then the rows. First I'll give 101, which is the order ID; then the customer's name, let's say Steve; the product he ordered, a laptop; then the date the item was ordered — let's say 2018, I'll choose 06 as the month and 12 as the day; another comma; this time the quantity, which is 2; let's say the price of each laptop was $800; and a random phone number. Similarly I'll insert five more records for different customers and the items they purchased. I have the rest of the five records on my notepad, so let me copy and paste them into the cell — this will save us some time. Let me recheck everything and add a missing comma. So we have six customers in our table, with order IDs from 101 to 106 — Steve, Joe, Stacy, Nancy, Maria, and Danny — the different items they purchased (laptop, books, trousers, t-shirts, headphones, and a smart TV), the dates on which they ordered, the quantities, the unit prices, and some random phone numbers.

Let's create the connection now: connection = create_db_connection with the same parameters, copied from the top — localhost as the host
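The two helper functions at the heart of this workflow — execute_query for writes and read_query for reads via fetchall — can be sketched together. sqlite3 is again an assumed stand-in for the MySQL connection so the sketch runs anywhere:

```python
import sqlite3

def execute_query(connection, query):
    """Run a data-modifying statement, committing on success (tutorial pattern)."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        connection.commit()
        print("Query was successful")
    except sqlite3.Error as err:
        print(f"Error: '{err}'")

def read_query(connection, query):
    """Run a SELECT and return every row via fetchall (tutorial pattern)."""
    cursor = connection.cursor()
    result = None
    try:
        cursor.execute(query)
        result = cursor.fetchall()
        return result
    except sqlite3.Error as err:
        print(f"Error: '{err}'")

conn = sqlite3.connect(":memory:")  # stand-in, no MySQL server needed
execute_query(conn, "CREATE TABLE orders(order_id INT, customer_name TEXT)")
execute_query(conn, "INSERT INTO orders VALUES (101, 'Steve')")
rows = read_query(conn, "SELECT * FROM orders")
print(rows)  # each row comes back as a tuple
```

Note the asymmetry: writes need connection.commit() to persist, while reads need cursor.fetchall() to pull the rows back into Python.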
quotes I'll write SELECT * FROM orders; and close the triple quotes. Now we establish the connection — let me go to the top and copy the line that connects to our database and paste it here. Then we create a variable called results to store the output of this query, assigning it to our read_query function with two parameters: the connection and q1. To display the rows I'll use a for loop: for result in results: print(result). Now let's run this query. There you go — we have successfully printed all the rows in our orders table; you can see we have six records in total.

Now we'll explore a few more queries, so let me copy this cell and edit the same query each time. Say you want to display individual columns from the table rather than all of them. I'll create the variable q2, and instead of * I'll display only the customer name and, let's say, the phone numbers, so I'll write customer_name, phone_number. Everything else remains the same, except q1 becomes q2. Running the cell, we now display only two columns: the customer names and their respective phone numbers.

Now let's see how to use a built-in function: from the order date, we'll display only the different years present in it. To do that I'll use the YEAR function. I'll edit this query, change q1 to q3, and write SELECT YEAR(date_ordered) FROM orders, changing the call to q3 as well — q1, q2, q3 are
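The YEAR, DISTINCT, and WHERE steps can be sketched on a small sample. One labeled assumption: sqlite3 (the stand-in used so this runs without MySQL) has no YEAR() function, so strftime('%Y', ...) plays the same role here; the DISTINCT keyword and the date comparison in WHERE are unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(order_id INT, date_ordered DATE)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(101, "2018-06-12"), (102, "2019-02-10"), (103, "2018-12-19")])

# strftime('%Y', ...) is the sqlite stand-in for MySQL's YEAR(date_ordered);
# DISTINCT then collapses repeats to the unique years.
years = conn.execute(
    "SELECT DISTINCT strftime('%Y', date_ordered) FROM orders").fetchall()
print(sorted(years))

# WHERE with a date literal filters rows, exactly as in the demo.
before = conn.execute(
    "SELECT order_id FROM orders WHERE date_ordered < '2018-12-31'").fetchall()
print(sorted(before))
```

Flipping < to > in the WHERE clause returns the complementary set of orders, mirroring the fifth and sixth queries in the walkthrough.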
columns from the table, so I'll write SELECT * FROM orders ORDER BY unit_price; and run the query. If you look at the result and mark the unit_price column, it has been ordered in ascending order of unit price: it starts with the lowest price and ends with the highest. If you want descending order, use the keyword DESC — this ensures the most expensive products appear at the top and the least expensive at the bottom.

Next, let's see how to create a DataFrame from the table. As you know, using Jupyter Notebook and pandas you can create DataFrames and work with them very easily, and we can do that with this table too. First let me create an empty list, from_db = [] — we're going to build a list of lists and then create a pandas DataFrame. Next the for loop: for result in results, I convert each result tuple into a list with list(result) and append it to the empty list with from_db.append(...). Next we need the column names that will be part of our DataFrame: columns = a list containing "order_id", "customer_name", "product_name", "date_ordered", "quantity", "unit_price", and "phone_number". We assign this to a DataFrame with df = pd.DataFrame(from_db, columns=columns), which converts the list into a DataFrame, and finally display df. So: I create an empty list, loop over the results appending each row, create my column list, and convert it with pd.DataFrame. If I run this — ah, this should be append, I had misspelled it — and now our DataFrame is ready: the index column starts from zero, and then we have the different column names.

Now let's see how to use the UPDATE command. Suppose you want to change the unit price of one of the orders. I'll first create my variable, let's say update, with triple quotes, and write my update command: UPDATE followed by the table name, orders, then SET unit_price = 45 WHERE order_id = 103 — say I want to change the unit price of the trousers from $50 to $45, so this query updates the row with order ID 103. I'll close the triple quotes and paste the connection lines again, delete the three read-query lines, and instead call execute_query with the usual two parameters: connection and update. Let's run it — it says "MySQL Database connection successful," "Query was successful." To recheck, let me go to the top and copy our first query, q1; I'll paste it here, rename it q8, and change it to SELECT * FROM orders WHERE order_id = 103. Let's see
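The fetchall-to-DataFrame reshaping just described can be sketched without a database at all, since fetchall simply returns a list of tuples. To keep the sketch dependency-free, the final pandas call is shown as a comment (an assumption: pd is pandas imported as in the tutorial); the two rows here are sample stand-ins:

```python
# What read_query/fetchall hands back: a list of tuples, one per row.
results = [
    (101, "Steve", "Laptop"),
    (102, "Joe", "Books"),
]
columns = ["order_id", "customer_name", "product_name"]

from_db = []
for result in results:
    from_db.append(list(result))  # tuple -> list, as in the tutorial's loop

# With pandas available, the last step is:
#   df = pd.DataFrame(from_db, columns=columns)
table = [columns] + from_db  # header row + data rows, the same two ingredients
print(table)
```

The key point is that pd.DataFrame needs exactly those two pieces — the row data and the column labels — which is why the loop and the columns list are built separately.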
the unit price of order 103 — you can see it has been updated from $50 to $45.

Now the last command: how to delete a record from the table. I'll write "delete command" as my comment. To delete, I'll first give my variable name, delete_order, open triple quotes, and write my delete query: DELETE FROM my table name, orders, then the WHERE clause — let's say I want to delete order ID 105. Let me scroll to the top and explain: we want to delete order ID 105, which belonged to the customer Maria, who had ordered headphones — we want to remove that record completely. With my delete query ready, I'll go to the top, copy the connection command along with the execute_query command, paste it here, and make one change: instead of update, we write delete_order. Everything looks good; let's run it — the query was successful. To verify, I'll copy the select cell again, paste it here, rename it q9, write SELECT * FROM orders, and change the call to q9. If I run this, you can mark that order ID 105 was deleted and no longer appears in the table.

So this brings us to the end of the demo session on SQL with Python. Let me scroll through what we did: first we imported the important libraries — mysql.connector, its Error class, and pandas as pd. We learned how to create a server connection to the MySQL database, created a new database called mysql_python, and then connected to it. We created a function to execute our queries, saw how to write a CREATE TABLE command, and then inserted a few
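The UPDATE and DELETE steps, together with their verification queries, can be sketched in one self-contained run (sqlite3 again standing in for the MySQL connection, by assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders(order_id INT, product_name TEXT, unit_price FLOAT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(103, "Trousers", 50), (105, "Headphones", 100)])

# UPDATE ... SET ... WHERE changes one record in place ($50 -> $45).
conn.execute("UPDATE orders SET unit_price = 45 WHERE order_id = 103")
# DELETE FROM ... WHERE removes one record entirely (order 105).
conn.execute("DELETE FROM orders WHERE order_id = 105")
conn.commit()

# Re-reading the table plays the role of the q8/q9 verification queries.
rows = conn.execute("SELECT order_id, unit_price FROM orders").fetchall()
print(rows)
```

Both statements share the same shape — a mutation plus a WHERE clause — and omitting the WHERE clause would apply the change to every row, which is why the demo always pins it to a single order_id.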
records into our orders table. We created a read_query function to run queries and display the results, and then explored the different SQL commands one by one: the SELECT query, selecting individual columns from the table, the built-in YEAR function, the DISTINCT keyword, the WHERE clause for filtering the table on specific conditions, ordering results by a particular column, converting the table into a DataFrame with the pd.DataFrame function, and finally the UPDATE and DELETE commands.

PostgreSQL is a very popular and widely used database in industry. In this tutorial we will learn PostgreSQL in detail with an extensive demo session. In today's video we will learn what PostgreSQL is, look at its history, learn its features, and then perform PostgreSQL commands in the SQL Shell (psql) and pgAdmin.

Let's begin by understanding what PostgreSQL is. PostgreSQL is an open-source object-relational database management system. It stores data in rows, with columns holding the different data attributes. According to the DB-Engines ranking, PostgreSQL is currently ranked fourth in popularity among hundreds of databases worldwide. It allows you to store, process, and retrieve data safely, and it was developed by a worldwide team of volunteers.

Now let's look at the history of PostgreSQL. From 1977 onwards, the Ingres project was developed at the University of California, Berkeley. In 1986 the Postgres project was led by Professor Michael Stonebraker; in 1987 the first demo version was released; and in 1994 a SQL interpreter was added to Postgres. The first PostgreSQL release, known as version 6.0, came on January 29, 1997, and since then PostgreSQL has continued to be developed by the PostgreSQL Global Development Group — a diverse group of companies and many thousands of individual contributors.

Now some important features of PostgreSQL. It is the world's most advanced open-source database and is free to download. It is compatible with multiple operating systems, such as Windows, Linux, and macOS. It is highly secure, robust, and reliable. PostgreSQL supports multiple programming interfaces, such as C, C++, Java, and Python. It is compatible with various data types: it can work with primitives like integer, numeric, string, and boolean; it supports structured data types such as date and time, array, and range; it can also work with documents such as JSON and XML; and finally, PostgreSQL supports multiversion concurrency control, or MVCC.

With this theory covered, let's look at the PostgreSQL commands we will cover in the demo. We will start with the basic commands SELECT, UPDATE, and DELETE; learn how to filter data using the WHERE clause and the HAVING clause; look at grouping data with the GROUP BY clause and ordering results with the ORDER BY clause; learn how to deal with NULL values; get an idea of the LIKE operator and logical operators such as AND and OR; explore some of the popular built-in mathematical and string functions; and finally see some advanced concepts in PostgreSQL — writing CASE statements, subqueries, and user-defined functions.

So let's head over to the demo. First we'll connect to PostgreSQL using the psql shell. Under "Type here to search" I'll search for psql — you can see the SQL Shell — and click Open; let me maximize it. For Server I'll just hit Enter; for Database, Enter; the port number 5432 is already filled in, so I hit Enter; the username is already given; and now it asks for the password, so I'll enter mine so I can connect to my PostgreSQL database. It has given us a warning, but we have successfully connected to
PostgreSQL. Now, to check that everything is fine, you can run a simple command to see which version of PostgreSQL is loaded: SELECT version(); — with the parentheses and a semicolon — and hit Enter. You can see the version: PostgreSQL 13.2.

Now let me show you the command that displays all existing databases: if I type \l and hit Enter, it gives me the list of databases that are already there — postgres, something called template0 and template1, and a test database as well. For our demo I'll create a new database: I'll write CREATE DATABASE, a space, and give my database name as sql_demo, then a semicolon, and hit Enter. You see the message CREATE DATABASE, so we have successfully created our sql_demo database. To connect to it, use \c sql_demo — there you go, it says "You are now connected to database sql_demo." Here we can now create tables and perform INSERT, SELECT, UPDATE, DELETE, ALTER, and much more.

Now I'll show you how to connect to PostgreSQL using pgAdmin. When you install PostgreSQL you get the SQL Shell, and along with it you also have pgAdmin. I'll search for "pg" — you can see it has prompted pgAdmin — and click Open. It opens in a web browser (Chrome here), and this is how the pgAdmin interface looks. It is quite basic: at the top you have File, Object, Tools, and the Help section; then Dashboard, Properties, SQL, Statistics, Dependencies, and Dependents; and on the left panel you have Servers. Let me expand it so it connects to the databases. If I go back — when I ran \l to display the databases, it had shown me postgres and test — you can see here the postgres database and the test database. We also created one more database, sql_demo, so let me show you how to work with pgAdmin and the Query Tool.

I'll right-click sql_demo and select Query Tool, and show you how to run a few commands there. Say you want to see which version of PostgreSQL you are using: use the same command as in the psql shell, SELECT version(); — select it, and here you can see the Execute button; hitting Execute (or pressing F5) runs the query. The output appears at the bottom: PostgreSQL 13.2, compiled by Visual C++, a 64-bit build.

Now let me show you a few basic operations with PostgreSQL commands. I'll write SELECT 5 * 3; select it, and hit F5 — this runs the query and returns the product of 5 and 3, which is 15. Similarly, let's edit this: SELECT 5 + 3 + 6; gives me the sum, which is 14. You can do the same task in the shell as well — let me show you: SELECT 7 * 10; — you know the result should be 70, and if I hit Enter it gives me 70. (That ?column? header in the output — we'll deal with it later.) Let me go back to pgAdmin and do one more operation: SELECT 5 * (3 + 4); — what SQL does is first evaluate the expression inside the brackets, 3 + 4, which is 7, and then multiply 7 by 5. Selecting it and hitting Execute, you can see 7 * 5 is 35.

Now we'll go back to our shell, and here I'll show you how to create a table — we are going to create a table called movies on
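These bare arithmetic SELECTs are easy to try programmatically too. The sketch below uses sqlite3 (an assumed stand-in so it runs without a PostgreSQL server); like PostgreSQL, it evaluates an expression in a SELECT with no table at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the psql session

# Each SELECT returns a single one-column row with the computed value.
print(conn.execute("SELECT 5 * 3").fetchone())        # (15,)
print(conn.execute("SELECT 5 + 3 + 6").fetchone())    # (14,)
# Parentheses are evaluated first, exactly as in the demo: 3 + 4 = 7, then * 5.
print(conn.execute("SELECT 5 * (3 + 4)").fetchone())  # (35,)
```

In psql, an unnamed expression like this gets the placeholder column header `?column?`; adding an alias (SELECT 5 * 3 AS product) replaces it with a proper name.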
the psql shell, and then enter some data into it. Let me scroll down a bit. My CREATE command goes like this: CREATE TABLE followed by the table name, movies, and then a few columns. First I want the movie ID; after the column name we give the data type, so movie_id will be INT — integer is one of the data types provided by PostgreSQL. The second column of the table is the name of the movie, movie_name — all column names should follow SQL standards, so there shouldn't be any spaces; I use underscores for readability. movie_name is of type varying character, so VARCHAR(40), meaning it can hold at most 40 characters. The third column holds the genre of the movie: movie_genre, again of type VARCHAR, with a size of, let's say, 30. The final column holds the IMDb ratings: imdb_ratings, of type REAL, since it can have floating-point values. I close the bracket, give a semicolon, and hit Enter — there you go, we have successfully created the movies table.

Now let me go back to pgAdmin. Here I have my database, sql_demo; I'll right-click it and click Refresh. Going to Schemas and scrolling down, under it there is something called Tables; expanding that, you can see we now have a movies table in the sql_demo database, and you can check the columns we just added: movie_id, movie_name, movie_genre, and imdb_ratings.

There is another way to create a table: the previous time we used the SQL Shell, and now I'll show you how to do it with pgAdmin. Under Tables, I'll right-click, and I have the option to create a table, so I select Table. It asks for the name of the table; this time we are going to create a table called students, so I'll write students. I'll leave the other settings at their defaults and go to the Columns tab, where you can create as many columns as you want — on the right there is a plus sign; I'll select it to add a new row. My first column will be the student roll number, student_roll_number (again, column names per SQL standards), with the data type integer. You can also set constraints here, such as NOT NULL, so the roll-number column will not hold any null values, and I'll also check Primary Key, which means all roll-number values will be unique. To add another column, click the plus sign again: this time the student name, student_name, of type character varying — you can specify the length as well, say 40. I'll click the plus sign once more for my final column, gender, which I'll keep as type character. Now click Save, and that successfully creates the students table. On the left panel, where earlier we had only the movies table, we now have two tables. Expanding students under Columns shows the three columns — student_roll_number, student_name, and gender — and you can check the constraints too: it shows one primary key on student_roll_number.

Now let me run a SELECT statement to show the columns
that we have in the movies table. I'll write SELECT * FROM movies; and execute it. At the bottom you can see the columns: movie_id, movie_name, movie_genre, and imdb_ratings. The next command to learn is how to delete a table. One way is the SQL command DROP TABLE followed by the table name — to delete students you write DROP TABLE students;, select it, and run it, and the table is removed from the database. The other way is to right-click the table name and choose Delete/Drop; you get a prompt, "Are you sure you want to drop table students?", and after selecting Yes the students table is gone. Now let's perform a few operations and learn some more PostgreSQL commands. To do that I'm going to insert a few records into the movies table using the INSERT command. I have the insert query written in Notepad, so I'll copy it and paste it into the query editor. You can see I've written INSERT INTO, the table name movies, the column list (movie_id, movie_name, movie_genre, imdb_ratings), and then the records, or rows. The first record has movie ID 101 and a very popular movie, Vertigo, with the genre Mystery, Romance (it is also a romance movie) and its current IMDB rating of 8.3. Similarly we have The Shawshank Redemption, 12 Angry Men, The Dark Knight, The Matrix, Se7en, Interstellar, and The Lion King — eight records in total to insert into the movies table. Let me select the statement and hit Execute: it reports that eight records were inserted successfully.
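The multi-row INSERT would look roughly like this. Only IDs 101, 103, and 108 (and a few of the ratings) are stated in the walkthrough, so treat the other IDs, genres, and ratings below as placeholders:

```sql
-- Insert the eight movies; IDs/genres/ratings not given in the
-- walkthrough are assumptions for illustration
INSERT INTO movies (movie_id, movie_name, movie_genre, imdb_ratings) VALUES
    (101, 'Vertigo',                  'Mystery, Romance', 8.3),
    (102, 'The Shawshank Redemption', 'Drama',            9.3),
    (103, '12 Angry Men',             'Drama',            9.0),
    (104, 'The Dark Knight',          'Action',           9.0),
    (105, 'The Matrix',               'Sci-Fi',           8.7),
    (106, 'Se7en',                    'Crime',            8.6),
    (107, 'Interstellar',             'Sci-Fi',           8.7),
    (108, 'The Lion King',            'Animation',        8.5);
```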
Now if I run SELECT * FROM movies again, you can see the records present in the table: at the bottom it says eight rows, and if I scroll down you can see all eight records of information in the movies table. If you want to describe a table, you can go back to the psql shell and type backslash d followed by the table name — \d movies — and this will describe the table: the column names, their data types, whether nulls are allowed, and any constraints such as defaults, primary keys, or foreign keys. Let me go back to pgAdmin.

First and foremost, let me show how to update records in a table. Suppose you have an existing table and by mistake you entered some wrong values that you want to correct later — you can use the UPDATE query for that. I'm going to update the movies table and set the genre of movie ID 103, which is 12 Angry Men, from Drama to Drama, Crime. In our current table we have only Drama as the genre for 12 Angry Men; we are going to update the movie_genre column to Drama, Crime. Here is how: I'll write UPDATE followed by the table name, movies, then on the next line SET, then the column — movie_genre = 'Drama, Crime' (earlier it was only Drama) — and then the condition using the WHERE clause, which we'll learn properly in a bit: WHERE movie_id = 103. Here movie_id is the unique identifier, so it will first locate movie 103 and then change its genre to Drama, Crime. If I run this UPDATE statement, you can see one record was updated successfully; and if I run the SELECT statement again and scroll down — there you go — movie ID 103, 12 Angry Men, now has the genre Drama, Crime.
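Put together, the update looks like this:

```sql
-- Correct the genre for 12 Angry Men; movie_id is the unique identifier
UPDATE movies
SET movie_genre = 'Drama, Crime'
WHERE movie_id = 103;
```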
Next, let me show how to delete records from a table, using the DELETE command. You write DELETE FROM, the table name movies, and then WHERE — say I want to delete movie ID 108, which is The Lion King, so I write WHERE movie_id = 108. That is one way to target this particular movie; alternatively you could write WHERE movie_name = 'The Lion King'. Let me select it and hit Execute. Now if I run the SELECT query again, it returns seven rows, and you cannot find the movie with movie ID 108, The Lion King — we have deleted it. Next we are going to learn about the WHERE clause in PostgreSQL, using the same movies table. Say we want to filter only those records where the movie's IMDB rating is greater than 8.7. From our updated table, that should display 12 Angry Men at 9.0, The Dark Knight, again at 9.0, and The Shawshank Redemption at 9.3; the rest of the movies have IMDB ratings below 8.7, so we are not going to display those. Here is how to write the WHERE clause: SELECT * FROM movies WHERE imdb_ratings > 8.7; — semicolon, and run it with F5. There you go: it returns The Shawshank Redemption, The Dark Knight, and 12 Angry Men, because only these movies had IMDB ratings greater than 8.7. Now say you want only the movies whose IMDB rating is between 8.5 and 9. For that I'm going to use another operator, BETWEEN, along with the WHERE clause: SELECT * FROM movies WHERE imdb_ratings BETWEEN 8.5
AND 9.0; — every movie rated between 8.5 and 9.0 will be displayed. Let me select this and run it: we get The Dark Knight, The Matrix, Se7en, Interstellar, and 12 Angry Men; among the movies that missed the cut was, I think, Vertigo, which has 8.3, and one more. Moving ahead, say you want to display the movies whose genre is Action — you can see the table has a movie with that genre. I'll write SELECT * FROM movies WHERE movie_genre = 'Action' — this time on one line, though you can break it into two. Why single quotes around Action? Because Action is a string, so we need to quote it. If I run this, there you go: the one movie in our table whose genre is Action, The Dark Knight. You can also select particular columns from the table by specifying the column names. In all the examples we just saw we used *, which selects every column in the table; if you want specific columns, you list their names in the SELECT statement. Say you want to display just the movie name and genre: SELECT movie_name, movie_genre FROM movies WHERE imdb_ratings < 9.0; — this time the result shows only those two columns for the movies with ratings below 9.0. Like BETWEEN, there is one more operator you can use with the WHERE clause: IN, which behaves like a chain of OR conditions.
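The filtering patterns from this part of the walkthrough, collected in one place:

```sql
-- Delete one row, then filter the rest in different ways
DELETE FROM movies WHERE movie_id = 108;                      -- The Lion King

SELECT * FROM movies WHERE imdb_ratings > 8.7;                -- three rows
SELECT * FROM movies WHERE imdb_ratings BETWEEN 8.5 AND 9.0;
SELECT * FROM movies WHERE movie_genre = 'Action';            -- strings need quotes
SELECT movie_name, movie_genre                                -- specific columns only
FROM movies
WHERE imdb_ratings < 9.0;
```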
With IN, I can select all columns from movies WHERE imdb_ratings IN (8.7, 9.0) — running that displays only the records whose rating is exactly 8.7 or 9.0. So up to now we have covered the basic operations in SQL: mathematical operations, how a SELECT statement works, creating tables, inserting records, dropping a table from the database, UPDATE and DELETE, and how a WHERE clause works. Now it's time to load an employee CSV data set into PostgreSQL. Before importing the records we need to create an employees table, so let me go ahead and create one in our sql_demo database. I'll write CREATE TABLE employees, and then the columns. The first is employee_id, of type INTEGER; it is not going to contain any nulls, so I write NOT NULL, and I'll give the constraint PRIMARY KEY — an employee ID, as you know, is unique for every employee in a company, and the primary key ensures there is no repetition of IDs. Next is employee_name, varchar of size 40. Then the employee's email address, again a varchar of size 40. Another comma, and this time the gender of the employee — varchar of size, say, 10. Now a few more columns: department, varchar(40); then a column called address, which will hold the employees' country names, also a varchar; and finally we have the
salary of the employee, which I'll keep as type REAL so it can hold decimal or floating-point values. Now I select the CREATE TABLE statement and execute it — we have successfully created the table. If you want, you can check with SELECT * FROM employees: you can see employee_id as the primary key, plus employee_name, email, gender, department, address, and salary, but no records in any of the columns yet. Now it's time to insert records into the employees table, and for that I'm going to use a CSV file. Let me show you what it looks like. I'm in Microsoft Excel, and at the top you can see this is my employee_data.csv file, with the employee ID, employee name, email, gender, department, address, and salary. This data was generated using a simulator, so it isn't validated, and it has some missing values: under the email column a few employees have no email ID, and there are missing values in the department column as well. We'll import the records in this CSV file into PostgreSQL. In the left panel, under Tables, let me right-click and refresh first — initially we had only the movies table, and now we also have the employees table. I'll right-click employees, and you can see the Import/Export option; I don't want to export, I need to import, so I'll switch it to Import. Now it asks for the file location. My Excel file is on my E drive, in the data analytics folder; inside that I have another folder called postgresql, and within it is my CSV file, employee_data.csv.
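For reference, the same thing can be done entirely in SQL — pgAdmin's Import dialog runs a COPY behind the scenes. The file path below is reconstructed from the walkthrough's description, so treat it as an assumption and adjust it to your own layout:

```sql
-- Table definition, then a server-side CSV import
CREATE TABLE employees (
    employee_id   INTEGER NOT NULL PRIMARY KEY,  -- unique per employee
    employee_name VARCHAR(40),
    email         VARCHAR(40),
    gender        VARCHAR(10),
    department    VARCHAR(40),
    address       VARCHAR(40),                   -- holds country names
    salary        REAL
);

COPY employees
FROM 'E:/data analytics/postgresql/employee_data.csv'
WITH (FORMAT csv, HEADER true);
```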
I'll select that file — you can either type the path like this or browse to it. The format is CSV, I'll set Header to Yes, and under the Columns tab everything looks fine, so I click OK. A message appears saying the import/export completed successfully. We can verify with SELECT * FROM employees again: it says 150 rows affected, which means we have inserted 150 rows of information — 150 employees, each with a unique employee ID, plus name, email, address, salary, and the rest. Now we are going to use this employees table to explore some more advanced SQL commands. First, there is an operator called DISTINCT. If I write SELECT address FROM employees — careful with the spelling, I missed a letter the first time — the query returns all 150 addresses, and you can see the country names repeating under address: Russia, France, the United States, Germany, and I think Israel as well. Now suppose you want to display only the unique addresses, or country names: put the DISTINCT keyword before the column name. SELECT DISTINCT address FROM employees returns just six rows — Israel, Russia, Australia, the United States, France, and Germany. Now, as I said, the table has some null values with no information, and you can use the IS NULL operator in SQL to find them.
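A minimal sketch of the DISTINCT comparison:

```sql
SELECT address FROM employees;            -- 150 rows, countries repeat
SELECT DISTINCT address FROM employees;   -- 6 rows, one per unique country
```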
Suppose I want all the employees whose email ID is null: SELECT * FROM employees WHERE email IS NULL — this is another way to use the WHERE clause. If I select and run it, there you go: for all these employee names there was no email ID in the table. It returned 16 rows, so around 10% of employees do not have an email ID — and if you look closely, a few of them have no department either. To find the employees with no department, just swap the condition to WHERE department IS NULL instead of WHERE email IS NULL. Running that returns nine rows, which means around 5% of employees do not have a department. Moving ahead, let me show how the ORDER BY clause works in SQL. ORDER BY is used to sort your result in a particular order — ascending or descending. Say I want to order all the employees by salary: SELECT * FROM employees ORDER BY salary (after fixing a spelling mistake in employees, let me run it again). If you look at the output, the result has been sorted in ascending order: the employees with the lowest salaries appear at the top and those with the highest salaries at the bottom — PostgreSQL orders ascending by default. Now say you want the salaries in descending order, so the top earners appear first: add the DESC keyword, which means descending. Run it and you can see the difference — the employees with the highest salaries now appear at the top, and those with the lowest at the bottom. So this is how you can use an ORDER
BY clause. Okay, now I want to make a change to the existing table. Under the address column we only have country names, so it would be better to rename the address column to country. You can rename a column using the ALTER command: I'll write ALTER TABLE followed by the table name, employees, then RENAME COLUMN address TO country, add a semicolon, and hit Execute — the column name is changed to country. You can verify by running the SELECT statement again: what was the address column is now the country column. Now it's time to explore a few more commands — the AND and OR operators, which you use along with the WHERE clause. Say I want to select the employees who are from France and whose salary is less than $80,000. I'll write SELECT * FROM employees WHERE, and give two conditions joined with AND: WHERE country = 'France' — note I'm not using address, because we just updated the table and renamed the column from address to country — AND salary < 80000. Semicolon, and run it: 19 rows of information, all with country France and salary under $80,000. That is how you give multiple conditions in a WHERE clause using the AND operator. Next, the OR operator: say you want the employees who are from Germany or whose department is Sales.
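The rename and the combined filter, as they would be typed:

```sql
-- Rename the column, then filter on two conditions at once
ALTER TABLE employees RENAME COLUMN address TO country;

SELECT * FROM employees
WHERE country = 'France'
  AND salary < 80000;    -- both conditions must hold
```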
I'll write SELECT * FROM employees WHERE country = 'Germany' OR department = 'Sales', and run it with F5 this time: 23 rows of information. Let me scroll to the right — in each row, either the country is Germany or the department is Sales. For the first record the country was Germany, for the second the department was Sales, Sales again for the third, and for the fourth record the country is Germany. That is how the OR condition works: if one of the conditions is true the row is returned; it need not be that both conditions are satisfied. Now, PostgreSQL has another feature called LIMIT — an optional clause on the SELECT statement used as a constraint to restrict the number of rows returned by the query. If you want to display the top five rows of a table, you use the LIMIT clause; if you want to skip the first few rows and display the next few, you can do that with LIMIT and OFFSET. Let's explore how they work. I'll write SELECT * FROM employees, use my ORDER BY clause — ORDER BY salary, say descending — and LIMIT 5. This displays the five employees with the highest salaries; run it, and there you go, five rows of information, the top five earners. That is one way of using the LIMIT clause. In case you want to skip a number of rows before returning the result, add an OFFSET clause: SELECT * FROM employees ORDER BY salary DESC LIMIT 5 OFFSET 3. What this query does is skip the first three rows and then return the next five.
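Those queries in one place:

```sql
SELECT * FROM employees
WHERE country = 'Germany' OR department = 'Sales';   -- either condition suffices

SELECT * FROM employees ORDER BY salary DESC LIMIT 5;           -- top 5 earners
SELECT * FROM employees ORDER BY salary DESC LIMIT 5 OFFSET 3;  -- skip 3, take 5
```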
There is another clause called FETCH — let me show how that works. I'll copy my previous SQL query, paste it, and after DESC write FETCH FIRST 3 ROWS ONLY. FETCH gives me the first three rows from the top — there you go, the top three employees with the highest salaries, since we ordered by salary descending. You can also use OFFSET along with the FETCH clause: I'll copy the query again, and after DESC write OFFSET 3 ROWS FETCH FIRST 5 ROWS ONLY. This skips the first three rows of information and then displays the next five — it works exactly the same as the LIMIT/OFFSET query we just saw. Run it, and there you go: the first five rows after excluding the top three. Next we have another operator called LIKE, which PostgreSQL uses for pattern matching. Suppose you have a table of employee names, you've forgotten an employee's full name, but you remember a few of the initials — you can use the LIKE operator to narrow down which name it is. Let's explore some examples. Say you want the employees whose name starts with A. I want to display the employee name and, say, the email ID, so: SELECT employee_name, email FROM employees WHERE employee_name LIKE — the pattern goes in single quotes — 'A%'. The A means the name must begin with A, and the percent sign means any other letters can follow it; but the first letter
should be A. If I run this — there is an error: the name of the table is employees, not employee — and run again, there you go: 16 employees in our table whose names start with A; you can see the letter A at the beginning of every value in the employee_name column. Now let me copy the query and paste it. Say this time you want the employees whose name starts with S: instead of A I'll write 'S%', meaning the starting letter must be S, followed by anything — 10 employees in the table match. Let's copy the query again; this time I want the employees whose name ends with d. Instead of 'A%' I write '%d', which means the name can begin with any letters, but the last letter of the string must be d. Run it: 13 employees whose names end with a d. Now say you want the employees whose name contains "ish". I'll replace the pattern with '%ish%', which means the name can have any letters at the beginning and any letters at the end, but "ish" must appear somewhere within it. Let me run it: there is one employee whose name contains "ish" — you can see it in the last name. Finally, suppose you want the employee names that have u as the second letter: the name can begin with any letter, but the second letter must be u. For that I write '_u%'. The underscore is like a blank that stands for exactly one character of any kind, so the name can begin with any of the 26 letters.
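The four wildcard patterns side by side:

```sql
-- % matches any run of characters; _ matches exactly one character
SELECT employee_name, email FROM employees WHERE employee_name LIKE 'A%';    -- starts with A
SELECT employee_name, email FROM employees WHERE employee_name LIKE '%d';    -- ends with d
SELECT employee_name, email FROM employees WHERE employee_name LIKE '%ish%'; -- contains "ish"
SELECT employee_name, email FROM employees WHERE employee_name LIKE '_u%';   -- u as 2nd letter
```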
Then the second letter must be u, followed by any other letter or letters. Let me run this: 10 employees in the table have u as the second letter of their name. Moving ahead, let me show how to use some of SQL's basic built-in functions — we'll explore a few mathematical ones now. Say you want the total sum of salaries for all employees. SQL has a SUM function for that: I write SUM, and inside the function I give the column name, salary, FROM my table, employees. This returns one value — and since the value is very large, it is shown in scientific (e) notation. One thing to note: the output column is just labelled "sum", which isn't really readable. SQL has a fix for that, called an alias. Since we are summing the salary column, we can alias the operation using the AS keyword: if I write SUM(salary) AS total_salary, then total_salary becomes the output column. You can see the difference when I run it — the output now says total_salary, which is much more readable than before. Aliasing is a handy feature for naming your columns and results. Similarly, say you want the average salary of all the employees: SQL has a function called AVG which calculates the mean. I write AVG(salary), and I can edit the alias as well — say mean_salary. Run it: the average salary for all employees is around $81,000. There are two more important functions SQL provides, MAX and MIN. I'll write SELECT MAX(salary) and, instead of total, alias it as maximum.
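The aggregate-with-alias pattern:

```sql
SELECT SUM(salary) AS total_salary FROM employees;  -- one value, readable name
SELECT AVG(salary) AS mean_salary  FROM employees;  -- around $81,000 here
```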
Running that returns the highest salary present in the salary column: $119,616 for one of the employees. Similarly you can use the MIN function — I'll just write MIN(salary) and replace the alias with minimum — which returns the lowest salary of any employee in the table: $4,680. Now say you want the count of departments in the employees table; you can use the COUNT function. I want to know the distinct department names, so inside the COUNT function I write DISTINCT department: SELECT COUNT(DISTINCT department) AS total_departments FROM employees. Run it: it tells me there are 12 departments. Let me show one more thing. If I write SELECT department FROM employees and run it, I get 150 rows of information; but if I place the DISTINCT keyword just before the column name, I can verify how many departments there are in total. There you go — 13 rows, and one of them is NULL. Moving ahead, we'll replace that NULL with a department name by updating the table: wherever the department has a null value, we'll assign a new department called Analytics. We learned the UPDATE command earlier, and here it is again: UPDATE, the table name employees, SET the column — department = 'Analytics' in single quotes — WHERE department IS NULL. So wherever the department is null, we replace it with the Analytics department. Let's run it.
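Those steps written out as statements:

```sql
SELECT MAX(salary) AS maximum FROM employees;        -- highest salary
SELECT MIN(salary) AS minimum FROM employees;        -- lowest salary
SELECT COUNT(DISTINCT department) AS total_departments
FROM employees;                                      -- NULLs are not counted

-- Give the employees with no department one, called Analytics
UPDATE employees
SET department = 'Analytics'
WHERE department IS NULL;
```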
The query returns successfully, and if I run the DISTINCT query again you can see the difference: there are still 13 rows of information, but no NULL department — we have added the new Analytics department in its place. Now we are going to explore two more crucial clauses in SQL: GROUP BY and HAVING. The GROUP BY statement groups rows that have the same values into summary rows — for example, you can find the average salary of employees in each country, city, or department — and it is used in collaboration with the SELECT statement to arrange identical data into groups. Suppose you want the average salary of employees by country. I'll write SELECT, then the countries and the average salary for each: country, then the AVG function with the salary column inside and the alias average_salary, FROM my table, employees. Next comes the GROUP BY clause — since I want the average per country, I write GROUP BY country. Semicolon, and run it with F5. There you go: on the left are the country names — Israel, Russia, Australia, the United States, France, and Germany — and in the second column the average salary for each of them. You can also order the result however you want. Suppose you want to arrange the results by average salary: use an ORDER BY clause after the GROUP BY clause. I'll write ORDER BY average_salary — you can use the alias here — and say I want it descending, so I add DESC. Run it and you can see the difference in the average_salary column.
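The grouped query in full:

```sql
-- One row per country, sorted on the aggregate via its alias
SELECT country, AVG(salary) AS average_salary
FROM employees
GROUP BY country
ORDER BY average_salary DESC;
```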
As per our result, the United States now has the highest average salary and, scrolling down, Germany the lowest. Let's see one more GROUP BY example: suppose this time you want the maximum salary of male and female employees. I'll write SELECT, then the gender column, then the MAX function over salary with the alias maximum_salary, FROM employees, GROUP BY gender. Run it, and there you go: one of the female employees had the highest salary, around $119,618, while the male maximum was around $117,654. Now suppose you want the count of employees for each country — use the COUNT function along with GROUP BY. I'll write SELECT country, then COUNT(employee_id), FROM employees, GROUP BY country. This query gives the total number of employees from each country: Israel has four employees, Australia four, Russia 80, France 31, the United States 27, and so on. Now let me scroll down — it's time to explore one more very important clause in PostgreSQL: HAVING. The HAVING clause works like the WHERE clause, with one difference: WHERE cannot be used with aggregate functions, while HAVING is used together with the GROUP BY clause to return only those groups that meet a condition. Suppose you want to find the countries in which the average salary is greater than $80,000 — you can use
the GROUP BY clause and the HAVING clause to get the result. I'll write my SELECT statement: SELECT country, then the average salary, AVG(salary), with an alias name average_salary, FROM employees. Now I group it by country: GROUP BY country. Since I want the countries in which the average salary is greater than $80,000, I'll use the HAVING clause after the GROUP BY clause: HAVING AVG(salary) > 80000. Now, this condition cannot be specified in the WHERE clause, so we need HAVING; you cannot use aggregate functions with the WHERE clause. Let me run it. There you go: Russia and the United States are the countries where the average salary is greater than $80,000. All right, now let's say you want to find the count of employees in each country where there are fewer than 30 employees. For this I'll use the COUNT function: first I select the country column, then I use the COUNT function, passing in my employee ID so we can count the employees, FROM my table employees. If you want, you can use an alias name for this as well, but I'm skipping it for now. I'll write GROUP BY country, and next HAVING COUNT(e_id) < 30. This returns the countries with fewer than 30 employees. Let's run it: you can see Israel, Australia, the United States, and Germany are the countries with fewer than 30 employees. Okay, if you want, you can use the ORDER BY clause as well. Suppose I write ORDER BY COUNT(e_id): this arranges my result in ascending order of the employee count, and there you can see we have successfully arranged the result in ascending order of employee counts. Okay, next we'll explore one more feature of PostgreSQL: the CASE expression. In PostgreSQL, the CASE expression is the same as an if-else statement in
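Reassembled from the walkthrough, the two HAVING queries might look like this (column names assumed from the demo):

```sql
-- Countries whose average salary exceeds $80,000
SELECT country,
       AVG(salary) AS average_salary
FROM employees
GROUP BY country
HAVING AVG(salary) > 80000;

-- Countries with fewer than 30 employees, smallest count first
SELECT country,
       COUNT(e_id) AS employee_count
FROM employees
GROUP BY country
HAVING COUNT(e_id) < 30
ORDER BY COUNT(e_id);
```

The key distinction: WHERE filters individual rows before aggregation, while HAVING filters whole groups after the aggregates are computed.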
any other programming language. It allows you to add if-else logic to a query to form a powerful query. Let me scroll down and show you how to use a CASE expression; it's very similar to the IF/ELSE you use in Excel, C++, Python, or any other programming language. What I'm going to do is write a SQL query that creates a new column; the name of the column will be, let's say, salary_range. If the salary is greater than $45,000 and less than $55,000, the new salary_range column gets the value Low salary. If the salary is greater than $55,000 and less than $80,000, it gets Medium salary. If the salary is greater than $80,000, it gets High salary. All this we'll do using the CASE expression in PostgreSQL. I'll start with my SELECT statement, but before that, let me show you how to write a comment in PostgreSQL: you write a comment with a double dash. Comments are very helpful because they make your scripts readable, so I'll write the comment "case expression in PostgreSQL", and similarly, at the top you could write "having clause" after a double dash. Okay, let's come down. I'll write SELECT department, country, salary, give a comma, and start the CASE expression: CASE WHEN salary > 45000 AND salary < 55000 THEN, and within single quotes, 'Low salary'. This is exactly like an if-else condition. Next, WHEN salary > 55000 AND salary < 80000 THEN 'Medium salary', and finally my last condition, WHEN salary > 80000 THEN 'High salary'; let me write that one on a single line. Now, one thing to remember: in PostgreSQL the keywords are case-insensitive, so you can write your SELECT statement in uppercase, lowercase, or sentence case, and likewise write CASE with a small or a capital C. All right, moving ahead, after this I write END and give an alias name, salary_range; this will be my new column in the output. Let me come down: after this we give the table name, FROM employees, and I'll order it by salary descending. So what I'm doing here is first selecting the department, country, and salary columns from my employees table, then creating a new column, salary_range, with three conditions: low salary, medium salary, and high salary. Let's run this and see the output. There you go: we've added a new column, salary_range, and ordered salary in descending order, so all the highest salaries appear at the top. If I scroll down you can see the medium salaries, and scrolling further, the low salaries. CASE expressions are really useful when you want to create a new column based on conditions over the existing table. All right, moving ahead, we'll now see how to write subqueries in PostgreSQL. With subqueries, we write a query inside another query, which is also known as a nested query. Suppose we want the employee name, department, country, and salary of those employees whose salary is greater than the average salary; in such cases you can use subqueries. Let me show you how to write a query inside another query. First the SELECT statement: I select the employee name, a comma, the department, a comma, the country, and the salary from the employees table, WHERE my salary should be greater than the average salary. So after WHERE salary >, I open brackets and write my subquery, which is: select the average salary from
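Putting the pieces of the CASE walkthrough together, the query might read:

```sql
SELECT department,
       country,
       salary,
       CASE
           WHEN salary > 45000 AND salary < 55000 THEN 'Low salary'
           WHEN salary > 55000 AND salary < 80000 THEN 'Medium salary'
           WHEN salary > 80000 THEN 'High salary'
       END AS salary_range
FROM employees
ORDER BY salary DESC;
```

One detail the demo doesn't dwell on: rows matching none of the WHEN branches (here, salaries of $45,000 or less) get NULL in salary_range; an ELSE branch could assign a default label instead.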
employees. Now let me break it down for you. First we select the average salary from the employees table: this inner SQL statement finds the average salary, and we compare that average with the salaries of all the employees, so for whichever employee has a salary greater than the average, we display their name, department, country, and original salary. If you want, you can run the inner statement on its own; let me select it and run it for you. You can see it returns the average salary of all the employees, which is nearly $81,406. We want the employees whose salary is greater than this average value, so let me run the full query and see how many qualify. There you go: around 75 employees have a salary greater than the average salary. All right, moving ahead, this time I'll show you how to use some built-in functions; we'll learn some built-in mathematical functions and string functions available in PostgreSQL. I'll give a comment first. There's another way to write a comment: instead of a double dash you can use a forward slash and an asterisk, write, let's say, "SQL functions" inside, and close it with another asterisk and a forward slash, so /* SQL functions */ is also a comment in PostgreSQL. All right, first of all we'll explore a few math functions. There's a function called ABS, which is used to find the absolute value: if I write SELECT ABS(-100), it returns positive 100, or just 100, because, as you know, the absolute value removes the negative sign. There you go: our original input was -100, and the absolute value of -100 is +100. Next let's see another function, called GREATEST: the GREATEST function in PostgreSQL returns the greatest number in a list of numbers. Suppose I write SELECT GREATEST, and inside the
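The subquery example above, written out in full (column names are assumptions based on the demo):

```sql
-- Employees earning more than the overall average salary
SELECT employee_name,
       department,
       country,
       salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
```

The inner SELECT runs first and produces a single value, which the outer WHERE then compares against every row.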
greatest function I'll pass in a few numbers; I'm just picking them at random, let's say 2, 4, 90, 56.5, and 70, then a semicolon. Let me run this: you'll see the GREATEST function returns the greatest number present in the list we provided, and since 90 was the largest number here, we got the result 90. Again, you can use an alias for each of these statements. Like GREATEST, we also have a function called LEAST, which returns the least number present in a list; if I run this, the result is 2, because 2 is the least number in this selection. All right, now there's a function called MOD, which returns the remainder of a division. Suppose I write SELECT MOD, which takes two parameters, let's say 54 and 10; as you can guess, the remainder of 54 divided by 10 is 4, and so is our result. All right, if I scroll down, let's see how to use the POWER function. I write SELECT POWER(2, 3), which is 2 cubed, that is 8; let me run this, and there you go, the result is 8. You can also check, say, POWER(5, 3); it should be 125. Next you can use the SQRT function available in PostgreSQL to find the square root of a number. I'll write SQRT(100); you can guess the result, the output should be 10, and if I run this you can see the output, 10. Let's say I want the square root of 144; again you can guess it should be 12. Let's verify it. Okay, there was some error; let me verify again. There you go, it is 12. Now there are a few trigonometric functions as well: you can use the SIN function, the COS function, and the TAN function. Let's say I want the sine of 0; if you've studied high-school mathematics, you'd know the sine of 0 is 0, and you can see the result is 0. Let's say you want SIN(90); if I run it, the output is about 0.894, because these functions take their argument in radians rather than degrees. All right, there are other functions, like CEILING and FLOOR, that you can use, so let me show you what they do. I write CEILING and pass the floating-point value 6.45; let me run it, and you can see the CEILING function returns the next highest integer, 7 in this case, since the next integer after 6.45 is 7. Let's see what the FLOOR function does and run it: as you can see, the FLOOR function returns the nearest lower integer, 6 in this case, for any provided decimal value. Okay, now that we've seen the mathematical functions, there are a few string functions available in PostgreSQL, so let's explore them as well. I'll write the comment "string functions" and scroll down. Cool. There's a function called CHARACTER_LENGTH that gives you the length of a text string. Suppose I write SELECT CHARACTER_LENGTH and inside the function pass a text, let's say 'India is a democracy'; let me run this. Okay, you can see the result, 20, since there are 20 characters in the string I provided, spaces included. All right, there's another function called CONCAT in PostgreSQL; CONCAT is basically used to merge or combine multiple strings. I write SELECT CONCAT and within brackets give the text strings: let's say I want to combine 'PostgreSQL', then a comma, ' is', another comma, and my final word, ' interesting'. What we've done is pass separate strings into the CONCAT function so it merges the three. Let's see what the result is; I'll run it and expand this. You can see we've concatenated the three strings successfully, so the output is "PostgreSQL is interesting". Okay, now there are functions like
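As a quick recap of the math functions just demonstrated, with the results called out in the demo (all valid PostgreSQL; SIN works in radians):

```sql
SELECT ABS(-100);                     -- 100
SELECT GREATEST(2, 4, 90, 56.5, 70);  -- 90
SELECT LEAST(2, 4, 90, 56.5, 70);     -- 2
SELECT MOD(54, 10);                   -- 4
SELECT POWER(2, 3);                   -- 8
SELECT SQRT(144);                     -- 12
SELECT SIN(0);                        -- 0
SELECT CEILING(6.45);                 -- 7
SELECT FLOOR(6.45);                   -- 6
```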
LEFT and RIGHT in PostgreSQL (for the middle of a string, PostgreSQL uses SUBSTRING rather than MySQL's MID). What the LEFT function does is extract the number of characters you specify from the left of a string. Let's say I write SELECT LEFT and pass in my text string, 'India is a democracy'; I'll copy it and paste it here. Say I want to extract the first five characters from my string, so I give 5. What it will do is count five characters from the left, 1 2 3 4 5, so if I run this it should ideally print India for me. There you go, it has printed India for us. All right, similarly you can use the RIGHT function to extract characters from the right of a string. Let's say you want 12 characters from the right, so it counts 12 characters from the end; I change LEFT to RIGHT, select this, and run it. You can see the output: counting 12 characters from the right returns "a democracy". Okay, now there's a function called REPEAT: the REPEAT function repeats a particular string the number of times you specify. Let's say I SELECT and use my REPEAT function, and inside it I pass 'India' and ask for India to be displayed five times; I give a semicolon and run it, and in the output you can see India has been printed five times. Okay, let's scroll down. There's another string function in PostgreSQL called REVERSE: what the REVERSE function does is print any string passed as input in reverse order. If I write SELECT REVERSE and inside the function pass my string, 'India is a democracy' (the same string, copied and pasted, closing the quotes and the brackets), and run it, you can see 'India is a democracy' has been printed in reverse order. There you go. All right, so far we've explored a few built-in functions already present in PostgreSQL, but PostgreSQL also has a feature where you can
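The string functions from this stretch of the demo, collected in one place with their outputs:

```sql
SELECT CHARACTER_LENGTH('India is a democracy');    -- 20
SELECT CONCAT('PostgreSQL', ' is', ' interesting'); -- PostgreSQL is interesting
SELECT LEFT('India is a democracy', 5);             -- India
SELECT RIGHT('India is a democracy', 12);           -- " a democracy"
SELECT REPEAT('India', 5);                          -- IndiaIndiaIndiaIndiaIndia
SELECT REVERSE('India is a democracy');             -- ycarcomed a si aidnI
```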
write your own user-defined functions. So now we'll learn how to write a function of our own in PostgreSQL. Let's create a function to count the total number of email IDs present in our employees table; for this we'll write a user-defined function, so let me give my comment as "user defined function". Okay, let me start by writing CREATE; the syntax for a function in PostgreSQL is CREATE OR REPLACE FUNCTION, then my function name, count_emails, with brackets, as functions have. Then I write RETURNS with the return type, integer, then AS with a dollar-quoted tag: between dollar symbols I write total_emails, since I'm going to display the total number of email IDs present in my table, and close the dollar symbol. Then I declare a variable: the variable name is total_emails, of type integer. I write BEGIN, and inside BEGIN my SELECT statement: I want to count the email IDs present, so SELECT COUNT(email) INTO total_emails FROM my table, employees, with a semicolon, and then RETURN total_emails; as you know, user-defined functions often return a value, hence the RETURN statement. And now I end my function with END. The next bit of syntax, let me scroll down: okay, here I give my dollar symbol again, followed by total_emails, and next I specify the language; the way to mention it is LANGUAGE plpgsql, then a semicolon to end it. So this is the user-defined function I've written: I created a function with the name count_emails that RETURNS integer, with total_emails as the dollar-quote tag; we declared the total_emails variable as an integer, then started the BEGIN block with my SELECT statement, which counts the email IDs present in the employees table and puts the value into total_emails, using the INTO keyword, returns the result as total_emails, and ends. Let's run this. Okay, there's a problem, a typo: this should be integer. Let me run it once again, and there you go, you've successfully created a user-defined function. Now the final step is to call that function. To call it, I use my SELECT statement with the function name, count_emails, and a semicolon; let's execute this. There you go: you can see there are 134 email IDs present in our employees table. One thing to note: there are 150 employees in total in the table, but only 134 of them have email IDs; the rest don't, so they have NULL values, which the count skips. All right, that brings us to the end of this demo session on the PostgreSQL tutorial. Let me go to the top; we've explored a lot. We started by checking the version of PostgreSQL, then saw how to perform basic mathematical operations, adding, subtracting, multiplying. Then we saw how to create a table, movies, inserted a few records into it, used our SELECT clause, updated a few values, and deleted one row of information. Then we learned how to use the WHERE clause, the BETWEEN operator, and the IN operator. Let me scroll down: we created a table called employees, then learned how the DISTINCT keyword works, how to use IS NULL with the WHERE clause, and about the ORDER BY clause. We saw how to alter or rename a column, then explored a few more examples of the WHERE clause with the AND and OR operators, then learned LIMIT and OFFSET as well as the FETCH keyword in PostgreSQL. Moving further, we learned about the LIKE operator in SQL, which is used to perform pattern matching. We saw how to use basic built-in PostgreSQL functions like sum, average, minimum, count,
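Assembled from the walkthrough above, the user-defined function and its call might look like this (assuming the employees table and its email column from the demo):

```sql
CREATE OR REPLACE FUNCTION count_emails()
RETURNS integer AS $total_emails$
DECLARE
    total_emails integer;
BEGIN
    -- COUNT(email) skips NULLs, so only rows that actually
    -- have an email address are counted
    SELECT COUNT(email) INTO total_emails FROM employees;
    RETURN total_emails;
END;
$total_emails$ LANGUAGE plpgsql;

-- Call the function
SELECT count_emails();
```

The text between the `$total_emails$` markers is a dollar-quoted function body; the tag name is arbitrary, it just has to match at both ends.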
maximum. Next we saw how to update a value in a column using the PostgreSQL UPDATE command, learned how to use GROUP BY, then the HAVING clause, then CASE expressions in PostgreSQL, where we saw how a CASE expression is similar to if-else in any other programming language. We explored a few mathematical and string functions, and finally we wrote our own user-defined function. So that brings us to the end of this tutorial on PostgreSQL. In this session we'll learn how to join three or more tables in SQL. That's right: so far we have a fundamental understanding of how to join two tables, but in a few situations you might have to extract data by joining three or more tables, and that's exactly what we're going to discuss today. Without further delay, let's get started. We'll jump into the MySQL Workbench, where we have our query ready. Here we'll be using three tables: employee details, employee register, and employee joining register. We want the employee name, contact number, and joining date: the joining date is available in the joining register, the contact number in the employee register, and the employee name in employee details. So we're utilizing all three tables and joining them to extract these three columns. Here I'm providing both the table name and the column name, to make sure the SQL workbench won't get confused about which employee name column to access or which table to use. To clear that confusion I provide the table name, so MySQL will use the employee details table and extract the employee name column from there. The contact number is present in only one table, so it will go to the employee register, and the joining date is also present in only one table, so there's no need to specify the table name; but to be on the safe side, you can add it anyway, and it will make an impression on your interviewer, showing
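The three-table join described here could be sketched like this; the table and column names are assumptions based on how they are described in the demo:

```sql
SELECT emp_details.employee_name,
       emp_register.contact_no,
       joining_register.joining_date
FROM emp_details
JOIN emp_register
  ON emp_details.employee_id = emp_register.employee_id
JOIN joining_register
  ON emp_details.employee_id = joining_register.employee_id;
```

Each additional table gets its own JOIN ... ON pair keyed on the shared employee_id, so the same pattern extends to a fourth or fifth table.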
the order ID, it's the same, so let's quickly change all the column headers so they're SQL-compatible. There you go: we've replaced the spaces in all the headers with underscores and made them SQL-compatible. Let's save it, and when you're saving, make sure the Excel file's name is also SQL-compatible. Here we have the name "load CSV to my SQL", with spaces, so let's change that to lowercase excel_data, and now it's SQL-compatible. Now quickly back to MySQL Workbench. Here you can create a new database: just right-click here, or do it from the toolbar, and the new schema name will be the name of your data set, so let's type excel_data, or just excel, and Apply. This is the schema; Apply again, and now you have excel right here. Expand it and you can see Tables; right-click Tables, and here you have the Table Data Import Wizard option. Just click on that, then browse your folders for the file. And another note for you: you need to save your Excel spreadsheet as a comma-separated file, so let's quickly go back and do that. Open your spreadsheet, go to File, Save As, choose comma-separated (CSV), and save. There you go; now back in Workbench, I think you'll be able to find it. Open it, click Next, and check "Drop table if exists"; make sure you do that to be on the safe side, and check all the column names here. We have a problem with the row ID, but that's something you can fix down the line; before you go to the next step, check all the other names as well. Every other column header is fine, just the first one, and we can simply alter the table later, not a big deal. Next, it should start importing. There you go, the data file got imported; of course it took a little while because it's about 10,000 rows, which is normal. Let's quickly go to the next step, and here you can see 9,987 records imported successfully. Just click Finish, and I think it should be done shortly. Let's close the schema, go to the query tab, and quickly refresh so the excel_data table appears. Now let's use the database, USE excel; we're in the excel database. The query we're looking for is SELECT * FROM the table, whose name is excel_data, without a space, then a semicolon. Just quickly run it, and shortly we can see the whole data set right here. About the first column's ID: we can simply use the ALTER TABLE statement to change the name. Let's quickly do that: ALTER TABLE with the table name, then RENAME COLUMN; okay, what was the field name? Let me copy it, then RENAME COLUMN that to row_id. There was a small syntax error, a lowercase c, but I think it's sorted now. Let's quickly run this, and I guess it's done; now let's quickly run the SELECT command again. There you go: we have the row ID, order ID, order dates, ship dates, ship mode, customer ID, customer name, etc., and everything is as expected. And that's how you can load Excel data into MySQL Workbench. Next we will learn about the top five interview questions in SQL that you must know to crack your business analytics interviews. Now, without further ado, let's get started. Speaking of the top five interview questions, let's quickly jump to the SQL workbench. I'm using MySQL Workbench, and here we have a database called slp, short for SimplyLearn, so we'll use the USE command to get access to that particular database. We have access to it, so now let's quickly check the tables: SHOW TABLES. There you go, we have a few tables here: book collection, book order, employee details, joining register, and employee register. Let's go with employee details: I'll simply shoot a command, SELECT * FROM emp_details, and now we have our employee table. We'll use this particular data set to run a few queries from our interview questions. So getting back to the
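The import-then-rename steps above boil down to something like this in MySQL (names are assumed from the demo, `old_first_column` is a placeholder for whatever name the import wizard brought in, and RENAME COLUMN requires MySQL 8.0+):

```sql
USE excel;

-- Rename the problematic first column after the CSV import
ALTER TABLE excel_data
RENAME COLUMN old_first_column TO row_id;

-- Verify the import
SELECT * FROM excel_data;
```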
interview questions: most commonly you'll be asked the following. The first question: you'll be asked to find the names of the employees that start with a vowel, so a, e, i, o, u. They'll ask for the list of names that start with any of those five letters. So how do we answer it? Here you'll use the LIKE and NOT LIKE operators. Say they want the names starting with vowels: you can use the LIKE operator with a command like SELECT employee_name (that's the name of the column) FROM employee register or employee details WHERE employee_name LIKE 'a%'; the percent wildcard means the name starts with a and can have any number of characters after it. Let's also quickly check what we have in the employee register; I think it's a table similar to the employee details we used before, with the same columns, so either works. Now let's try to extract the names of the employees whose names start with vowels, using the WHERE clause and the LIKE operator; run the command, and there you go, we have three names. If you're able to answer this question, they might ask a similar one with a little modification: this time, give the list of names that do not start with vowels. You just replace the LIKE operator with NOT LIKE. So they'll either ask for names that start with vowels or names that don't; this is one of the common questions. Now on to the second question: here also they'll give it a simple start and ask for the details of the employee who has the highest salary, or just the highest salary. You can simply use the MAX aggregate to get the maximum salary, which here is $87,000. And sometimes, if you're able to answer this, they'll make it a little trickier and ask for the second highest salary. Here you can use OFFSET, and there are multiple possibilities, but let me give you the simplest one: you keep the same query, the only difference being WHERE salary < MAX(salary) as a subquery. First the subquery executes and extracts the maximum salary, and then the outer query returns the one salary just below it, the largest salary that is less than the maximum. So let's simply execute the query and we have the answer: the second highest salary is $78,000. Now moving on to the third question: sometimes they'll ask you to use UPDATE commands as well. Here we have some salary details of our employees in the employee table. Let's say this is the appraisal period and everyone is getting a 15% hike, so you need to update the salary column. What you do is simply UPDATE the table, SET salary = salary + salary * 0.15, adding that particular percentage in the form of a decimal number, 0.15. Simply run the command, and you have it; now you can just query the same table again, SELECT * FROM employee details, and you'll have the updated salary list. There you go. Now let's proceed with the next question on our list, which is about selecting the employee name and salary from a given range. If they ask you for only the employees whose salary lies between 50,000 and 70,000, you can use the BETWEEN operator with that range of numbers, 50,000 and 70,000. Just run this particular code, and there you go, you have the details: there are two employees whose salary lies between 50,000 and 70,000. Now the last question: they might ask you to extract the details of a certain department, and this often turns into explaining the difference between the HAVING and WHERE clauses. Yeah, so this is one of the common interview questions, where they'll ask the difference
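Sketches of the interview queries discussed above, in MySQL-style syntax with table and column names assumed from the demo:

```sql
-- Q1: names starting with a vowel (repeat for e, i, o, u, or use REGEXP)
SELECT employee_name FROM emp_details
WHERE employee_name LIKE 'a%';

-- Q2: highest, then second-highest, salary
SELECT MAX(salary) FROM emp_details;
SELECT MAX(salary) FROM emp_details
WHERE salary < (SELECT MAX(salary) FROM emp_details);

-- Q3: 15% hike for every employee
UPDATE emp_details SET salary = salary + salary * 0.15;

-- Q4: salaries within a range (BETWEEN is inclusive at both ends)
SELECT employee_name, salary FROM emp_details
WHERE salary BETWEEN 50000 AND 70000;
```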
between the HAVING and WHERE clauses. When you're implementing GROUP BY in your query and also using aggregate functions like COUNT, SUM, MIN, or MAX, that is, when a GROUP BY clause is involved, you can use HAVING; when there's no GROUP BY, you can simply go with the WHERE clause. Here I'm trying to extract the number of people in the finance department; I'm not grouping by department, so I can just use the WHERE clause and run this. And in situations where I have to group, where I implement GROUP BY, I'd include the HAVING clause in place of WHERE. So that's the fundamental difference between HAVING and WHERE, and you also got an understanding of the GROUP BY command here. Imagine walking into a giant library. This isn't just any library, it's huge: there are rows and rows of shelves, each packed with thousands of books. But wait, you don't need to read every book here; you're just looking for specific ones, like books about space or stories about superheroes. But finding what you need in such a huge library is going to be tricky. This massive library is like a database. A database is a huge collection of information stored neatly, ready to be used. It holds everything: names, addresses, grades, prices, whatever data you can think of. But sometimes all that information can be overwhelming; you don't want to sift through everything every time you need something specific, right? That's where views come in. In this video we'll explore views in SQL, explaining what they are and how they simplify working with databases. We'll also cover how to create views, manage them by updating, deleting, and listing them, and introduce different types of views, like simple, complex, read-only, and those with a check option. We'll also dive into materialized views, which store data for faster queries, and by the end you'll understand how views can make managing data easier and more efficient. We will also
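A short sketch of the contrast being drawn here, assuming an emp_details table with department and salary columns:

```sql
-- No grouping: WHERE filters individual rows before counting
SELECT COUNT(*) FROM emp_details
WHERE department = 'Finance';

-- Grouping: HAVING filters whole groups after aggregation
SELECT department, COUNT(*) AS headcount
FROM emp_details
GROUP BY department
HAVING COUNT(*) > 5;
```

The first query answers "how many people are in Finance"; the second answers "which departments have more than five people", a question only expressible after grouping.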
look at a quiz question to clarify your understanding. So what exactly is a view? Let's go back to our library example. Imagine you had a magic window, a special one that only shows you the books about space or superheroes that you're interested in. You don't have to wander through the entire library anymore; you just look through your magic window and it gives you exactly what you need. That magic window is what a view is in the world of databases: a view is a special virtual window into your data that shows you only what you need to see. And the best part is it doesn't actually store any new data; it gives you a filtered look into a huge database. Think of a view as a shortcut, making your life a whole lot easier. So let's get into the demo part: how to create a view and the types of views in SQL. Let's start with the demo, that is, how we create a table in MySQL. Here, as you can see, I've logged into an online compiler, and now we'll learn how to create a table, then move on to creating views and the types of views. First, to create a table, just enter this command: write CREATE TABLE, then the table name, which could be something like student_details, and here you give the columns. We want a student ID in the first column, so we'll call it s_id with the type integer, which we mention here, and since it's the primary key, you just mention PRIMARY KEY

The next column can be the name, with type VARCHAR; you can give it any length, say 255. Then comes the address, again VARCHAR. So we have created this student_details table with s_id as an integer primary key, plus name and address columns. After creating the table, the next step is to insert data into it. The basic command is INSERT INTO followed by the table name, student_details, then the columns we created (s_id, name, and address), and then the VALUES. I've inserted a handful of sample names here. Now, if I want my records to be displayed, the command is SELECT * FROM student_details; click the run button, and there you can see the s_id, name, and address values we entered. So the table is created and the output is shown. Now we move on to the main step, which is creating a view. A view is like a window that lets you see specific data, as I told you in the intro. Let's say we only care about students with s_id less than five; instead of running the same query every time, we can create a view.
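Before moving on to views, here is the table setup from this part in one place. This is a sketch of the demo; the names and addresses are made-up placeholders, since the spoken originals aren't fully legible in the recording:

```sql
CREATE TABLE student_details (
    s_id    INT PRIMARY KEY,
    name    VARCHAR(255),
    address VARCHAR(255)
);

-- Placeholder sample rows
INSERT INTO student_details (s_id, name, address)
VALUES (1, 'Harsh',  'Delhi'),
       (2, 'Ashish', 'Mumbai'),
       (3, 'Pratik', 'Pune'),
       (4, 'Tanya',  'Chennai'),
       (5, 'Sham',   'Kolkata');

SELECT * FROM student_details;
```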
Here's how you can do it: CREATE VIEW detail_view AS SELECT name, address FROM student_details WHERE s_id < 5. I want students with an ID less than five to be shown, so let's click run. And as you can see, it shows an error: "CREATE VIEW must be the first statement in a query batch." To resolve this, we have to ensure the SQL batches are properly separated by the GO command and that the view creation is syntactically correct. We hadn't used GO here, so right after inserting the records into the table, type GO; that ends the first batch. Now the view is created in a new batch, and at the end of that second batch we write GO again. Next, I want the result shown, so I write SELECT * FROM, and here I want the view's name, not the student_details table: SELECT * FROM detail_view. Add the semicolon, and now you can see the view is created. Why have we used GO? Because GO ensures the CREATE VIEW command is executed in a new batch, which helps SQL Server properly separate commands and avoid conflicts. The SELECT * FROM detail_view fetches the data from the view and includes only students with a student ID less than five, and their names and addresses are displayed. That is our output. Now let's talk about managing and updating views. Say you later want the view to also include the students' ages; instead of deleting and recreating the view, you can use CREATE OR REPLACE VIEW to update it.
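To recap the batch-separation fix just described: GO is a batch separator understood by SQL Server tools such as SSMS and sqlcmd, not a T-SQL statement itself.

```sql
-- End the batch that created and populated the table
GO

-- CREATE VIEW must be the first statement in its own batch
CREATE VIEW detail_view AS
SELECT name, address
FROM student_details
WHERE s_id < 5;
GO

-- Query the view like a table; only rows with s_id < 5 come back
SELECT * FROM detail_view;
```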
So let's add the age column to the table and insert the age data for the students. To add a new column to student_details, the command is ALTER TABLE student_details ADD age INT, with a semicolon at the end. Now we have to insert the age data for each student, and for that we use the UPDATE command: UPDATE student_details SET age = 19 WHERE s_id = 1, and in a similar way you update the remaining rows, setting each student's age accordingly. After running the updates, we select everything again with SELECT * FROM student_details, and to end the batch we use the GO command. As you can see, we have updated the age data for all the students and selected all the data from student_details to display the final result, which now shows the name, the address, and the age. Don't forget to add the GO. The next thing we'll talk about is deleting a view. To delete a view, you simply run DROP VIEW IF EXISTS followed by the view's name, which is detail_view.
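The age-column change just walked through looks roughly like this as one sketch (the age values are illustrative):

```sql
-- Add a nullable age column to the existing table
ALTER TABLE student_details ADD age INT;

-- Backfill it one row at a time
UPDATE student_details SET age = 19 WHERE s_id = 1;
UPDATE student_details SET age = 20 WHERE s_id = 2;
UPDATE student_details SET age = 18 WHERE s_id = 3;
-- ...and so on for the remaining students

SELECT * FROM student_details;
GO
```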
Just click run and you can see our view has been dropped. Note that this command deletes only the view; the data in the original student_details table is not affected. Next is listing all the views. To list them, you simply run SHOW FULL TABLES WHERE Table_type = 'VIEW', and the output will give you all the views we have created. Now let's move on to the main part, which is the types of views in SQL. First, what is a simple view? A simple view is created from a single table; it's straightforward and doesn't involve complex logic like joins or subqueries. For example, to create one you just write CREATE VIEW student_names AS SELECT name FROM student_details, and then you can query the view with SELECT * FROM student_names. Again, don't forget the GO after the view definition and another GO at the end. Run it, and the output shows just the names. That's what you get with a simple view. Now let's move on to the second kind, a complex view. A complex view involves multiple tables or complex logic. Let's say we have another table, student_marks, that stores each student's marks.
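To recap the management commands and the simple view from this part. Note the listing syntax differs by engine: SHOW FULL TABLES is MySQL, while SQL Server exposes the sys.views catalog instead; this sketch shows both.

```sql
-- Delete a view; the underlying table keeps its data
DROP VIEW IF EXISTS detail_view;

-- List views: MySQL
SHOW FULL TABLES WHERE Table_type = 'VIEW';
-- List views: SQL Server equivalent
SELECT name FROM sys.views;

-- A simple view: one table, no joins or subqueries
CREATE VIEW student_names AS
SELECT name FROM student_details;
GO
SELECT * FROM student_names;
```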
So let's say we have created this student_marks table and inserted the data values: student ID 1 with marks 93, and so on; those are the student IDs and the marks we've given. Now let's create a view that pulls data from both student_details and student_marks. As you can see, we have created a complex view by providing all these details, and the output shows the student ID along with the marks. Now let's move on to the third type, the read-only view. A read-only view ensures that no one can modify the data through the view, which is useful when you want users to be able to see the data but not change it. To make a view read-only you use permissions in your SQL database, and this feature depends on your database engine. In SQL Server we cannot directly enforce read-only behaviour with the CREATE VIEW statement itself; however, you can control access to the view using permissions. Here's how you can do it: first create the view normally, then revoke INSERT, UPDATE, and DELETE permissions from users for that particular view, ensuring they can only read the data. That was all for the read-only view; let's move on to the fourth type we're discussing today, which is WITH CHECK OPTION. I'll type it here: WITH CHECK OPTION. In SQL it ensures that any insert or update operation performed through a view complies with the conditions specified in the WHERE clause of the view. This means you cannot insert or update records through the view that violate the view's own condition. So let's go through the creation of a view with WITH CHECK OPTION and look at an example with an explanation and the expected output.
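The three view types just described, sketched together. This is illustrative only: the student_results view name, the report_user principal, and the inserted names are made up, not from the video.

```sql
-- Complex view: joins student_details with student_marks
CREATE VIEW student_results AS
SELECT d.s_id, d.name, m.marks
FROM student_details AS d
JOIN student_marks   AS m ON m.s_id = d.s_id;
GO

-- "Read-only" view: enforced via permissions, not view syntax
REVOKE INSERT, UPDATE, DELETE ON student_results FROM report_user;

-- WITH CHECK OPTION: DML through the view must satisfy its WHERE clause
CREATE VIEW sample_view AS
SELECT s_id, name
FROM student_details
WHERE name IS NOT NULL
WITH CHECK OPTION;
GO

INSERT INTO sample_view (s_id, name) VALUES (6, 'Ravi'); -- passes the check
INSERT INTO sample_view (s_id, name) VALUES (7, NULL);   -- rejected: violates the view's condition
```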
As you can see, we have the expected output for the valid insertion. With the view creation we made a view named sample_view that selects s_id and name from the student_details table, but only where the name is NOT NULL; as you can see, we have clearly mentioned NOT NULL, and the WITH CHECK OPTION ensures that any insert or update through the view must comply with that condition. The first insert into sample_view, a student with ID 6 and a valid name, succeeds, and the output is generated here: the student ID and the name. So now we have learned about WITH CHECK OPTION as well. Next I'll be talking about materialized views. What exactly is a materialized view? Well, a materialized view is different from a regular view because it stores the actual data in the database, meaning the data is precomputed and doesn't need to be fetched from the base tables every time you query it. This makes accessing data from a materialized view much faster, especially for complex queries. As you can see, we created three tables: the order_details table, the product_details table, and the customer_details table, and as the second step the sample data is inserted into all three. Then the materialized view fast_order_summary is created to summarize orders, joining data from the three tables and computing the total cost. The next step is that the materialized view is queried, and because the data is precomputed it returns results very quickly. When new data is added to the order_details table, the materialized view is refreshed using REFRESH MATERIALIZED VIEW to include the new data; I hope you see why we are using the refresh. And the final step is that the materialized view is deleted using DROP MATERIALIZED VIEW.
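The materialized-view lifecycle just described, as a sketch in PostgreSQL-style syntax (SQL Server has no CREATE MATERIALIZED VIEW statement; its closest feature is an indexed view). The column names used for the join and the cost calculation are assumptions, since the demo tables' schemas aren't spelled out:

```sql
-- Precompute an order summary by joining the three demo tables
CREATE MATERIALIZED VIEW fast_order_summary AS
SELECT o.order_id,
       c.customer_name,
       p.product_name,
       o.quantity * p.price AS total_cost
FROM order_details    AS o
JOIN product_details  AS p ON p.product_id  = o.product_id
JOIN customer_details AS c ON c.customer_id = o.customer_id;

-- Reads are fast because the result is already stored
SELECT * FROM fast_order_summary;

-- After new orders arrive, recompute the stored result
REFRESH MATERIALIZED VIEW fast_order_summary;

-- Remove it when no longer needed
DROP MATERIALIZED VIEW fast_order_summary;
```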
In this way you can create a materialized view. Now you might be wondering what the difference is between a materialized view, a complex view, and a simple view; I'll get to that in a moment. First, let's discuss why views are so useful. By now you might be wondering what the big deal is with views, and here's why. They make your life easy: you don't have to keep writing complex queries over and over; you create a view once and it saves you tons of time. They help you simplify data: instead of pulling everything from your database, views help you narrow down exactly what you need. They help improve security: if you want to show only certain parts of the data to certain people, views let you control what others can see without letting them touch the raw data. And you can rename columns for clarity: you can rename confusing columns in the view without changing the original table, making the data easier for users to understand. So now let's discuss the differences between simple, complex, and materialized views. Here's a quick summary. A simple view pulls data from a single table, with no complex logic or joins involved; it doesn't store data, and for performance it executes its query every time. A complex view combines data from multiple tables using joins, aggregates, or other complex logic; it also doesn't store data and executes its query every time. A materialized view stores the result of a query, making it faster to retrieve data without running the query again; it stores data and is much faster, as it uses precomputed results. So now it's time for the quiz. Here's a quiz question for you: what is the key difference between a regular view and a materialized view in SQL? The first option is that regular views store data but materialized views don't. The second option is that materialized views store data but regular views don't.
Option C is that both regular and materialized views store data, and option D is that neither of them stores data. If you want to answer, you can just write it in the comment section below. And that's it: views are an incredible tool in SQL that can simplify your queries, improve security, and make your life easier. SQL Server is a powerful relational database management system developed by Microsoft, widely used for managing and storing data. Its benefits include high scalability, robust security features, and seamless integration with other Microsoft tools and technologies. SQL Server provides efficient data management through advanced features like indexing, full-text search, and in-memory processing. It also offers excellent support for large datasets, making it ideal for enterprise applications, and the built-in business intelligence tools help organizations gain valuable insights from their data. SQL Server's high-availability and disaster-recovery features ensure continuous operations with minimal downtime, and with strong data integrity and transactional support it ensures reliable and consistent data management across all applications. That said, if these are the kind of videos you'd like to watch, hit that like and subscribe button and the bell icon to get notified. So in today's session, the SQL Server tutorial, we will cover the SQL basics: how to create a table, how to insert data, how to retrieve the data from tables, and so on. Apart from that, we will also go through some of the other fundamentals of SQL, which include sorting in SQL Server, followed by the GROUP BY and ORDER BY clauses in SQL Server. Next, we will also learn another important part, conditional statements, which includes CASE statements in SQL. Proceeding ahead, we will get into another segment of today's session, which is about joins in SQL, where we will be combining two or more tables in SQL Server.
Following that, the next part of today's session is all about the HAVING clause in SQL Server; next we will proceed with the BETWEEN operator, and after that we'll get into pattern matching in SQL Server. Next we will cover the time and date functions available in SQL Server, then temporary tables, and proceeding ahead we have the most important part, which is about common table expressions in SQL Server. The last part of the SQL Server tutorial is about creating views and executing a query to extract the data present in a view. So far so good; these are the foundational skills you need before becoming a pro in SQL Server, and this tutorial will discuss those major foundational skills, the fundamentals and basics of SQL Server and its operations. Now, without further delay, let's get started with one of the compilers that can help us execute SQL Server queries. We are on one of the SQL Server compilers available online; in case you are facing any difficulty setting up SQL Server Management Studio on your PC, you can use this one. We've also made a tutorial where you can learn how to download, install, and configure SQL Server Management Studio, and the link to that tutorial will be dropped in the description box below; make sure to refer to it in case you want to execute these same queries in SQL Server Management Studio. First we will be dealing with two different data tables: customer data and dealership data. The first table will be the customer data, which has the order ID, order date, delivery date, dealership, pin code, product category, car fuel type, and so on.
Following that, we will insert some rows into that customer table, about 15 to 20 of them, and after that we have the dealership table, which has the order date, state, region, customer ID, customer name, primary and foreign keys, and so on, into which we will insert about 20 entries. Don't worry if you have many more entries; let's say 2,000 or 20,000 entries, it's not a big deal at all if you're working in SQL Server Management Studio, where you can use the import wizard to ingest the data from your source into SQL Server, or the SSIS tool to import all that data, and you're good to go. You don't have to manually create or insert the data; we're only going through this procedure for the sake of learning the basic process of creating a data table and inserting data. So far so good. We have also inserted the data into the next table, our dealership data, so now let's query the data from our tables. Select and execute, and there you go, we have the dealership table right here. Let's copy this query and paste it to query the data from the customer table as well; instead of dealership we write customer. If you observe closely, I'm using uppercase for keywords like SELECT and FROM, and sentence case or lowercase for the variables, table names, and so on, so that there's a clear difference between the keywords and the regular identifiers. There you go, we have executed the customer query and here is the data: the order ID, order date, delivery date, category, and so on. Now let's note down the use cases as comments; we can use double dashes, a hash, or the slash-style comment symbols, write the use case, and then proceed with executing the queries. The first use case: let's try to filter customers from specific regions. Now let's close this comment and proceed with writing the query.
We will use the SELECT keyword, and we want customers from a specific region. The customer name isn't in the customer table, but we do have the order ID, so not a problem: let's select the order ID, the dealership name, the state, the important part which is the region, the product they bought, and the revenue the dealership earned from that customer, FROM the customer table. The next line joins the dealership table using the JOIN keyword. We have one common column between both tables, the order ID, so we join ON order ID equals order ID; but that's not done yet, because since we are combining both tables we need to qualify the columns, writing dealership.order_id and customer.order_id, so that SQL Server identifies that we are trying to map two different tables and combine and extract the data. Now, we're trying to filter customers from a specific region, so we use the WHERE clause to specify the region; let's go with the West region, and since it's text we put it in quotes. Let's run this query, and it gives "ambiguous column name: order ID". Let's do one thing: we'll add the customer table prefix so it's no longer ambiguous. The thing is, we have order_id in both tables, the customer table and the dealership table, so SQL Server got confused about which table to extract order_id from; if you mention customer or dealership explicitly, it will pick that specific table and extract that column. Now we have the order ID and dealership, and all the rows selected are from the West region, as you can check here. Now let's proceed with the second use case of today's session; we'll continue in the same window so that it stays clearly visible for us.
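The first use case, with the qualified column that resolves the ambiguity error, can be sketched like this (table and column names follow the demo; exact spellings in the original script may differ slightly):

```sql
-- Filter customers from a specific region; order_id exists in both
-- tables, so it must be table-qualified to avoid "ambiguous column"
SELECT c.order_id,
       d.dealership_name,
       d.state,
       d.region,
       c.product,
       c.revenue
FROM customer   AS c
JOIN dealership AS d ON d.order_id = c.order_id
WHERE d.region = 'West';
```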
Now let's sort the products based on revenue; say the highest revenue should be at the top. I'll write the comment here: sort products by revenue. For the columns I'd like the product, so I'll eliminate everything else and keep product, and I want the revenue as well; actually not the raw revenue but the total revenue, and if you're looking for a total you go with an aggregate function, which is SUM. I want the sum of all the revenue each product has earned throughout that financial year. From which table? I have both product and revenue in the customer table itself, so I'll go with just the customer table; I don't need to join anything here, so I'll eliminate the JOIN command, and instead of the WHERE clause on region I'll use the GROUP BY command: GROUP BY product. That's not all: we want to order it, keeping it in descending order so that the highest-grossing product is at the top and the lowest-grossing product is at the bottom, which also helps us make sure our inventory is stocked with the products giving us high revenue. So we give SUM(revenue) an alias, total_revenue, and then use ORDER BY total_revenue DESC. This is how you sort the products. Now let's execute this query, and there you go: one car model is giving us the highest revenue, followed by the next two. Note that we are not mentioning actual car brands or actual car names, so that we don't face any copyright issues; we're just using some random names and not being too specific.
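That aggregation-plus-sort use case can be sketched as:

```sql
-- Total revenue per product, highest-grossing first
SELECT product,
       SUM(revenue) AS total_revenue
FROM customer
GROUP BY product
ORDER BY total_revenue DESC;
```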
Now let's proceed with our next query for the session, where we want to group by state and calculate total revenue. Let's edit the same comment; and don't worry, if you want this demo document we will link it with edit and view rights so that all viewers can have a quick glance and try these queries on their own local systems. So let me type: group by state and calculate total revenue. Now let's edit the same query. We want state-wise revenue, and we want it from the customer table, but we also want to join the dealership table for this one, because dealership is the table that has the state data; right here you can see we inserted state data like California and Texas. Since the state is present in the dealership table, we also need to perform a join. No worries, let's edit the same query: instead of product we select state, and we keep SUM(revenue) AS total_revenue, FROM the customer table, and we perform a join operation. Let's create some space: JOIN dealership (let's copy the name so we don't make any mistakes) ON the order ID, which is available in both tables. Remember the first step we did to eliminate the confusion: mention the table name, then a dot, then the column name, so it's ON dealership.order_id = customer.order_id.
Now we perform the GROUP BY operation; we don't want ORDER BY here, just GROUP BY state. Of course, in case you wanted to order by the highest-grossing state you could have used ORDER BY, but so far, according to the use case, we don't want that, so let's continue with the same execution. Click execute and you will have the answer: based on the states, you have their respective revenues. There you go. Now let's proceed with the next use case of today's discussion, which is about using conditional statements: a CASE clause for custom calculations. We'll perform some customized calculations. Let's say we want to flag any product giving revenue greater than 10,000 as high revenue; say we have revenues in terms of thousands of dollars and $10,000 is the benchmark, the minimum revenue you want to extract out of a product. If a product is not yielding at least 10,000 for your dealership, then it is not selling much, and you could free up some space in your dealership to import products that give you the highest revenue. So we want products with revenue between 5,000 and 10,000 labelled medium revenue, less than 5,000 labelled low revenue, and greater than 10,000 labelled high revenue. You understood the game, right? We have three segments, so you can eliminate the lowest-revenue products from your inventory. Now, we want the order ID, so copy it from above and paste it here; we are looking for the order ID and also the product. We don't want the SUM of revenue here, so let's eliminate that and add product instead. The FROM statement will come at the last; first we begin with the CASE.
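The state-level rollup just executed looks like this:

```sql
-- Total revenue per state; state lives in dealership, revenue in customer
SELECT d.state,
       SUM(c.revenue) AS total_revenue
FROM customer   AS c
JOIN dealership AS d ON d.order_id = c.order_id
GROUP BY d.state;
```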
WHEN the revenue is greater than 10,000 (let's use uppercase for the keywords), THEN mark it as 'High Revenue'. Copy and paste the same line: WHEN the revenue is BETWEEN 5,000 AND 10,000, using the AND operator, THEN label it 'Medium Revenue'. And for the last case I don't think we need another condition; instead we can just place ELSE, so ELSE it's termed 'Low Revenue'. There you go. Now, from which table do we extract all this data? From the customer table, so everything else goes away. Let's execute this code and see if we get any errors. "Incorrect syntax near CASE": I think we made a mistake here; we forgot a comma, and we also by mistake picked the ALTER keyword from the autocomplete suggestions. So far we missed a comma and a stray ALTER; fix those and let's try to execute, and if it faces some issues, no problem, we'll resolve it in a different way. Incorrect syntax again: we missed writing END; we did not end the CASE. Okay, fine, not a problem: END AS, and let's name the whole column revenue_category, since we are splitting into three revenue categories. There you go. This is the way to learn: make some mistakes so that you learn better and never repeat them next time. So far so good: we have all the car models categorized into revenue categories, high revenue and low revenue. So far all the cars fall into high revenue; not a bad deal, all of them are performing really well. Maybe if we change the numbers a little bit, say one lakh (100,000) in place of 10,000, then we might see a couple of other categories, but so far so good; this is how the query works.
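After fixing the missing comma and the missing END, the segmentation query looks like this:

```sql
-- Bucket each order's product into a revenue category
SELECT order_id,
       product,
       CASE
           WHEN revenue > 10000 THEN 'High Revenue'
           WHEN revenue BETWEEN 5000 AND 10000 THEN 'Medium Revenue'
           ELSE 'Low Revenue'
       END AS revenue_category
FROM customer;
```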
Now let's switch to our next query, the next use case, where we will be combining data from multiple tables using a join. We already did that, but still, for the sake of the learning experience, we will perform that operation again. We'll name this use case: combine data from multiple tables using a join. So far you already have good experience with how joins work, but still, let's do it once more. We will use the customer table here: customer.order_id, customer.product, and customer.revenue, and let's also take some data from the dealership table; we already know that the state data is in dealership, so we'll take dealership.state and also the region, dealership.region, FROM the customer table. Let's eliminate the CASE statement, and we want to JOIN; push this to the first line so we don't have confusion about how and where the query is going. So the second line is all about the join: JOIN dealership (let's copy the name so we don't make any mistakes), combining dealership with customer ON the order ID, in the same way taking order ID equals order ID. Now, which order ID equals which order ID? The first table's customer.order_id equals the second table's dealership.order_id.
That way SQL Server understands which columns from which tables are being joined here, and for what reason. So far so good: we have given the columns that we want to have in our output, and we are joining the two tables based on that criterion. But now maybe we can be more specific; we already wrote a query where we wanted data for a specific region, so let's try to continue that: WHERE region = 'West', closing the quote. Now what happens is it gives us the details of customer order ID, product, and revenue from the customer table, plus the state and region extracted from the dealership table, joined together, and we are specifically extracting the data where the region is West. Let's execute; don't worry if we find any errors in this code. There you go, we have the result: all the details from the West region. Let me expand this so that we have a better view; anyway, it's okay. Now let's go to the sixth query for today's session, where we will be executing a query based on the HAVING clause in SQL Server, and we will try to filter out some states: only show the states where total revenue is greater than 50,000. Just as we discussed before, we initially used a CASE statement with a $10,000 minimum revenue, but now let's increase the number to 50,000, and we want those states which are giving us a minimum of 50,000 in revenue. So let's rename the use case: filter data, or groups, with the HAVING clause. We want the state, and let's also count the order IDs as total orders.
customer.order_id becomes COUNT(order_id), the aggregation function here, aliased as total_orders; and revenue becomes SUM(revenue), aliased as total_revenue. We drop the product column since we don't need it. The aliases matter for clarity: if I don't alias them, the output just shows the raw aggregate expressions, which is fine for a data engineer or analyst who understands COUNT and SUM at a glance, but a business person who just wants to read the report understands "total orders" and "total revenue" much more easily. We extract this from the customer table and join dealership again, because the state column lives there, then add GROUP BY state. For the condition, total revenue of at least 50,000, we use the HAVING clause: HAVING SUM(revenue) > 50000. You might try the alias name total_revenue there, but the alias does not work inside HAVING, so we place the actual aggregate expression instead. On the first run we hit two issues: we had grouped by both region and state when we only wanted state, and order_id was flagged as an ambiguous column because both tables contain it, so we qualify it, and the customer table is the right one since it holds the order details. After those fixes the query runs: we get a good number of states whose revenue exceeds 50,000. Now let's proceed with the next query, using the range operator, also known as BETWEEN, in SQL Server; let's name the comment "using the range function, BETWEEN" (BETWEEN is the keyword). The use case: identify the records whose revenue is greater than 50,000 but less than 100,000 (one lakh, one followed by five zeros). We keep the query mostly the same, perhaps adding the product column, and in place of GROUP BY and HAVING we use a WHERE clause, because WHERE, not HAVING, is the right home for BETWEEN here. But writing WHERE SUM(revenue) BETWEEN ... raises an error: an aggregate may not appear in the WHERE clause unless it is in a subquery or HAVING clause. So let's simplify.
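The HAVING query from this section, cleaned up, might read as follows; names are taken from the transcript, so treat this as a sketch rather than the exact classroom code.

```sql
-- Total orders and revenue per state, keeping only
-- states with more than 50,000 in revenue.
SELECT
    d.state,
    COUNT(c.order_id) AS total_orders,
    SUM(c.revenue)    AS total_revenue
FROM customer AS c
JOIN dealership AS d
    ON c.order_id = d.order_id
GROUP BY d.state
HAVING SUM(c.revenue) > 50000;   -- the alias total_revenue is not allowed here
```

HAVING filters after grouping, which is why the aggregate may appear there but not in a plain WHERE clause.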
Let's not go with the join statement here; we eliminate the join and simplify. We don't need the state column, and we drop the COUNT, there was no need for counting, and that was one source of the error. We keep order_id, the product, the dealership name, and the revenue from the customer table, WHERE revenue BETWEEN those two numbers. Running the simplified query works: these are the dealerships and products yielding revenue between 50,000 and 100,000. The lesson is that the extra aggregate functions were what complicated the query. Now let's try some pattern matching. Sometimes we are looking for, say, sales from California, or for a particular car model, and we don't know the full name, or we might have a spelling mistake: we know California is there but only remember its first three or four letters. In such cases we can match the pattern we have in hand against the patterns available in the data table, and wherever SQL finds a match it extracts those rows; that is how pattern matching works. Let's build a query for a better understanding, changing the use case name to "using pattern matching in SQL Server": order_id, dealership name, and product from the customer table, with LIKE in the WHERE clause (its opposite is NOT LIKE). Checking the car models first: all car models share the same "car model" prefix and differ only by a letter, so it doesn't make sense for us
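The simplified range query can be sketched as follows (column names per the transcript):

```sql
-- Orders whose revenue falls in an inclusive range.
-- BETWEEN 50000 AND 100000 is shorthand for
-- revenue >= 50000 AND revenue <= 100000.
SELECT order_id, dealership_name, product, revenue
FROM customer
WHERE revenue BETWEEN 50000 AND 100000;
```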
to apply the LIKE operator to product. Instead, let's filter on fuel type, which will work better: WHERE fuel_type LIKE 'Hybr'. Let's imagine we don't know the full spelling of "hybrid" and only have the first four letters; since it is text, we wrap the pattern in single (or double) quotes, and rows whose fuel type matches the pattern will be pulled. On the first run we missed one important thing: the percent sign. The % wildcard tells SQL Server that the text should contain "Hybr" and that anything after it is accepted, so anything beginning with Hybr should be pulled out. Executing again with 'Hybr%' returns the data. Remember the % symbol: if instead you only knew a middle fragment, say "ybr", you could write '%ybr%', meaning anything before or after "ybr" is fine, and it yields the same hybrid results. And if you wanted petrol but didn't know the spelling, you could write 'Petr%': running that query returns all vehicles of fuel type petrol. That is how you use pattern matching in SQL Server. Now let's perform some calculations with dates; rename the comment to "date and time functions and some math calculations". Suppose you are the owner of a dealership, and for a specific car you are losing customers: initially it performed very well and many customers came for it, and right now that
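The three LIKE patterns discussed above, side by side; the `fuel_type` column name is an assumption based on the transcript.

```sql
-- Pattern matching with LIKE and the % wildcard.
SELECT order_id, dealership_name, product, fuel_type
FROM customer
WHERE fuel_type LIKE 'Hybr%';     -- starts with "Hybr"
-- WHERE fuel_type LIKE '%ybr%'   -- contains "ybr" anywhere
-- WHERE fuel_type LIKE 'Petr%'   -- starts with "Petr"
```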
specific car is receiving fewer orders, and you want to find out the reason. After a general survey you come to know that the number of days you take to deliver that car has grown: earlier you used to deliver it in two days, and right now you are taking something like two months. That might be the reason, but you want solid proof, the actual number of days, to show your sales team why delivery takes so long. For that: SELECT order_id, order_date, delivery_date; remember, order date and delivery date are of the date data type. To calculate the difference we use a function called DATEDIFF; there are two ways of pronouncing it, "date diff" or "dated-if", and I prefer "date diff", for date difference. Inside the parentheses we provide some details: first the unit we want to count, here day, then the order date and the delivery date, separated by commas, meaning we want the difference between those two. We can alias the result, say days_to_deliver, select from customer, and finish with a semicolon. Executing this shows the number of days that you or your sales team take to deliver each vehicle: on average about 10 days. Now you can prove this to your sales team and let them know it is not good; you want deliveries in at least four to five days, ideally two to three, and if this goes on, sales may drop and it could become a real problem.
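The delivery-time query from this use case, in SQL Server syntax (a sketch with the transcript's names):

```sql
-- Days between order and delivery for each order.
-- DATEDIFF(day, start, end) counts day boundaries crossed.
SELECT
    order_id,
    order_date,
    delivery_date,
    DATEDIFF(day, order_date, delivery_date) AS days_to_deliver
FROM customer;
```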
Now let's proceed with the next use case of today's session: temporary tables. Why temporary tables? There are situations where you just have to run some numbers, nothing too critical, and you don't want to harm the original table. In such scenarios you can create a temporary table, a copy of the table held in intermediate memory storage; as soon as you close the studio session it fades away, and nothing happens to your original data table. That is where temporary tables come into the picture. A simple use case: SELECT order_id, product, revenue INTO a temp table, say #temp_orders, FROM customers (you can keep FROM on the same line or a new one; I'll use a new one), WHERE revenue is above a threshold. Remember, you just want to explore a few things that are not mandatory, which is why a temporary table is the right choice, and here we want to find which products give us revenue greater than $100,000 (one lakh). Running the statement succeeds, but we don't know where the result is, so we write a simple SELECT * FROM our temporary table and run it. The first attempt raised an "invalid object name: temp_orders" error, so there is something wrong with this query; let's see, did we miss
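The temp-table pattern described here, sketched in SQL Server syntax (the `#` prefix makes the table session-local, so it disappears when the session closes):

```sql
-- Copy a subset of customer rows into a temporary table
-- so experiments never touch the original data.
SELECT order_id, product, revenue
INTO #temp_orders
FROM customers
WHERE revenue > 10000;

SELECT * FROM #temp_orders;   -- inspect the copy
```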
anything? WHERE revenue is greater than the number, yes, but we had a stray semicolon issue; after fixing that, let's execute again. With the threshold at one lakh there was no product above 100,000 in sales for a single product; that was the "problem". Lowering it to 10,000 works: these are the products giving us at least 10,000 in sales. Now let's proceed with the next important part of today's session: Common Table Expressions. Remember, Common Table Expressions, also known as CTEs, play a very major role in real-time data analytics. Let's create a simple one and learn how it technically works. The key difference is that a CTE starts with the keyword WITH; after that you name your common table expression, I'll call it sales_data, then AS and an open parenthesis, and inside the parentheses you write your actual query. Let's organize a few things: remove the temp-table parts, and select the columns we want, state (we don't need order_id), and the revenue, summed as SUM(revenue) aliased total_revenue, from customer. Then a JOIN with the dealership data on the common order ID, spelling out which order ID belongs to which table: ON dealership.order_id = customer.order_id.
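The full CTE pattern this section builds, completed with the follow-up query (a sketch using the transcript's names):

```sql
-- A CTE names an intermediate result, then a second query uses it
-- exactly like a table.
WITH sales_data AS (
    SELECT
        d.state,
        SUM(c.revenue) AS total_revenue
    FROM customer AS c
    JOIN dealership AS d
        ON d.order_id = c.order_id
    GROUP BY d.state
)
SELECT state, total_revenue
FROM sales_data
WHERE total_revenue > 50000;  -- the alias works here: aggregation
                              -- already happened inside the CTE
```

Note the contrast with the earlier HAVING example: here the outer WHERE may reference total_revenue because, from the outer query's point of view, it is just a column of sales_data.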
Since we took state, we add GROUP BY state; and this is not the end of everything, so the semicolon has to wait. That whole first part, from WITH through the closing parenthesis, is the CTE: the output of that query is stored under the name sales_data. Now we make use of it with another SELECT: select state and total_revenue (reusing the exact alias we created, so we don't make a mistake) from the CTE we created very recently, sales_data, WHERE total_revenue crosses $50,000, and now comes the semicolon. Executing this lists the states whose total revenue exceeds $50,000. There you go. What is PostgreSQL? PostgreSQL is an open-source object-relational database management system; it stores data in rows, with columns as the different data attributes. According to the DB-Engines ranking, PostgreSQL is currently ranked fourth in popularity among hundreds of databases worldwide. It allows you to store, process, and retrieve data safely, and it was developed by a worldwide team of volunteers. Now a little history: from 1977 onward, the Ingres project was developed at the University of California, Berkeley. In 1986 the POSTGRES project was led by Professor Michael Stonebraker; in 1987 the first demo version was released, and in 1994 a SQL interpreter was added to Postgres. The first PostgreSQL release was known as version 6.0, on January 29, 1997, and since then PostgreSQL has continued to be developed by the PostgreSQL Global Development Group, a diverse group of companies and many thousands of individual contributors. Now let's look at some of the important features of PostgreSQL.
PostgreSQL is the world's most advanced open-source database and is free to download. It is compatible with multiple operating systems, including Windows, Linux, and macOS. It is highly secure, robust, and reliable. PostgreSQL supports multiple programming interfaces such as C, C++, Java, and Python, and is compatible with a wide range of data types: it works with primitives like integer, numeric, string, and boolean; supports structured types such as date/time, array, and range; and can also work with documents such as JSON and XML. Finally, PostgreSQL supports multiversion concurrency control, or MVCC. With this theory covered, here is what we will cover in the demo: the basic commands SELECT, UPDATE, and DELETE; filtering data with the WHERE and HAVING clauses; grouping data with GROUP BY and ordering results with ORDER BY; dealing with NULL values; the LIKE operator and the logical operators AND and OR; some of the popular built-in mathematical and string functions; and finally some advanced PostgreSQL concepts, namely CASE statements, subqueries, and user-defined functions. So let's head over to the demo. First we'll connect to PostgreSQL using the psql shell: under "Type here to search" I search for psql, open the SQL Shell, and maximize it. For Server I press Enter to accept the default, likewise for Database; the port number is already set to 5432, so I hit Enter; the username is already given; and now it asks for the password, which I enter to connect to my PostgreSQL database. It gives a warning, but we have successfully connected to PostgreSQL. To check that everything is fine, you can run a simple command to display the version of PostgreSQL you have loaded. The command is
SELECT version(), with the empty parentheses and a semicolon. I hit Enter, and you can see the version: PostgreSQL 13.2. Now, the command that displays all existing databases: if I type \l and hit Enter, it gives me the list, postgres, template0, template1, and a test database as well. For our demo I'll create a new database: CREATE DATABASE, then the name, sql_demo, a semicolon, and Enter; the message "CREATE DATABASE" confirms it was created successfully. To connect to that database you can use \c sql_demo; it says "You are now connected to database sql_demo". Here we can now create tables and perform insert, select, update, delete, alter, and much more. Next, let me show you how to connect to PostgreSQL using pgAdmin. When you install PostgreSQL you get the SQL Shell and, along with it, pgAdmin. I search for "pg", and it prompts pgAdmin; clicking Open launches it in a web browser, here Chrome. The pgAdmin interface is quite basic: at the top are the File, Object, Tools, and Help menus; there are tabs for Dashboard, Properties, SQL, Statistics, and Dependencies/Dependents; and on the left panel you have Servers, which I expand so it connects to the databases. Recall that when I ran \l it had shown the postgres and test databases; you can see both here, and we also created one more database, sql_demo. Now let me show you how to work with pgAdmin and the query tool.
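The psql session so far can be summarized in a few lines; the backslash commands are psql meta-commands, not SQL, so they are shown as comments here.

```sql
SELECT version();        -- show the server version, e.g. PostgreSQL 13.2
-- \l                    -- psql meta-command: list all databases
CREATE DATABASE sql_demo;
-- \c sql_demo           -- psql meta-command: connect to sql_demo
```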
So I'll right-click on sql_demo and select Query Tool; let me show you how to run a few commands there. Say you want to see the PostgreSQL version you are using: you can use the same command as in the psql shell, SELECT version(); with a semicolon. Select it and press the Execute button, or F5, to run the query; the output at the bottom says PostgreSQL 13.2, compiled by Visual C++, 64-bit. Now a few basic operations with PostgreSQL commands. I'll write SELECT 5 * 3; select it, and hit F5: it returns the product of 5 and 3, which is 15. Editing it to SELECT 5 + 3 + 6; and running it gives the sum, 14. You can do the same in the shell: SELECT 7 * 10; gives 70, as expected (we'll deal with the "?column?" header later). Back in pgAdmin, one more operation: SELECT 5 * (3 + 4); here SQL first evaluates the expression inside the brackets, 3 + 4 = 7, and then multiplies 7 by 5; selecting and executing shows 35. Now we'll go back to our shell, where I'll show you how to create a table called movies in the psql shell: we will create the table and then enter some data into it. Let me just scroll
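The arithmetic demos above boil down to the fact that SELECT can evaluate plain expressions, with parentheses controlling the order of evaluation:

```sql
SELECT 5 * 3;        -- 15
SELECT 5 + 3 + 6;    -- 14
SELECT 5 * (3 + 4);  -- 35, because (3 + 4) is evaluated first
```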
down a bit. My CREATE command goes something like this: CREATE TABLE followed by the table name, movies. The movies table will have a few columns, and after each column name we give its data type. movie_id I'll keep as integer, one of the data types provided by PostgreSQL. The second column is the name of the movie: column names should follow SQL standards, with no spaces, so I use an underscore for readability, movie_name, of type varchar (variable, or varying, character) with size 40, so it can hold at most 40 characters. The third column holds the genre: movie_genre, again varchar, size 30. The final column holds the IMDb rating: imdb_ratings, of type real, since it can have floating- or decimal-point values. I close the bracket, give a semicolon, and hit Enter: we have successfully created a table called movies. Back in pgAdmin, I right-click the sql_demo database, click Refresh, and go to Schemas; scrolling down, under Schemas there is Tables, and expanding it shows the movies table now in the sql_demo database. Under Columns you can check what we just added: movie_id, movie_name, movie_genre, and imdb_ratings. Now, there is another way to create a table: the previous one we created using the SQL shell, and this time I'll show you how to create one using pgAdmin. Under Tables I right-click, and I have the option to create a table, so I select Table. It asks me to give
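The movies table as built in the psql shell:

```sql
CREATE TABLE movies (
    movie_id     integer,
    movie_name   varchar(40),   -- varying character, max 40 chars
    movie_genre  varchar(30),
    imdb_ratings real           -- real allows decimal values like 8.3
);
```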
the name of the table: this time we are going to create a table called students, so I write students and leave the other settings at their defaults. On the Columns tab you can create as many columns as you want; on the right there is a plus sign, which I select to add a new row. My first column is the student roll number, student_roll_number (again, the column name follows SQL standards), with data type integer. You can also give constraints: I set NOT NULL, so the roll-number column will not hold any null values, and I also check Primary Key, which means all roll-number values will be unique. Clicking the plus sign again, my second column is the student name, student_name, of type character varying; you can specify the length as well, say 40. One more click on the plus sign for the final column, gender, which I keep as type character. Now click Save, and that successfully creates the students table: on the left panel, where earlier we had only the movies table, we now have two, students having been added. Expanding it, under Columns you can see the three columns, student_roll_number, student_name, and gender, and under Constraints you can check any constraints, here the one primary key on the roll number.
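The table built through the pgAdmin dialog corresponds to roughly this SQL; this is a sketch of what the dialog generates, not pgAdmin's exact output.

```sql
CREATE TABLE students (
    student_roll_number integer NOT NULL PRIMARY KEY,  -- unique, no nulls
    student_name        character varying(40),
    gender              character
);
```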
Now let me run a SELECT statement to show the columns in the movies table: SELECT * FROM movies; and at the bottom we see movie_id, movie_name, movie_genre, and imdb_ratings. The next command is how to delete a table. One way is the SQL command DROP TABLE followed by the table name: if you want to delete students, DROP TABLE students; will remove it from the database, and you just select and run it. The other way is to right-click the table name, where you have Delete/Drop; selecting it gives the prompt "Are you sure you want to drop table students?", and clicking Yes deletes the students table. Now let's perform a few more operations and learn a few more PostgreSQL commands; to do that I'm going to insert a few records into my movies table with the INSERT command. I have the insert query written in a notepad, so I copy it and paste it into the query editor: INSERT INTO, the table name movies, the column list (movie_id, movie_name, movie_genre, imdb_ratings), and then the records. The first record is movie ID 101, the very popular movie Vertigo, genre mystery and romance, with its current IMDb rating of 8.3; similarly we have The Shawshank Redemption, 12 Angry Men, The Matrix, Se7en, Interstellar, and The Lion King, eight records in total to insert into the movies table. Selecting it and hitting Execute reports eight rows inserted successfully, and running SELECT * FROM movies; shows "8 rows" at the bottom with all eight records of information. If you want to describe the table, go to the SQL shell and type \d movies: this describes the table, listing the column names, data types, and whether there are any null rules or constraints such as default, primary key, or foreign key. Back in pgAdmin: first, how to update records in a table. Suppose in an existing table you entered some wrong values by mistake; you can correct them later with an UPDATE query. I'm going to update the movies table and set the genre of movie ID 103, which is 12 Angry Men, from just drama to drama and crime. I write UPDATE followed by the table name, movies; on the next line SET, the column name movie_genre, equals 'Drama, Crime' (earlier it was only drama); and the condition with a WHERE clause, which we will learn properly in a bit: WHERE movie_id = 103. Since movie_id is the unique identifier, it first looks for movie ID 103, locates that movie, and changes the genre to drama and crime. Running the update statement reports one record updated; running the SELECT again and scrolling down shows movie ID 103, 12 Angry Men, with the genre successfully updated to "Drama, Crime". Now, deleting records from a table uses the DELETE command: DELETE FROM movies WHERE movie_id = 108; which is The Lion King. That is one way; you could instead write WHERE movie_name = 'The Lion King'. Selecting and executing it, then running the SELECT query again, returns seven rows this time, and you cannot find the movie with ID 108: we have deleted The Lion King. Next we are going to learn the WHERE clause in PostgreSQL, using the same movies table. Let's say we want to filter only
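The three data-modification commands from this part of the demo, condensed into one block (the INSERT shows only the first of the eight records as an example):

```sql
-- Insert, update, and delete rows in the movies table.
INSERT INTO movies (movie_id, movie_name, movie_genre, imdb_ratings)
VALUES (101, 'Vertigo', 'Mystery, Romance', 8.3);

UPDATE movies
SET movie_genre = 'Drama, Crime'
WHERE movie_id = 103;            -- 12 Angry Men

DELETE FROM movies
WHERE movie_id = 108;            -- The Lion King
```

Without a WHERE clause, UPDATE and DELETE touch every row in the table, so the filter is essential.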

    those records for which the IMDB ratings of the movies is greater than 8.7 so this is my updated table now I want to display only those records or those movies whose IMDB ratings is greater than 8.7 so we’ll display 12 angry man which is 9 then we are going to display The Dark Knight which is again 9 and we are also going to display the sank Redemption which has 9.3 the rest of the movies have and IM to be rating less than 8.7 so we are not going to display those all right right so let me show you how to write a wear Clause so I’ll write select star from movies where I’ll give my column name that is IMDB ratings is greater than I’ll use the greater than symbol then I’ll pass my value that is 8.7 I’ll give a semicolon and let’s run it I’ll hit F5 there you go so we have returned the sashank Redemption The Dark Knight and 12 Angry Men because only these movies had IMDB ratings greater than 8.7 okay now let’s say you want to return only those movies which have IMDB ratings between 8.5 and 9 so for that I’m going to use another operator called between along with the wear Clause so let me show you how to use between with wear clause I’ll write select star from movies where my IMDb underscore ratings is between I’ll write 8.5 I’ll give an and operator and 9.0 so all the movies that are between 8.5 and 9.0 ratings will be displayed so let’s select this and I’ll run it there you go so we have returned the Dark Knight The Matrix the seven interal and we have the 12 Angry Men so a few of the records that we missed out where I think vertigo which has 8.3 and there’s one more all right now moving ahead let’s say you want to display the movies whose movie genner is action you can see in a table we have a few movies whose genre is action movie so you can do that as well I’ll write select star from movies where the movie genre I’m writing this time in one line you can break it into two lines as well I’ll write moviecore Jer which is my column name equal to I’ll give within 
Why single quotes? Because Action is a string, and strings need to go inside single quotes. If I run this, there you go: we had one movie in our table whose genre was Action, and that is The Dark Knight.

Okay, now you can also select particular columns from the table by specifying the column names. In all the examples we just saw we used *, and * means select all the columns in the table. If you want specific columns, you list the column names in the SELECT statement. Let me show you: say you want to display the movie name and the movie genre from the table. You can write SELECT movie_name, then the next column, movie_genre, FROM my table name, movies, WHERE, let's say, imdb_ratings is less than 9.0. This time the result will show only two columns, movie name and movie genre. Let me run it. There you go: these are the movie names and genres with an IMDb rating below 9.0.

All right, like the BETWEEN operator you saw, there is one more operator you can use with the WHERE clause: the IN operator. The IN operator works like a chain of OR conditions. Let's say I want to select all the columns from my movies table where the IMDb rating is IN (8.7, 9.0); if I run this, it displays only those records whose rating is 8.7 or 9.0.

All right, so up to now we have looked at basic operations in SQL: mathematical operations, how a SELECT statement works; we created a few tables, inserted records into them, saw how to delete a table from the database, performed operations like UPDATE and DELETE, and saw how a WHERE clause works. Now it's time to load an employee CSV file into PostgreSQL. I'll tell you how to do that, but first, before loading or inserting the records, we need to create an employees table.
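To recap the SELECT variations covered so far, here is a short sketch (column names assumed from the transcript):

```sql
-- Project only the columns you need instead of SELECT *
SELECT movie_name, movie_genre
FROM movies
WHERE imdb_ratings < 9.0;

-- IN works like a chain of ORs over a list of values
SELECT * FROM movies
WHERE imdb_ratings IN (8.7, 9.0);

-- Strings are compared inside single quotes
SELECT * FROM movies WHERE movie_genre = 'Action';
```

The IN query is equivalent to `imdb_ratings = 8.7 OR imdb_ratings = 9.0`, just more compact.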
So let me first go ahead and create a new table called employees in our SQL demo database. I'll write CREATE TABLE, then the name of the table, employees, and next the column names. My first column is the employee ID, of type integer; it is not going to contain any null values, so I'll write NOT NULL, and I'll give the constraint PRIMARY KEY. The employee ID, as you know, is unique for every employee in a company, so making it the primary key ensures there is no repetition in the employee IDs. Next I'll have the employee name, of type varchar, with a size of 40. Then the email address of the employee, again varchar of size 40. Another comma, and this time the gender of the employee, varchar of size, let's say, 10. Now let's include a few more columns: the department column, varchar(40); another column called address, which will hold the employees' country names, also varchar; and finally the salary of the employee, which I'm going to keep as type real, so it can hold decimal or floating-point values. Now let me select this CREATE TABLE statement and execute it. All right, we have successfully created our table. If you want, you can check with SELECT * FROM employees: let me select this and hit Execute. You can see we have the employee ID as primary key, then the employee name, email, gender, department, address, and salary, but we don't have any records yet in any of these columns. Now it's time for us to insert a few records into the employees table, and to do that I'm going to use a CSV file, so let me show you what the CSV file looks like.
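The CREATE TABLE statement built up piece by piece above would look roughly like this as one statement (the varchar sizes follow the transcript; the size for address is not stated there, so 40 is an assumption):

```sql
CREATE TABLE employees (
    emp_id     integer NOT NULL PRIMARY KEY,  -- unique per employee
    emp_name   varchar(40),
    email      varchar(40),
    gender     varchar(10),
    department varchar(40),
    address    varchar(40),                   -- holds country names
    salary     real                           -- allows decimal values
);
```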
Okay, I am now in Microsoft Excel, and at the top you can see this is my employee data.csv file. Here we have the employee ID, employee name, email, gender, department, address, and salary. This data was generated using a simulator, so it is not validated, and you can see it has a few missing values: under the email column a few employees don't have an email ID, and under department there are some missing values as well. We'll be importing the records in this CSV file into PostgreSQL.

So here, in the left panel, under Tables, let me right-click and first refresh. There you go: initially we had only the movies table, and now we also have the employees table. Now I'll right-click again, and here you see we have the option to import or export. Let me click on it; I don't want to export, I need to import, so I'll switch it to Import. Now it is asking for the file location, so let me show you where the file lives: my Excel file is on my E: drive, under the Data Analytics folder; inside it I have another folder called PostgreSQL, and within that is my CSV file, employee data.csv. I'll just select it; you can type the path like this or browse to it. My format is CSV, I'll set Header to Yes, and then let me go to Columns and check that everything is fine. All right, I have all my columns; let's click OK. You can see a message here which says Import/Export successfully completed. We can verify this with SELECT * FROM employees again: if I run this, it says 150 rows affected, which means we have inserted 150 rows of information into our employees table. You can see the employee IDs, all unique, then the employee names, emails, addresses, and salaries. Let me scroll down: 150 rows of information, which means we have 150 employees in our table.

Okay, now we are going to use this employees table to explore some more advanced SQL commands. There is a keyword called DISTINCT. Say I write SELECT address FROM employees: this is going to give me the addresses of all 150 employees. There's a small problem here, I made a spelling mistake, there should be another d; if I run it again, the query returns 150 rows, and you can see the different country names under address: Russia, France, the United States, Germany, and I think Israel as well. Now suppose you want to display only the unique addresses, or country names: you can use the DISTINCT keyword before the column name. If I write SELECT DISTINCT address FROM employees, it will display only the unique country names present in the address column. If I run this, see, it has returned six rows of information: Israel, Russia, Australia, United States, France, and Germany.

All right, now as I said, there are a few null values with no information, and you can use IS NULL in SQL to find all the rows where a value is missing.
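The transcript does the import through pgAdmin's Import/Export dialog; as a sketch of the command-line alternative, psql's \copy meta-command does the same job (the file path here is only illustrative):

```sql
-- Run inside psql; reads the file from the client machine
\copy employees FROM 'E:/Data Analytics/PostgreSQL/employee data.csv' WITH (FORMAT csv, HEADER true)

-- Verify the load and list the unique countries
SELECT count(*) FROM employees;
SELECT DISTINCT address FROM employees;
```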
Suppose I want to display all the employees where the email ID has a null value: I'll write SELECT * FROM employees WHERE email IS NULL, which is another way to use the WHERE clause. If I select and run this, there you go: for all these employee names there was no email ID in the table. It has returned 16 rows of information, so around 10% of employees do not have an email ID, and if you look, a few of them have no email ID and no department either. If you want the employees who have no department, you can just write WHERE department IS NULL instead of WHERE email IS NULL. If I run this, it returns nine rows of information, which means around 6% of employees do not have a department.

Moving ahead, let me show you how the ORDER BY clause works in SQL. ORDER BY is used to sort your result, say in ascending or descending order. Let's say I want to select all the employees from my table: I'll write SELECT * FROM employees ORDER BY, and since I want to order the employees based on their salary, I'll write ORDER BY salary. Let me select and run it. Okay, there is some problem, I made a spelling mistake, this should be employees; let me run it again. Now if you look at the output, the result has been sorted in ascending order: the employees with the lowest salaries appear at the top and the employees with the highest salaries appear at the bottom, because PostgreSQL sorts in ascending order by default. Now let's say you want to display the salaries in descending order, so that the top-earning employees appear first: you can use the DESC keyword, which means descending. If I run this, you can see the difference now: all the employees with the highest salaries appear at the top, while those with the lowest salaries appear at the bottom. So that is how you can use an ORDER BY clause.
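The null checks and sorting just described, gathered into one sketch:

```sql
-- Rows with missing values
SELECT * FROM employees WHERE email IS NULL;
SELECT * FROM employees WHERE department IS NULL;

-- ORDER BY sorts ascending by default; add DESC for descending
SELECT * FROM employees ORDER BY salary;        -- lowest salaries first
SELECT * FROM employees ORDER BY salary DESC;   -- highest salaries first
```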
Okay, so now I want to make a change to my existing table. Here, under the address column, we only have country names, so it would be better to rename the address column to country. To rename a column you can use the ALTER command in PostgreSQL, so let me show you how to rename this address column. I'll write ALTER TABLE followed by the table name, employees, then RENAME COLUMN address TO country. If I give a semicolon and hit Execute, it changes my column name to country. You can verify this by running the SELECT statement again: there you go, earlier it was the address column and now we have successfully changed it to country.

Okay, let me come down; now it's time for us to explore a few more commands. This time I'm going to tell you how the AND and OR operators work in SQL. You can use AND and OR along with the WHERE clause. Let's say I want to select the employees who are from France and whose salary is less than $80,000; let me show you how to do it. I'll write SELECT * FROM employees WHERE, and since I'm going to give two conditions, I'll use the AND operator: WHERE country = 'France'. Note that I'm not using address here, because we just updated the table and changed the column name from address to country. AND my next condition: the salary needs to be less than 80000. I'll give a semicolon and run it. All right, it has returned 19 rows of information: all the country values are France and every salary is less than $80,000. So that is how you can give multiple conditions in a WHERE clause using the AND operator. Now let's say you want to use the OR operator, say to find the employees who are from Germany or whose department is Sales. I'll write SELECT * FROM employees WHERE country
= 'Germany', and instead of AND I'm going to use OR: their department should be 'Sales'. Okay, let's see the output; I'll hit F5 this time to run it. All right, we have 23 rows of information. Let me scroll to the right: for each row either the country is Germany or the department is Sales. For the first record the country was Germany, for the second the department was Sales, Sales again for the third, and for the fourth record the country is Germany. That is how the OR condition works: if either of the conditions is true, the row is returned; it need not be that both conditions are satisfied.

Now, PostgreSQL has another feature called LIMIT. LIMIT is an optional clause on the SELECT statement, used as a constraint that restricts the number of rows returned by the query. Suppose you want to display the top five rows in a table: you can use the LIMIT clause. And suppose you want to skip the first few rows and then display the next five: you can do that using LIMIT and OFFSET. So let's explore how LIMIT and OFFSET work. I'll write SELECT * FROM employees, then my ORDER BY clause, ORDER BY salary, let's say descending, and LIMIT 5. This is going to display the five employees with the highest salaries. If I run it, there you go: it has given us five rows of information, and these are the top five earners. That is one way of using the LIMIT clause. In case you want to skip a number of rows before returning the result, you can add the OFFSET clause: I'll write SELECT * FROM employees ORDER BY salary DESC LIMIT 5 OFFSET 3. What this query does is skip the first three rows and then return the next five. If I run this, there you go, and this is how the result looks.
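The two row-limiting variants side by side (a sketch; the ORDER BY matters here, because without it the "top five" rows are not well defined):

```sql
-- Five highest salaries
SELECT * FROM employees ORDER BY salary DESC LIMIT 5;

-- Skip the first three rows, then return the next five
SELECT * FROM employees ORDER BY salary DESC LIMIT 5 OFFSET 3;
```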
Okay, now there is another clause, called FETCH; let me show you how that works. I'll copy my previous SQL query, paste it here, and after DESC I'm going to write FETCH FIRST 3 ROWS ONLY. FETCH is going to give me the first three rows from the top. There you go: it has returned the first three rows, the three employees with the highest salaries, since we ordered by salary descending. All right, you can also use OFFSET along with the FETCH clause. I'll copy this again and paste it, and after DESC I'm going to write OFFSET 3 ROWS FETCH FIRST 5 ROWS ONLY. What this SQL query will do is skip the first three rows of information and then display the next five; it works exactly the same as the LIMIT/OFFSET query we saw a moment ago. Let me run it: there you go, these are the next five rows after excluding the top three.

All right, we have another operator in PostgreSQL, called LIKE. LIKE is used for pattern matching. Suppose you have a table of employee names and you have forgotten an employee's full name but remember the first few letters: you can use the LIKE operator to get an idea of which employee it is. Let's explore some examples to learn how LIKE works in PostgreSQL. Suppose you want to know the employees whose name starts with A. I want to display the employee name and, let's say, the email ID, from the table employees, WHERE the employee name LIKE, and the pattern goes within single quotes: the letter a followed by a percent sign. This means the name should have an a at the beginning, and the percent sign says any other letters can follow, but the starting letter should be
a. If I run this, there is an error: the name of the table is employees, not employee. Let's run it again. There you go: there are 16 employees in our table whose name starts with A; in the employee name column all of them begin with the letter A.

Okay, now let me just copy this query and paste it here. Let's say this time you want the employees whose name starts with S: instead of a I'll write s, which means the first letter should be S, followed by anything. If I run this, there are 10 employees in the table whose name starts with S. Okay, let's copy the query again, and this time I want the employees whose name ends with d. The way to do it is to write the percent sign first: '%d', which means the name can begin with any letters, but the last letter must be d. Let me run this: there are 13 employees in the table whose name ends with d; you can see it here. All right, now let's say you want to find the employees whose name contains "ish". The way to do it is to copy this and replace the pattern with '%ish%': the name can start with any letters and end with any letters, but "ish" must appear somewhere within it. Let me run it and show you: there is one employee whose name contains "ish"; you can see the "ish" in the last name. All right, now suppose you want to find the employee names which have u as the second letter: the name can begin with any letter, but the second letter must be u. The way to do it is to write '_u%' instead of 'a%'. You can think of the underscore as a blank that matches exactly one character, so the name can begin with any single letter.
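The LIKE patterns walked through above, collected into one sketch (column names assumed from the transcript):

```sql
SELECT emp_name, email FROM employees WHERE emp_name LIKE 'A%';    -- starts with A
SELECT emp_name, email FROM employees WHERE emp_name LIKE 'S%';    -- starts with S
SELECT emp_name, email FROM employees WHERE emp_name LIKE '%d';    -- ends with d
SELECT emp_name, email FROM employees WHERE emp_name LIKE '%ish%'; -- contains "ish"
SELECT emp_name, email FROM employees WHERE emp_name LIKE '_u%';   -- second letter is u
```

Bear in mind that LIKE is case-sensitive in PostgreSQL, so 'a%' and 'A%' match different names; ILIKE is the case-insensitive variant.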
The second letter should then be u, followed by any other letter or letters. Let me run this: there are 10 employees in the table whose name has u as the second letter; you can see them here.

Okay, now moving ahead, let me show you how to use some basic built-in SQL functions. We'll explore a few mathematical functions now. Let's say you want to find the total salary of all the employees: for that you can use the SUM function available in SQL. I'll write SUM and, inside the function, my column name, salary, FROM my table name, employees. Let's see the result; this returns a single value. There you go, the total salary; since the value is very large it is shown in scientific notation. One thing to note here: if you look at the output, the column header just says "sum", so the output column is not really readable. SQL has a feature that fixes this, called an alias. Since we are summing the salary column, we can give this operation an alias using the AS keyword: if I write SUM(salary) AS total_salary, then that becomes my output column. You can see the difference if I run it: the output now shows total_salary, which is much more readable than before. So aliasing is a feature in SQL that lets you give readable names to your columns or results. Now similarly, let's say you want the average salary of all the employees: SQL has a function called AVG which calculates the mean. If I write AVG, I can edit my alias name as well, say to mean_salary. Let's run it: the average salary across all employees is around $81,000. Okay, now there are two more important functions SQL provides, MAX and MIN. I'll write SELECT MAX, which is the function name, of salary, AS, and instead of total I'll write maximum as the alias.
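The aggregate functions introduced above, with aliases, as one sketch:

```sql
SELECT sum(salary) AS total_salary FROM employees;  -- grand total
SELECT avg(salary) AS mean_salary  FROM employees;  -- arithmetic mean
SELECT max(salary) AS maximum      FROM employees;  -- highest value
SELECT min(salary) AS minimum      FROM employees;  -- lowest value

-- Distinct non-null department names
SELECT count(DISTINCT department) AS total_departments FROM employees;
```

Note that count(DISTINCT …) ignores NULLs, so a missing department does not add to the count.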
This will return the maximum salary among the employees; let's run it and see the highest value present in the salary column. All right, we have 9,616 as the highest salary of one of the employees. Similarly you can use the MIN function: I'll just write MIN, which returns the minimum salary of one of the employees in the table, and I'll change the alias to minimum. Okay, now run it: the minimum salary present in our table is $4,685. Okay, now let's say you want to find the count of departments in the employees table: you can use the COUNT function. If I write SELECT COUNT, and since I want the distinct department names I'll write COUNT(DISTINCT department) AS total_departments FROM employees, and run it, it returns the total number of departments there are. Now let me show you one more thing here: if I write SELECT department FROM employees and run it, it has returned 150 rows of information. But what I'm going to do is place my DISTINCT keyword just before the column name, so that I can verify how many departments there are in total. There you go: there are 13 departments, and one of them is null. So, moving ahead, we'll replace this null with a department name by updating the table. Okay, so now let's update our department column: wherever the department has a null value, we are going to assign a new department called Analytics. Earlier we learned how to use the UPDATE command, and I'm going to show it again. We'll write UPDATE followed by the table name, employees, then SET the column, department, equal to, within single quotes, my department name, Analytics, WHERE department IS NULL. So wherever the department has a null value, we replace it with the Analytics department. Let's run this.
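The null-replacement step just described, as a sketch:

```sql
-- Replace missing departments with 'Analytics'
UPDATE employees
SET department = 'Analytics'
WHERE department IS NULL;

-- Verify: no NULL should remain in the distinct list
SELECT DISTINCT department FROM employees;
```

As a design note, COALESCE(department, 'Analytics') in a SELECT would achieve the same substitution at read time without changing the stored data.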
You can see the query returned successfully. Now if I run the DISTINCT query again, you can see the difference: there you go, 13 rows of information and no null department; we have added a new department, Analytics.

Okay, now we are going to explore two more crucial clauses in SQL: GROUP BY and HAVING. So let's learn how the GROUP BY clause works in PostgreSQL. The GROUP BY statement groups rows that have the same values into summary rows; for example, you can find the average salary of employees in each country, city, or department. The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. Suppose you want to find the average salary of the employees by country: you can use the GROUP BY clause, so let me show you how to do it. I'll write SELECT; I want the countries and the average salary for each country, so I'll use the AVG function with my salary column inside it, and I'll give the alias average_salary, FROM my table name, employees. Next I'm going to use my GROUP BY clause: since I want the average salary for each country, I'll write GROUP BY country. Let's give a semicolon and run it with F5. There you go: on the left you can see the country names, Israel, Russia, Australia, United States, France, and Germany, and on the right, in the second column, the average salary for each of these countries. Now you can also order the result however you want: suppose you want to arrange the results based on the average salary, you can use the ORDER BY clause after the GROUP BY clause. I'll write ORDER BY, and here you can use the alias, average_salary, and let's say I want to arrange it in descending order, so I'll write DESC. Now let's run this; you can see the difference in the average
salary column. There you go: as per our result, the United States has the highest average salary, and if I scroll down, the average salary is lowest in Germany. Now let's see one more example using GROUP BY. Suppose this time you want to find the maximum salary of male and female employees; you can do that too, so let me show you how. I'll write SELECT, and since we want the maximum salary by gender, I'll select my gender column, a comma, and this time the MAX function, since I want the maximum salary for male and female employees, with the alias maximum_salary, FROM my table, employees, GROUP BY gender. Okay, let's run this. There you go: one of the female employees had the highest salary, $119,618, while that of a male employee was $117,654. All right, now suppose you want to find the count of employees in each country: you can use the COUNT function along with the GROUP BY clause. I'll write the SELECT statement: SELECT, first my country column, and then the COUNT function, COUNT(emp_id), FROM my table name, employees, and I'm going to group it by country. This query gives me the total number of employees from each country: you can see here that in Israel there are four employees, in Australia four, in Russia we have 80, in France 31, in the United States 27, and so on. Let me scroll down.

Okay, now it's time to explore one more clause, a very important clause used in PostgreSQL: HAVING. The HAVING clause works like the WHERE clause; the difference is that the WHERE clause cannot be used with aggregate functions. The HAVING clause is used with a GROUP BY clause to return only the groups that meet a condition. So suppose you want to find the countries in which the average salary is greater than $80,000: you can use the GROUP BY and HAVING clauses together to get the result.
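The GROUP BY examples above, consolidated into one sketch (column names assumed from the transcript):

```sql
-- Average salary per country, highest first
SELECT country, avg(salary) AS average_salary
FROM employees
GROUP BY country
ORDER BY average_salary DESC;

-- Maximum salary per gender
SELECT gender, max(salary) AS maximum_salary
FROM employees
GROUP BY gender;

-- Head count per country
SELECT country, count(emp_id) FROM employees GROUP BY country;
```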
So I'll write my SELECT statement: SELECT country, then the average salary, AVG(salary), with the alias average_salary, FROM employees. Now I'm going to group it by country, so GROUP BY country, and since I want the countries in which the average salary is greater than 80,000, I'll use the HAVING clause after the GROUP BY clause: HAVING AVG(salary) > 80000. Now this condition cannot be specified in the WHERE clause, which is why we need HAVING: you cannot use aggregate functions along with the WHERE clause. Let me just run it now. There you go: Russia and the United States are the countries where the average salary is greater than $80,000.

All right, now let's say you want to find the count of employees in each country where there are fewer than 30 employees. For this I'm going to use the COUNT function. First let me select the country column, then the COUNT function with my employee ID inside it, so we can count the number of employees, FROM my table, employees. If you want, you can use an alias for this as well, but I'm skipping it for the time being. I'll write GROUP BY country, and next HAVING COUNT(emp_id) < 30, which returns the countries in which there are fewer than 30 employees. Let's run it: you can see here that Israel, Australia, the United States, and Germany are the countries with fewer than 30 employees. Okay, now if you want, you can use the ORDER BY clause as well: suppose I write ORDER BY COUNT(emp_id) here, it will arrange my result in ascending order of the employee count, and there you can see we have successfully arranged our result that way.

Okay, next we are going to explore one more feature of PostgreSQL: the CASE statement. In PostgreSQL the CASE expression works like an if-else statement in any other programming language.
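The two HAVING queries just walked through, as one sketch:

```sql
-- Countries whose average salary exceeds 80,000
SELECT country, avg(salary) AS average_salary
FROM employees
GROUP BY country
HAVING avg(salary) > 80000;

-- Countries with fewer than 30 employees, smallest first
SELECT country, count(emp_id)
FROM employees
GROUP BY country
HAVING count(emp_id) < 30
ORDER BY count(emp_id);
```

The rule of thumb: WHERE filters rows before grouping, HAVING filters the groups after aggregation.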
It allows you to add if-else logic to a query to form a powerful query. Now let me just scroll down and show you how to use a CASE statement; it is very similar to the IF/ELSE constructs you use in Excel, C++, Python, or any other programming language. What I'm going to do is write a SQL query that creates a new column, and the name of the column will be, let's say, salary range. I'm going to divide my salaries: if the salary is greater than $45,000 and less than $55,000, the new salary range column gets the value "low salary"; if the salary is greater than $55,000 and less than $80,000, we assign "medium salary"; and if the salary is greater than $80,000, we assign "high salary". We'll do all of this using the CASE expression in PostgreSQL. I'll start with my SELECT statement, but before that let me show you how to write a comment in PostgreSQL: you can write a comment with a double dash. Comments are very helpful because they make your scripts readable. I'll write "-- CASE expression in PostgreSQL", and similarly, if you want, you can go to the top and write "-- HAVING clause" there. Okay, let's come down. I'll write my SELECT statement: SELECT the department, country, and salary columns, then a comma, and I'll start with my CASE statement. I'll write CASE WHEN salary > 45000 AND salary < 55000 THEN, within single quotes, 'low salary'; this is exactly like an if-else condition. Next, WHEN salary > 55000 AND salary < 80000 THEN 'medium salary', and finally my last condition, WHEN salary > 80000 THEN 'high salary'; let me write that one in a single line. Now one thing to remember in
PostgreSQL is that keywords are case-insensitive, so you can write your SELECT statement in capitals, lower case, or sentence case; similarly I can write CASE with a small c or a capital C. All right, moving ahead, after this I'm going to write END and give the alias salary_range; this is going to be my new column in the output. Let me just come down: after this we need to give our table name, FROM employees, and I'll order it by salary descending. So what I'm doing here is first selecting the department, country, and salary columns from my employees table, and then creating a new column, salary_range, for which I have specified three conditions: low salary, medium salary, and high salary. Let's run this and see the output. There you go: you can see we have added a new column known as salary_range, and we have ordered by salary descending, so all the highest salaries appear at the top. If I just scroll down you can see the medium salaries here, and if I scroll down further you can see the low salaries. So CASE statements are really useful when you want to create a new column based on conditions over an existing table.

All right, moving ahead, we are now going to see how to write subqueries in PostgreSQL. With subqueries we write a query inside another query, which is also known as a nested query. Suppose we want to find the employee name, department, country, and salary of those employees whose salary is greater than the average salary: in such cases you can use a subquery. Now let me show you how to write a query inside another query. First I'll write the SELECT statement: I'm going to select the employee name, comma, the department, comma, the country name, and the salary FROM the employees table WHERE my salary should be greater than the average salary. So after "salary greater than" I'll open brackets and write my subquery: SELECT AVG(salary) FROM employees.
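The CASE walkthrough above, written out as one statement (thresholds as stated in the transcript):

```sql
-- CASE expression in PostgreSQL
SELECT department, country, salary,
       CASE
           WHEN salary > 45000 AND salary < 55000 THEN 'low salary'
           WHEN salary > 55000 AND salary < 80000 THEN 'medium salary'
           WHEN salary > 80000                    THEN 'high salary'
       END AS salary_range
FROM employees
ORDER BY salary DESC;
```

One caveat worth noting: as written, a salary exactly equal to a boundary (45,000, 55,000, or 80,000) matches no branch and gets NULL; using >= on the lower bounds, or adding an ELSE branch, would close those gaps.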
Now let me break it down for you. First we select the average salary from the employees table: that inner SQL statement finds the average salary. We then compare this average with the salaries of all the employees, and for whichever employees have a salary greater than the average, we display their name, department, country, and original salary. If you want, you can run the inner statement on its own; let me select it and run it for you. You can see it returns the average salary of all the employees, which is nearly $81,400, and the full query then returns the employees whose salary is greater than that average.

All right, moving ahead, this time I'm going to show you some built-in functions: we'll learn some mathematical functions and string functions that are available in PostgreSQL. I'll just give a comment, and there's another way to write one: instead of a double dash you can use a forward slash and an asterisk, write your text inside, say "SQL functions", and close it with another asterisk and a forward slash; /* ... */ is also a comment in PostgreSQL. All right, first of all we'll explore a few math functions. There is a function called ABS, which is used to find the absolute value: if I write SELECT ABS(-100), it returns positive 100, or just 100, because, as you know, the absolute value removes the negative sign. There you go: our original input was -100, and the absolute value of -100 is +100. Next let's see another function, called GREATEST: the GREATEST function in PostgreSQL returns the largest value in a list of numbers. Suppose I write SELECT GREATEST and pass in a few numbers at random, say 2, 4, 90, 56.5, and 70; I'll give a semicolon and run it.
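To recap the subquery pattern from a moment ago: the inner SELECT runs first and yields a single value, which the outer WHERE then compares against every row (a sketch, with column names assumed from the transcript):

```sql
-- Employees earning more than the company-wide average
SELECT emp_name, department, country, salary
FROM employees
WHERE salary > (SELECT avg(salary) FROM employees);
```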
the greatest number present in the list of numbers we provided. In this case 90 was the largest, so we got 90 as the result. Again, you can use an alias for each of these statements. Now, like GREATEST, we also have a function called LEAST, which returns the smallest number in a list. If I run this, the result is 2, because 2 is the least number present in this selection. All right, now there's a function called MOD which returns the remainder of a division. Suppose I write SELECT MOD — this takes two parameters, let's say 54 and 10. As you can guess, the remainder is 4, and so is our result: you can see it has returned the remainder, and 54 divided by 10 leaves a remainder of 4. All right, if I scroll down, let's see how to use the POWER function. I'll write SELECT POWER(2, 3), which is 2 cubed, that is 8. Let me just run this — there you go, the result is 8. You can also check, say, POWER(5, 3); it should be 125. Next, you can use the SQRT function that is available in PostgreSQL to find the square root of a number. I'll write SQRT and let's say I want the square root of 100 — you can guess the result, the output should be 10. If I run this, you can see the output here: 10. Let's say I want to find the square root of 144 — you can again guess the result, it should be 12. Let's verify it. Okay, there is some error; let me verify it again — there you go, it is 12. Now there are a few trigonometric functions as well: you can use the SIN function, the COS function, and the TAN function. Let's say I want to know the sine of 0 — if you have studied high-school mathematics you would know that sin(0) is 0, and you can see the result: it is zero. Let's say you want sin(90); if I run it, you can see the output here is roughly 0.894, because SIN takes its argument in radians, not degrees. All right, now there are other functions like CEILING and FLOOR that you can use, so let me show you what the
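The remaining math functions from this segment, collected into one sketch:

```sql
SELECT LEAST(2, 4, 90, 56.5, 70);  -- smallest value in the list: 2
SELECT MOD(54, 10);                -- remainder of 54 / 10: 4
SELECT POWER(2, 3);                -- 2 cubed: 8
SELECT SQRT(144);                  -- square root: 12
SELECT SIN(0);                     -- sine of 0: 0
SELECT SIN(90);                    -- ~0.894; the argument is in radians
```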
CEILING and FLOOR functions do. I'll write CEILING, pass a floating-point value, 6.45, and run it. You can see the CEILING function returns the next highest integer, which is 7 in this case, since the next integer after 6.45 is 7. Let's see what the FLOOR function does and run it — as you can see, the FLOOR function returns the nearest lower integer, which is 6 in this case. Okay, now that we have seen how to use mathematical functions, there are a few string functions available in PostgreSQL as well, so let's explore them too. I'll write a comment, "string functions", and we scroll down. Cool — there's a function called CHARACTER_LENGTH that gives you the length of a text string. Suppose I write SELECT, give the function CHARACTER_LENGTH, and inside it pass a text, let's say 'India is a democracy'. Let me run this — you can see the result here is 20, since there are 20 characters in the string I provided. All right, now there's another function called CONCAT in PostgreSQL. CONCAT is basically used to merge or combine multiple strings. I'll write SELECT CONCAT, and within brackets I'll give the text strings: let's say I want to combine 'PostgreSQL ', then a comma, then 'is ', then another comma and my final word, 'interesting'. What we have done is pass separate strings into the CONCAT function, and now using CONCAT we want to merge the three strings. Let's see what the result is — I'll run it. All right, let me just expand this: you can see we have concatenated the three strings successfully, so the output is "PostgreSQL is interesting". Okay, now there are functions like LEFT and RIGHT in PostgreSQL. What the LEFT function does is extract the number of characters you specify from the left of a string. Let's say I write SELECT LEFT and pass in my text string, 'India is a
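As queries, the CEILING/FLOOR and first two string-function examples look like this:

```sql
SELECT CEILING(6.45);  -- next highest integer: 7
SELECT FLOOR(6.45);    -- nearest lower integer: 6

SELECT CHARACTER_LENGTH('India is a democracy');     -- 20 characters
SELECT CONCAT('PostgreSQL ', 'is ', 'interesting');  -- 'PostgreSQL is interesting'
```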
democracy'. I'll copy this and paste it here. Let's say I want to extract the first five characters from my string, so I'll give 5. What it will do is count five characters from the left — 1, 2, 3, 4, and 5 — so if I run this it should ideally print "India" for me. There you go, it has printed India for us. All right, similarly you can use the RIGHT function to extract characters from the right of a string. Let's say I want to extract 12 characters from the right, so from the end it will count 12 characters backwards. I'll change LEFT to RIGHT, select this, and run it. You can see the output here: counting 12 characters from the right has returned "a democracy". Okay, now there is a function called REPEAT. The REPEAT function is going to repeat a particular string the number of times you specify. Let's say I SELECT and use my REPEAT function, pass in 'India', and I want India to be displayed five times. I'll give a semicolon and run it — in the output you can see India has been printed five times. Okay, let's scroll down. There is another string function in PostgreSQL called REVERSE. What the REVERSE function does is print any string passed as input in reverse order. So if I write SELECT REVERSE and inside it pass my string, 'India is a democracy' — I'm going to use the same string, so I'll copy it, paste it here, and close the quotes and the brackets — let's print this, and you can see 'India is a democracy' has been printed in reverse order. There you go. All right, so far we explored a few built-in functions that are already present in PostgreSQL. PostgreSQL also has a feature where you can write your own user-defined functions, so now we will learn how to write a function of our own in PostgreSQL. Let's create a function to count the total number of email IDs present in our employees table.
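The LEFT/RIGHT/REPEAT/REVERSE examples as queries:

```sql
SELECT LEFT('India is a democracy', 5);    -- first 5 characters: 'India'
SELECT RIGHT('India is a democracy', 12);  -- last 12 characters: ' a democracy'
SELECT REPEAT('India', 5);                 -- 'India' concatenated 5 times
SELECT REVERSE('India is a democracy');    -- the string in reverse order
```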
So for this we'll write a user-defined function; let me give my comment as "user defined function". Okay, let me start by writing CREATE — the syntax to write a function in PostgreSQL is CREATE OR REPLACE FUNCTION, then I'll give my function name as count_emails and, as you know, functions have brackets. Then I write RETURNS with the return type, integer, and then a dollar-quoted delimiter to open the function body — I'll write $total_emails$, since I'm going to display the total number of email IDs present in my table, and close the dollar symbols. Then I'm going to declare a variable: the variable name is total_emails, of type integer. I'll write BEGIN, and inside it my SELECT statement. I want to count the email IDs that are present, so I write SELECT COUNT of my column name, email, INTO total_emails, FROM my table, employees, and give a semicolon. Then we write RETURN total_emails — as you know, user-defined functions often return a value, hence the RETURN statement — and then I end my function with END. The next bit of syntax, let me just scroll down: here I give my dollar-quoted delimiter again, $total_emails$, and next I specify the language — for PostgreSQL procedural code the way to write it is plpgsql. Let's give a semicolon and end it. So this is the user-defined function I have written: I created a function named count_emails that returns an integer, with $total_emails$ as the dollar-quoting tag; we declared the variable total_emails as an integer; then we started a BEGIN block containing my SELECT statement, where I select the count of email IDs present in the employees table and put the value into total_emails — I have used the INTO keyword — and it returns the result as total_emails, and I have ended the function. Let's run this. Okay, there is some problem — a typo; this should be "integer". Let me run it once
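Put together, the function described above would look roughly like this (a sketch; the `employees` table and `email` column follow the narration, including the `$total_emails$` dollar-quoting tag the instructor uses):

```sql
CREATE OR REPLACE FUNCTION count_emails()
RETURNS integer AS $total_emails$
DECLARE
    total_emails integer;
BEGIN
    -- COUNT(email) skips NULLs, so only rows with an email ID are counted
    SELECT COUNT(email) INTO total_emails FROM employees;
    RETURN total_emails;
END;
$total_emails$ LANGUAGE plpgsql;

-- Call the function
SELECT count_emails();
```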
again. There you go — you've successfully created a user-defined function. Now the final step is to call that function. To call it I'm going to use a SELECT statement with the function name, count_emails; I'll give a semicolon and execute it. There you go — you can see there are 134 email IDs present in our employees table. One thing to note is that there are 150 employees in total, but out of them only 134 have email IDs; the rest don't, so they would have NULL values. All right, that brings us to the end of this demo session of the PostgreSQL tutorial. Let me go to the top — we have explored a lot. We started by checking the version of PostgreSQL, then we saw how to perform basic mathematical operations (add, subtract, multiply), then we created a table called movies, inserted a few records into it, used the SELECT clause, updated a few values, and deleted one row of information. Then we learned how to use the WHERE clause, the BETWEEN operator, and also the IN operator. Let me scroll down: we created a table called employees, learned how the DISTINCT keyword works, learned how to use IS NULL with the WHERE clause, and learned about the ORDER BY clause. We saw how to alter a table and rename a column, then explored a few more examples of the WHERE clause and learned about the AND and OR operators. Then we learned how to use LIMIT and OFFSET as well as the FETCH keyword in PostgreSQL. Moving further, we learned about the LIKE operator in SQL, which is used to perform pattern matching. We saw how to use basic built-in PostgreSQL aggregate functions like SUM, AVG, MIN, COUNT, and MAX. Next we saw how to update a value in a column using the UPDATE command, learned how to use GROUP BY and the HAVING clause, and then learned CASE expressions in PostgreSQL — we saw how a CASE
expression is similar to if-else in other programming languages. We explored a few mathematical and string functions, and finally we wrote our own user-defined function. So that brings us to the end of this tutorial on PostgreSQL. Now, if you want the SQL file we used in the demo, you can leave your email ID in the comments section and our team will share the file with you over email. So what exactly is a CTE, you ask? If you are a beginner in SQL, let's say you wanted to club two or more different tables — maybe three or four — you would use one keyword, JOIN. And you may have to build a query in such a way that you club different tables, extract the results from one table into another, and finally create an output table. This might sound a little too complex, but basically what a CTE does is act as a temporary table. You can write a query and save it as a CTE; the resultant table from the CTE will not actually be created, but will be held in memory as temporary, intermediate result data. Then, whenever you want to use a join, or reuse the same query inside a subquery somewhere in your statement, you can simply use the name of the CTE, along with the columns you require, and done — you will get the data. This might be a little too complicated to understand in mere words, so let's go through the formal definition of what exactly a CTE is and what it does, and then get started with some practical examples. A CTE, also known as a Common Table Expression — some people also call it a WITH expression, since the keyword is WITH — is a
temporary result set that you can define within a query. As I said, it helps break down complex queries, makes the code more readable, and allows you to reuse the result set multiple times within the same query — you just need to use the name of the CTE in the places where you want it — and it reduces the code length as well as the execution time. CTEs are defined using the WITH keyword, as we discussed, followed by the CTE name. Just as you give a name to every column in your result set, when you use a CTE in SQL you also need to give it a name, and that name will be used in your subquery positions, which reduces query length and execution time. So you give a name, followed by the query that generates the result set. The CTE is available only during the execution of that specific query — as I said, the resultant table created while you are using the CTE will not be stored as a permanent table in the database; it is a temporary or intermediate result which stays active only as long as the current query using the CTE is active. Now let's go into demonstration mode, try to create some simple queries, and understand how exactly a CTE can be beneficial in those situations. Let's go to the MySQL Workbench. This is my MySQL Workbench, and I have a lot of tables here — we have the credit card dataset, the sakila dataset, an SLP dataset, superstore, world, etc. We will be using the superstore dataset — I mean, the database. Firstly we need to write the query that says I am going to use the superstore database. I prefer using lowercase for database names and column names and uppercase for the keywords — for example, here USE is uppercase and superstore is the name being used — for easy
readability, so you can identify which is a keyword and which is a name. So let's execute this query and get access to the superstore database. In the superstore database I have one table called excel_data; let's quickly check what we have in it: SELECT * FROM excel_data. Here we have the order ID, order date, ship mode, customer, everything — region, sales, quantity, discount, profit rate. So there are a number of possibilities and reports we can generate, but let's keep it simple and try to find the unique regions: select the regions from excel_data and group by region. Let's quickly execute this statement and see the output — or maybe make a modification: instead of that, you might want to use the DISTINCT keyword so you don't get all 10,000-plus rows. This particular dataset has about 10,000 or more rows in it, and it's not real — it's a completely made-up report; we used ChatGPT to create roughly 10,000 rows of data spanning about 30 years, from around 2001 up to 2030. We don't want all those 10,000-plus rows, so let's use DISTINCT and execute the statement so that we get just the regions we have. There you go — we have five regions, as expected: North, East, South, West, and Central. Now you can also select, say, the average of sales, the maximum of sales, and the total sales. This is bringing us somewhere — we can try to find region-wise sales, or group the sales by year across the 30 years: what were the sales in 2001, and so on all the way up to 2030? We can see whether there is an increase or decrease in the year-on-year sales, and identify the best-performing and worst-performing eras. This sounds like a good use case, so now let's
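The exploratory queries described so far would look roughly like this (a sketch; `excel_data` and its `region`/`sales` columns are taken from the narration, MySQL syntax):

```sql
USE superstore;

-- Distinct regions instead of 10,000+ raw rows
SELECT DISTINCT region FROM excel_data;

-- Region-wise sales summary
SELECT region, AVG(sales), MAX(sales), SUM(sales)
FROM excel_data
GROUP BY region;
```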
go to the code where I've written it as a CTE and understand the workflow. Here I have named my CTE sales_cte, so I'm starting with the WITH keyword: WITH sales_cte AS, and then this is our query. What am I doing? According to the dataset we have the date — the year, month, and day of that particular order — but we want just the year, so we're using the YEAR function to extract the year from the date, aliased as sales_year, along with the region and SUM(sales) — we want to find the total sales that happened in that particular year — aliased as total_sales, from the excel_data table, with the appropriate GROUP BY. We want it in increasing order, from about 2001 all the way up to 2030, and I'm saving all of this as a CTE named sales_cte. Now I want to select some parts of that CTE: SELECT sales_year, region, total_sales FROM sales_cte — which is right over here — and order it by sales year and region. Now let's copy this code and run it in our Workbench. Let me close this panel so we have a complete view of the code, select all of it, and run it to see the output. There you go — okay, we have the output, but there is something wrong: we did not get the years right. All 30 years of data are there and it is grouped by region, which is fine — actually, we don't want region, we want to group by year — and we need to fix this year column, so maybe there is something wrong with the order date. Since this data was generated by ChatGPT, maybe the data type of the date column is something other than the DATE data type — it may be a string — so we might have to do some type casting to change the data type of the order date. Let's quickly do that. Now we have fixed the year — what we have done is just use CAST here,
so we have converted the string form of the date to a proper DATE, which follows the year-month-day format. That is a simple type cast you can do. We've also added a WHERE condition: where the date is missing or NULL, we just ignore those rows. Now let's execute this query — we have also removed the grouping and ordering by region, because we want the output ordered by year, starting with the first year in the dataset and going to the last. Let's select the entire CTE query, run it, and check our output. There you go — you have the year-on-year sales from 2001 all the way up to 2030. So that's how a CTE, or Common Table Expression — the WITH query in SQL — can be used. Now welcome to the demo part of the SQL project. In this one we will do a digital music store analysis. This SQL project is for beginners, so what will you learn from it — what's the objective of this particular project? It will teach you how to analyze a music playlist database: you can examine the dataset with SQL and help the store understand its business growth by answering simple questions. As you can see, I have three sets of questions: the first set is easy, the second is moderate, and the third is advanced. The easy set has five questions and the other two sets have three each, so we have 5 + 3 + 3 = 11 questions to solve. From this you will understand how you can analyze data with SQL, how you can extract something from a database, how you can store something, and so on. And one more thing — I will show you the schema of the particular
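The final CTE, with the type cast and NULL filter the instructor adds, would look roughly like this (a sketch in MySQL syntax; the `order_date` column name is an assumption, since the exact name is not shown on screen):

```sql
WITH sales_cte AS (
    -- Cast the string date to DATE so YEAR() works correctly
    SELECT YEAR(CAST(order_date AS DATE)) AS sales_year,
           SUM(sales) AS total_sales
    FROM excel_data
    WHERE order_date IS NOT NULL   -- skip rows with a missing date
    GROUP BY sales_year
)
SELECT sales_year, total_sales
FROM sales_cte
ORDER BY sales_year;               -- first year through last year
```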
dataset, which we will restore soon. We have these tables in it: artist, album, track, media_type, genre, invoice_line, invoice, customer, employee, playlist, and playlist_track. So this is the music playlist database schema. Without any further ado, let me create a database: just right-click, Create Database, name it "music", and save. Now our database is created. If you go to Schemas and then Tables, there are no tables in it — the database exists but is empty. So now just go to your database, right-click, and you can see the Restore option. Choose Restore, leave the format as it is, then for the file name browse to the music store database file — I will put the database link in the description box below, don't worry — click Open, then Restore. The restore process starts and completes. Some people will see the process fail. For that, what you have to do is go to File, then Preferences, and set the binary path. See, I am using version 15, so I have set the path accordingly — both the PostgreSQL path and the EDB Advanced Server path. If you don't set the EDB Advanced Server path it's fine, but the PostgreSQL binary path is the important one; for future reference I have set both. To find the path, just go to This PC, then your OS drive, then Program Files, where you will find PostgreSQL, then (since I'm using 15) the 15 folder, then bin. You just have to copy this path, paste it into the preference field, select it, and then save — after that you won't see the failure, and the process will complete. So now let's move forward and see the tables. It still looks empty — just refresh it, and now you can see all the tables and columns. For a quick check I will run one query here; let me close this panel first. I will write here
SELECT * FROM album and run it. As you can see, my table is working fine and everything seems good. Now what we will do is solve the questions one by one. The first question in the easy set: who is the senior-most employee based on job title? I will write the question here as a comment: "Q1: Who is the senior most employee based on job title?" We know we have a table named employee, so we will query that table first: SELECT * FROM employee. You should know which table to select — as you can see, in this question the word "employee" appears, which points to the employee table. I will just select this and run it. Now you can see that in employee there is employee_id, last_name, first_name, title, reports_to, levels, birthdate, hire_date, and all the details. There is one more thing you can see: the levels column — level 1, level 2, and so on. We want the senior-most employee based on the job title, so I will write ORDER BY levels DESC. Now you can see the levels in descending order, from senior downwards — L7 to L1. We want only one employee name, so I will add LIMIT 1, copy this, and run it. Now you can see the result: the last name is Madan and the first name is Mohan, so Mohan Madan is the senior-most employee based on the job title. Question one is done. The second question: which countries have the most invoices? First I will write down the question: "Which countries have the most invoices?" For this, first we have to check from
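As a query, the first question comes together like this (a sketch; the `employee` table and its `levels` column follow the narration):

```sql
-- Q1: senior-most employee based on job title (highest level first)
SELECT *
FROM employee
ORDER BY levels DESC
LIMIT 1;
```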
which table we will get the solution. Here you can see the word "invoices" — we have an invoice table and an invoice_line table, and we have to select from the invoice table. I will write SELECT * FROM invoice. We have customer_id, invoice_date, billing_address, billing_city, billing_state, and everything — and as you can see, we also have billing_country. Because we need the country name, we will take that column. So I will change the query to SELECT COUNT(*), billing_country FROM invoice GROUP BY billing_country. Why am I doing this GROUP BY? Because, as you can see, USA appears multiple times — USA, USA, USA — and so do Canada and the other countries; by grouping them I get just one row per country, and from that we get the count. After this I will add ORDER BY the count, descending. Let me run it — now you can see the invoice counts per billing country: USA got 131, Canada 76, Brazil 61. If I add LIMIT 1, what do I get? USA — so the USA is the country with the most invoices. If you remove the limit you get the other countries as well: second is Canada, third is Brazil, and so on. The third question: what are the top three values of total invoices? Again we need the same table. First I will write the question: "What are the top 3 values of invoice totals?" I know I could just adapt the solution from the second question, but I want to do it from the start. First I take SELECT * FROM invoice and run it. Then we sort the data: here I will write ORDER BY total — that is the column name — in descending order. We need just the top three, so first I will add, you
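The second question as a query (a sketch following the narration):

```sql
-- Q2: which countries have the most invoices?
SELECT COUNT(*) AS invoice_count, billing_country
FROM invoice
GROUP BY billing_country
ORDER BY invoice_count DESC;   -- add LIMIT 1 for just the top country
```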
know, LIMIT 3. Okay — here I wrote SELECT *, which is why it's giving me all the columns. If I want only the total value, I can write SELECT total FROM invoice ORDER BY total DESC. Let me run it — I get the totals, roughly 23.76, 19.8, and 19.8, so those are the top three values of invoice totals. The fourth question: which city has the best customers? We would like to throw a promotional music festival in the city where we made the most money. Write a query that returns the one city that has the highest sum of invoice totals; return both the city name and the sum of all invoice totals. Let me write the question first — I'm writing the questions out for your better understanding. So, which city has the best customers? First, SELECT * FROM invoice. There are two columns we have to focus on: billing_city and total. So I will write SUM(total) AS invoice_total, then billing_city, FROM invoice, and this time we will GROUP BY billing_city because we need the city names; then I will add ORDER BY invoice_total in descending order. That looks good: SELECT SUM(total) AS invoice_total, billing_city FROM invoice GROUP BY billing_city ORDER BY invoice_total DESC. Let me select this and run it — as you can see, the highest-billing city is Prague, and the
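Questions three and four as queries (sketches following the narration):

```sql
-- Q3: top 3 invoice totals
SELECT total
FROM invoice
ORDER BY total DESC
LIMIT 3;

-- Q4: city with the highest sum of invoice totals
SELECT SUM(total) AS invoice_total, billing_city
FROM invoice
GROUP BY billing_city
ORDER BY invoice_total DESC;
```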
best customers are from the city of Prague (sorry for the mispronunciation). So Prague is the city with the best customers, and that is how we have solved our fourth question — we returned both the city name and the invoice total. Moving forward to our fifth question, which is again a long one: who is the best customer? The customer who has spent the most money will be declared the best customer. Write a query that returns the person who has spent the most money. So I will write here: who is the best customer — the customer who has spent

the most money will be declared the best customer; write a query that returns the person who has spent the most money. For this we have to take the customer table data, so I will write SELECT * FROM customer, select it, and run it. This is our customer table data: we have the country, fax, email, state, city, address, last name, and first name. As you can see, there is no detail of the invoices or the money spent by the customer, so we will look at our schema. If we can't solve a particular question with one table, we have to join it to another table — here we have to join the customer table to the invoice table. In the schema you can see there is a customer_id in each, so on the basis of customer_id we can join the two tables, and with the help of the total column we can sort out that customer. For this I will write SELECT customer.customer_id, customer.first_name, customer.last_name — because we need the full name of that person — and SUM(invoice.total) AS total. Then FROM customer, then JOIN invoice ON customer.customer_id = invoice.customer_id, then I need GROUP BY customer.customer_id. After this, let me order it in descending order so the biggest spender comes up first: ORDER BY total DESC, then LIMIT 1. Fine, let me run it and see the output. Okay, some error is coming — okay, fixed. As you can see, the customer_id is 5, and the first name and
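The join the narration builds up can be sketched as:

```sql
-- Q5: customer who has spent the most money
SELECT customer.customer_id, customer.first_name, customer.last_name,
       SUM(invoice.total) AS total
FROM customer
JOIN invoice ON customer.customer_id = invoice.customer_id
GROUP BY customer.customer_id
ORDER BY total DESC
LIMIT 1;
```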
last name of the top spender are shown — that customer has spent the highest total, so they are declared the best customer, the one who has spent the most money. This is how we are done with our easy set of questions; now let's jump into the moderate ones. Let me write the heading first: "Moderate questions". These analytics skills help you in data analytics, whether you want to become a data analyst or a data scientist. Question one: write a query to return the email, first name, last name, and genre of all rock music listeners; return the list ordered alphabetically by email, starting with A. First, in this question we have to return the email, first name, last name, and genre of all rock music listeners. If you look at SELECT * FROM customer — let me run this — there is no column named genre. If I show you the schema, the genre table is here and the customer table is there; we need the first name, last name, and email, and the genre, which should be Rock. What I can do is connect genre with track, since track carries a genre reference; then track to invoice_line via track_id, then invoice_line to invoice, then invoice to customer via customer_id. That is the pattern I have to follow. For this I will write — just follow the steps — SELECT DISTINCT email, first_name, last_name FROM customer, then JOIN invoice ON customer.customer_id = invoice.customer_id, then JOIN invoice_line ON invoice.invoice_id = invoice_line.invoice_id, then WHERE track_id should be IN a subquery: SELECT track_id FROM track, then JOIN genre ON track.genre
The condition in that subquery is important: WHERE genre.name LIKE 'Rock', because the question asks for rock music listeners specifically, and then ORDER BY email at the end. Before running it, let me show you the track table with SELECT * FROM track: it has the track_id, name, album_id, media type, genre_id, composer, milliseconds, bytes, and unit price. Now when I run the query, invoice_id is ambiguous (it appears in more than one table), so I have to qualify it as invoice_line.invoice_id. There was also an error referencing the genre table and a spelling mistake in genre.name; no worries, it happens. After fixing those, we have all the people who love rock music, with email, first name, and last name: Aaron Mitchell, Alexandre Rocha, and so on, 59 people in total from this database who listen to rock.

Now question 2: let's invite the artists who have written the most rock music in our dataset. Write a query that returns the artist name and total track count of the top 10 rock bands. What do we need here? The artists, the rock genre filter, and the track counts, which come from the track table.
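The rock-listeners query above, with its IN subquery over track and genre, can be sketched like this. A minimal sketch on toy SQLite data; all table, column, and sample values are assumptions modeled on the Chinook-style schema discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT,
                       first_name TEXT, last_name TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER);
CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE track (track_id INTEGER PRIMARY KEY, genre_id INTEGER);
INSERT INTO genre VALUES (1, 'Rock'), (2, 'Jazz');
INSERT INTO track VALUES (100, 1), (101, 2);
INSERT INTO customer VALUES (1, 'a@x.com', 'Ann', 'Ray'),
                            (2, 'b@x.com', 'Bob', 'Lee');
INSERT INTO invoice VALUES (10, 1), (11, 2);
INSERT INTO invoice_line VALUES (10, 100), (11, 101);
""")

# Walk customer -> invoice -> invoice_line, then keep only lines whose
# track belongs to the Rock genre via the subquery.
rock_listeners = cur.execute("""
    SELECT DISTINCT email, first_name, last_name
    FROM customer
    JOIN invoice ON customer.customer_id = invoice.customer_id
    JOIN invoice_line ON invoice.invoice_id = invoice_line.invoice_id
    WHERE invoice_line.track_id IN (
        SELECT track_id FROM track
        JOIN genre ON track.genre_id = genre.genre_id
        WHERE genre.name LIKE 'Rock')
    ORDER BY email
""").fetchall()
```

Only Ann bought a Rock track here, so she is the only listener returned; Bob's jazz purchase is filtered out by the subquery.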
Looking at the schema: genre gives us the rock filter and connects to track via genre_id; track connects to album via album_id, because the artist name lives behind album; and album connects to artist via artist_id. That is how we chain the tables. So I'll write: SELECT artist.artist_id, artist.name, COUNT(artist.artist_id) AS number_of_songs FROM track, then JOIN album ON album.album_id = track.album_id, then JOIN artist ON artist.artist_id = album.artist_id to bring in the artist names, and finally JOIN genre ON genre.genre_id = track.genre_id. Then the filter: WHERE genre.name LIKE 'Rock'. Then GROUP BY artist.artist_id (we need the ID as well), ORDER BY number_of_songs DESC, and since we only want the top 10 rock bands, LIMIT 10. Running it (after fixing a join typo), Led Zeppelin comes out on top with the most rock songs, followed by U2, Deep Purple, and the rest of the top 10. That's how we solved question 2.

Now the third question: return all the track names that have a song length longer than the average song length; return the name and milliseconds for each track, ordered with the longest songs listed first. We'll do this in two steps: first find the average track length, then use a WHERE clause to keep only the tracks longer than that. I'll write: SELECT name, milliseconds FROM track WHERE milliseconds > (SELECT AVG(milliseconds) FROM track), then ORDER BY milliseconds DESC. Let me run it.
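The top-rock-bands query just described, with its four-table join chain, can be sketched as below. A minimal sketch on toy SQLite data; the schema and sample counts are assumptions, not the real Chinook figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE album (album_id INTEGER PRIMARY KEY, artist_id INTEGER);
CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE track (track_id INTEGER PRIMARY KEY,
                    album_id INTEGER, genre_id INTEGER);
INSERT INTO artist VALUES (1, 'Band A'), (2, 'Band B');
INSERT INTO genre VALUES (1, 'Rock'), (2, 'Jazz');
INSERT INTO album VALUES (10, 1), (11, 2);
INSERT INTO track VALUES
    (100, 10, 1), (101, 10, 1),  -- Band A: two rock tracks
    (102, 11, 1), (103, 11, 2);  -- Band B: one rock, one jazz track
""")

# Chain track -> album -> artist for the names, and track -> genre for
# the Rock filter; count tracks per artist and rank descending.
top = cur.execute("""
    SELECT artist.artist_id, artist.name,
           COUNT(artist.artist_id) AS number_of_songs
    FROM track
    JOIN album ON album.album_id = track.album_id
    JOIN artist ON artist.artist_id = album.artist_id
    JOIN genre ON genre.genre_id = track.genre_id
    WHERE genre.name LIKE 'Rock'
    GROUP BY artist.artist_id
    ORDER BY number_of_songs DESC
    LIMIT 10
""").fetchall()
```

Band A ranks first with two rock tracks; Band B's jazz track is excluded by the WHERE filter, leaving it with one.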
Reading the question again: return all the track names longer than the average song length, with name and milliseconds for each track, ordered with the longest song listed first. And that's exactly what we have: all the songs longer than the average, sorted from the longest down. Now we move into the advanced set of questions.

Advanced question 1: find how much each customer has spent on each artist. Write a query that returns the customer name, artist name, and total spent. How do we solve this? First, find which artist has earned the most according to the invoice lines. Looking at the schema, we need the artist name, the customer name, and the total spend, and we need the invoice_line table because that's where the quantity lives. So we'll join artist, album, track, invoice_line, invoice, and customer. Just remember, this one is tricky: the total in the invoice table might cover more than a single product, which is why I said we need the quantity. We use the invoice_line table to see how many of each product was purchased, and then multiply by the unit price. This is the lengthy one, so I'll write it out; it groups by five columns.
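The longer-than-average step can be sketched with a scalar subquery, as described. A minimal sketch on toy SQLite data; table, column names, and durations are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE track (name TEXT, milliseconds INTEGER);
INSERT INTO track VALUES ('Short', 100), ('Mid', 200), ('Long', 600);
""")

# Average length is (100 + 200 + 600) / 3 = 300, so only 'Long' passes
# the filter; ORDER BY puts the longest songs first.
rows = cur.execute("""
    SELECT name, milliseconds
    FROM track
    WHERE milliseconds > (SELECT AVG(milliseconds) FROM track)
    ORDER BY milliseconds DESC
""").fetchall()
```

The subquery is evaluated once, so every row is compared against the same table-wide average.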
You can follow along: we take the artist name, then SUM(invoice_line.unit_price * invoice_line.quantity), the multiplication I showed you, so each line's price is weighted by its quantity. Then we join track with invoice_line, album with track, and artist with album. Running it, we can see each customer name, artist name, and the total amount spent, with the biggest spenders on artists like Queen at the top. Now let's move on to the next one.

Question 2: we want to find the most popular music genre for each country, where the most popular genre is the one with the highest number of purchases. Write a query that returns each country along with its top genre; for countries where the maximum number of purchases is shared, return all the tied genres. There are two parts to this question: finding the most popular genre, and getting the data at the country level. We can do it two ways, using a CTE or using a recursive method; I'll use the CTE. For that I write: WITH popular_genre AS (SELECT COUNT(invoice_line.quantity) AS purchases, customer.country, genre.name, genre.genre_id, and then ROW_NUMBER() OVER (PARTITION BY customer.country ORDER BY COUNT(invoice_line.quantity) DESC) AS row_number, the window function that ranks the genres within each country by purchase count.
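The spend-per-customer-per-artist query, with the unit_price times quantity trick, can be sketched as below. A minimal sketch on toy SQLite data; all names and prices are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       first_name TEXT, last_name TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER,
                           unit_price REAL, quantity INTEGER);
CREATE TABLE track (track_id INTEGER PRIMARY KEY, album_id INTEGER);
CREATE TABLE album (album_id INTEGER PRIMARY KEY, artist_id INTEGER);
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO artist VALUES (1, 'Queen');
INSERT INTO album VALUES (10, 1);
INSERT INTO track VALUES (100, 10);
INSERT INTO customer VALUES (1, 'Ann', 'Ray');
INSERT INTO invoice VALUES (50, 1);
INSERT INTO invoice_line VALUES (50, 100, 0.99, 3);  -- 3 copies at 0.99
""")

# The spend is SUM(unit_price * quantity), because an invoice total may
# cover several products; join the whole chain to attach names.
rows = cur.execute("""
    SELECT c.first_name || ' ' || c.last_name AS customer_name,
           ar.name AS artist_name,
           SUM(il.unit_price * il.quantity) AS total_spent
    FROM invoice_line il
    JOIN invoice i  ON i.invoice_id = il.invoice_id
    JOIN customer c ON c.customer_id = i.customer_id
    JOIN track t    ON t.track_id = il.track_id
    JOIN album al   ON al.album_id = t.album_id
    JOIN artist ar  ON ar.artist_id = al.artist_id
    GROUP BY c.customer_id, ar.artist_id
    ORDER BY total_spent DESC
""").fetchall()
```

Here Ann's three copies at 0.99 each sum to 2.97 spent on Queen.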
Then FROM invoice_line, and the joins: JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id, JOIN customer ON customer.customer_id = invoice.customer_id, JOIN track ON track.track_id = invoice_line.track_id, and JOIN genre ON genre.genre_id = track.genre_id. Then GROUP BY 2, 3, 4 and ORDER BY 2 ascending, 1 descending. Finally, outside the CTE: SELECT * FROM popular_genre WHERE row_number <= 1. Running it, reading the results: Argentina's most popular genre is Alternative & Punk, Australia's is Rock, the USA's is Rock, and so on for every country. That is how you find the most popular music genre per country.

Now the last question: write a query that determines the customer who has spent the most on music for each country. Return the country along with the top customer and how much they spent; for countries where the top amount spent is shared, return all the customers who spent that amount. This is similar to the previous question, with two parts: find the most spent on music per country, then filter the data for the matching customers. It's straightforward, so I'll write the solution as another CTE.
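The popular-genre-per-country CTE with ROW_NUMBER can be sketched as below. A minimal sketch on toy SQLite data (window functions need SQLite 3.25+, which ships with current Python); all names and counts are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER,
                           quantity INTEGER);
CREATE TABLE track (track_id INTEGER PRIMARY KEY, genre_id INTEGER);
CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO genre VALUES (1, 'Rock'), (2, 'Jazz');
INSERT INTO track VALUES (100, 1), (101, 2);
INSERT INTO customer VALUES (1, 'USA');
INSERT INTO invoice VALUES (10, 1);
-- Two Rock purchases and one Jazz purchase for the USA.
INSERT INTO invoice_line VALUES (10, 100, 1), (10, 100, 1), (10, 101, 1);
""")

# The CTE counts purchases per (country, genre); ROW_NUMBER ranks genres
# inside each country so the outer query can keep only rank 1.
rows = cur.execute("""
    WITH popular_genre AS (
        SELECT COUNT(il.quantity) AS purchases, c.country, g.name,
               ROW_NUMBER() OVER (PARTITION BY c.country
                                  ORDER BY COUNT(il.quantity) DESC) AS rn
        FROM invoice_line il
        JOIN invoice i  ON i.invoice_id = il.invoice_id
        JOIN customer c ON c.customer_id = i.customer_id
        JOIN track t    ON t.track_id = il.track_id
        JOIN genre g    ON g.genre_id = t.genre_id
        GROUP BY 2, 3)
    SELECT country, name, purchases FROM popular_genre WHERE rn = 1
""").fetchall()
```

Rock wins for the USA with two purchases against Jazz's one. Note that RANK() instead of ROW_NUMBER() would keep ties, matching the "shared maximum" wording of the question.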
I write: WITH customer_with_country AS (SELECT customer.customer_id, first_name, last_name, billing_country, SUM(total) AS total_spending, and then the same ROW_NUMBER() construction as before: ROW_NUMBER() OVER (PARTITION BY billing_country ORDER BY SUM(total) DESC) AS row_number. Then FROM invoice, and again we join the tables: JOIN customer ON customer.customer_id = invoice.customer_id. Then GROUP BY 1, 2, 3, 4, and ORDER BY 4 ascending, 5 descending. Finally: SELECT * FROM customer_with_country WHERE row_number = 1. Running it, we get the first name, last name, billing country, total spending, row number, and customer ID. Re-reading the question: the customer who has spent the most on music for each country, returned as the country along with the top customer and how much they spent, and for countries where the top amount is shared, all the customers who spent that amount. We have everything: the top customer from Brazil and from every other country, with their total spending. That is how you solve these questions, and at this point I'd say you have solid data analytics skills that will help you in interviews for data analyst, data science, or any SQL role.

Picture this: you're in an interview and the interviewer asks, can you write a query to find the top five sales records? You freeze for a moment, thinking, am I ready for this? Don't worry. SQL might sound complicated, but it's actually a super useful tool that lets you interact with databases.
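The top-customer-per-country CTE can be sketched the same way. A minimal sketch on toy SQLite data; the billing_country column on invoice and all sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       first_name TEXT, last_name TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER,
                      billing_country TEXT, total REAL);
INSERT INTO customer VALUES (1, 'Ann', 'Ray'), (2, 'Bob', 'Lee');
INSERT INTO invoice VALUES (10, 1, 'USA', 5.0), (11, 2, 'USA', 9.0);
""")

# Sum spending per customer, rank customers within each billing country,
# and keep only the top-ranked customer per country.
rows = cur.execute("""
    WITH customer_with_country AS (
        SELECT c.customer_id, first_name, last_name, billing_country,
               SUM(total) AS total_spending,
               ROW_NUMBER() OVER (PARTITION BY billing_country
                                  ORDER BY SUM(total) DESC) AS rn
        FROM invoice i
        JOIN customer c ON c.customer_id = i.customer_id
        GROUP BY 1, 2, 3, 4)
    SELECT first_name, last_name, billing_country, total_spending
    FROM customer_with_country WHERE rn = 1
""").fetchall()
```

Bob tops the USA with 9.0 spent against Ann's 5.0; as with the genre question, RANK() would keep tied top spenders.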
Have you ever wondered how all those apps and websites store and organize their data? That's where SQL comes in. SQL, which stands for Structured Query Language, is a universal language for talking to databases. It's powerful and lets you pull out specific information, add new data, update existing records, or delete things you don't need; it's basically your magic key to managing huge amounts of information with ease. And if you're aiming for a career in tech, whether as a database administrator, data analyst, or software developer, SQL is a must-have skill: databases are at the heart of almost every application, so knowing SQL can unlock some really exciting opportunities. This video is your secret weapon for mastering SQL interviews. We've packed it with 45 carefully chosen SQL interview questions covering everything you need, starting with the basics of how databases work and then diving into advanced query challenges. By the end, you'll be fully prepared to tackle any SQL question thrown your way. So let's dive in and get you closer to your dream job.

Let's start with the SQL interview questions, covering everything from basic to advanced. Our first question is very basic: what is SQL? We all know SQL stands for Structured Query Language, and it is the language used to talk to databases. Think of it as giving instructions to a computer system that stores and organizes data. For example, if you want to find all the customers who ordered a specific product, SQL can help you do that with a simple command. You can also use SQL to add new data, like entering a new customer's details into the database; if you want to update someone's phone number, SQL has you covered; and if you want to delete old records that are no longer needed, SQL can handle that too. Here's a quick example.
If you want to find all the customers in New York, you could write something like SELECT * FROM customers WHERE city = 'New York'. Remember, the star means you want every column from the table. And if you want to add a new customer, you can simply write INSERT INTO customers (name, city) VALUES ('John', 'New York'). SQL works much the same way across many popular databases like MySQL, PostgreSQL, or SQL Server, which is why it's such an important skill for anyone working with data.

Now our second question: what are the different types of SQL commands? SQL commands are the instructions you give a database to tell it what to do, and each type has a specific purpose. If an interviewer asks you this, explain it with the proper keywords, clear definitions, and simple language. First, DDL, Data Definition Language: it defines the structure of the database, for example CREATE TABLE, ALTER TABLE, or DROP TABLE. Then DML, Data Manipulation Language: it deals with the actual data in the database, for example INSERT, UPDATE, and DELETE. Then DCL, Data Control Language: it manages permissions and access control with GRANT and REVOKE, where GRANT provides access rights and REVOKE removes them. We also have TCL, Transaction Control Language, which manages transactions in the database: COMMIT saves changes and ROLLBACK undoes them.
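The DDL and DML categories can be demonstrated in a few lines. A minimal sketch using SQLite (which has no GRANT/REVOKE, so DCL is not shown); table and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of the database.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# DML: manipulate the actual data.
cur.execute("INSERT INTO customers (name, city) VALUES ('John', 'New York')")
cur.execute("UPDATE customers SET city = 'Boston' WHERE name = 'John'")
city = cur.execute("SELECT city FROM customers WHERE name = 'John'").fetchone()[0]
cur.execute("DELETE FROM customers WHERE name = 'John'")
remaining = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

After the UPDATE, John's city reads Boston; after the DELETE, the table is empty again.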
SAVEPOINT creates intermediate points within a transaction. So, for instance, in a schema with a customers table and an orders table, Data Definition Language commands define the tables, Data Manipulation Language commands (SELECT, INSERT, DELETE) read and update customer or order data, DCL controls access, and TCL manages transactions. That's it, very simple.

Now the third question: what is a primary key in SQL? A primary key is like a unique ID for each record in a table. Think of it as a way to ensure that no two rows in a table share the same value in that column; and remember the second rule, that the primary key column can't hold empty or NULL values. Those are the basic criteria for a column to be a primary key. For example, in a customers table you might have a column called customer_id as the primary key; each customer then has a unique customer_id like 1, 2, 3, and so on, which makes it easy to identify and retrieve specific customers from the database. Here's a simple example: suppose we create the table with CREATE TABLE customers, declare customer_id as the PRIMARY KEY, and define name and city as VARCHAR columns. The primary key ensures every customer_id is unique, with no duplicates and no customer_id left blank (no NULL values). Primary keys also matter when linking tables together: if you have an orders table, you can use customer_id as a reference to connect each order to a specific customer, which helps maintain data integrity across the database.

The fourth question: what is a foreign key? A foreign key in SQL is a connection or link between two tables; it's a field in one table that refers to the primary key in another table. This creates a relationship between the tables and ensures that the data stays consistent.
For example, suppose you have two tables: a customers table with a primary key called customer_id, and a sales table with a customer_id field, which is a foreign key linking back to customer_id in the customers table. In that setup, customer_id in the sales table is the foreign key, and the customer_id it points to in the customers table is the primary key.

Now the fifth question: what is the difference between the DELETE and TRUNCATE commands? Both remove data from a table, but they work in different ways; let me break it down. The DELETE command is used when you want to remove specific rows from a table based on a condition, for example deleting all the customers from a specific city. It lets you be selective, but it's slower because it logs each row deletion, which also makes it possible to roll back the changes if you're using transactions. TRUNCATE, on the other hand, removes all the rows from a table at once, without allowing any condition. If you just want to clear a table in one go, you write TRUNCATE TABLE customers. It's much faster because it doesn't log individual row deletions; it simply clears the entire table in one operation. However, in most databases you can't roll back a TRUNCATE once it's done. So the key differences: DELETE is for specific rows, TRUNCATE clears the entire table; TRUNCATE is faster because it uses fewer system resources; DELETE can be rolled back within a transaction, TRUNCATE usually cannot; and DELETE logs each row deletion while TRUNCATE doesn't.
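The primary-key and foreign-key rules from the last two questions can be seen enforced in practice. A minimal sketch using SQLite, which enforces foreign keys only after the PRAGMA shown; table names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id))""")
conn.execute("INSERT INTO customers VALUES (1, 'John')")
conn.execute("INSERT INTO sales VALUES (10, 1)")  # fine: customer 1 exists

# Primary key: a duplicate customer_id is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Jane')")
    pk_enforced = False
except sqlite3.IntegrityError:
    pk_enforced = True

# Foreign key: a sale pointing at a missing customer is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (11, 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

Both violations raise IntegrityError, which is exactly the data-integrity guarantee the keys exist to provide.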
That's all for DELETE and TRUNCATE; any time an interviewer asks, just explain it like this.

Now the sixth question: what is a join in SQL, and what are its types? This is one of the most important questions you'll encounter in interviews. A join in SQL is used to combine data from two or more tables based on a related column, a key that links them together. It's like connecting puzzle pieces: joins help you see the bigger picture by merging related data. For example, with a customers table and a sales table, you can use a join to see which customer placed which order by linking them through a common column such as customer_id. As for the types, there are four main joins: inner join, left join, right join, and full outer join. An INNER JOIN combines rows from both tables where there is a match in the common column; think of it as the overlapping section of a Venn diagram, where only rows that exist in both tables are included. A LEFT JOIN (or left outer join) retrieves all the rows from the left table and only the matching rows from the right table; if there's no match, the result includes NULL values for the right table's columns. Think of it as including the entire left circle of the Venn diagram along with any matches in the right circle. A RIGHT JOIN (or right outer join) is similar to the left join but mirrored: it retrieves all the rows from the right table and the matching rows from the left table, with NULLs for the left table's columns where there's no match; picture the entire right circle of the Venn diagram along with any matches in the left circle.
Then we have the FULL JOIN: it combines rows when there's a match in either table, and where no match is found it fills in NULLs for the missing side. Think of it as combining both circles of the Venn diagram; everything from both tables is included.

Now the seventh question: what do you mean by a NULL value in SQL? It's very easy: a NULL value means a column has no data; it's missing or unknown. It is not the same as an empty string or the number zero, which represent actual values, whereas NULL represents no value at all. For example, if one row in a customers table doesn't have a phone number, the phone number column for that row is NULL.

The next question: define a unique key in SQL. A unique key ensures that all values in a column (or a combination of columns) are unique, that is, no duplicates are allowed. It's like having a rule guaranteeing that no two rows in the table share the same value in that column. For example, in a users table, the email column can carry a unique key to ensure that no two users register with the same email address. Remember the key points: unlike a primary key, a table can have more than one unique key, and unique keys allow NULL values while primary keys do not. These are important to remember, so if you're asked the difference between a primary key and a unique key, that's exactly the answer. As an example: CREATE TABLE users with user_id as an INTEGER PRIMARY KEY and email as a VARCHAR column with a UNIQUE constraint; the email column is then a unique key, so every email must be different.
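The inner-versus-left join behavior described above can be seen side by side. A minimal sketch on toy SQLite data (older SQLite lacks RIGHT and FULL joins, so only the two portable ones are shown); sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO sales VALUES (10, 1);  -- only Ann has a sale
""")

# INNER JOIN: only customers with a matching sale appear.
inner = cur.execute("""
    SELECT c.name, s.sale_id FROM customers c
    JOIN sales s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id""").fetchall()

# LEFT JOIN: every customer appears; Bob gets NULL for the sale columns.
left = cur.execute("""
    SELECT c.name, s.sale_id FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id""").fetchall()
```

Bob drops out of the inner join entirely but survives the left join with a NULL sale_id, exactly the Venn-diagram picture above.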
Now our next question: what is a database? A database is an organized way to store and manage data. Think of it as a digital filing cabinet where information is neatly arranged in tables with rows and columns: each row represents a record, and each column represents a specific detail about that record. For example, a database for a library might have a table for books, where the rows represent individual books and the columns include the title, author, and publication year. The main purpose of a database is to make it easy to store, manage, and quickly retrieve data whenever you need it; databases are used in everything from apps and websites to banking systems and e-commerce platforms.

Now question number 10: explain the differences between SQL and NoSQL databases. Here's a simple explanation. SQL databases are structured: they store data in tables with rows and columns, like a spreadsheet. They follow a predefined schema, meaning the structure of the data is fixed and you need to define it before adding any data. These databases are great when you need consistent, reliable data, as in banking systems or inventory management; examples are MySQL, Oracle, Microsoft SQL Server, and PostgreSQL. SQL databases are also known as RDBMS, relational database management systems. NoSQL databases, by contrast, are flexible and don't rely on tables; they can handle unstructured or semi-structured data. A NoSQL database is dynamic, with data typically stored as JSON documents, key-value pairs, graph nodes, and so on, without a fixed structure. Such databases are usually not the first choice for complex query operations; examples include MongoDB, CouchDB, and Elasticsearch.

Now question number 11: what are a table and a field in SQL? A table is like a spreadsheet that stores data in an organized way using rows and columns; each table contains records and their details.
For example, a table named employees could store information about the employees in a company, whereas a field is a column in a table representing a specific attribute or property of the data; in the employees table, fields could be employee_id, name, and department. In the example table, the whole table is called employees, each row (or record) stores information about one employee, and each column (or field) holds a specific detail like employee_id, name, or department.

Now question number 12: describe the SELECT statement. The SELECT statement in SQL is used to retrieve data from one or more tables; it's like asking the database, show me this specific information. Here's how it works. You can specify which columns you want to see: to retrieve all customer names from a customers table, write SELECT name FROM customers; to retrieve all the data, write SELECT * FROM customers (remember, as I said in the first question, the star retrieves every column). You can also apply filters with a WHERE clause, for example SELECT name FROM customers WHERE city = 'New York'. And you can sort the results with ORDER BY: to sort customers by name, write SELECT name FROM customers ORDER BY name ASC, where ASC means ascending order. In short, the SELECT statement lets you choose exactly what data you want to see.

Now let's talk about what a constraint in SQL is, and name a few. If you're asked this, simply answer: a constraint in SQL is a rule applied to a table that ensures the data stored is accurate and consistent.
Constraints also help maintain data integrity by restricting what values can be added or modified in a table. Some common constraints are PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL, and DEFAULT. We've already discussed the primary key: it ensures each row in a table has a unique identifier and the column can't contain NULLs. A foreign key links a column in one table to a primary key in another to maintain a relationship. A unique key ensures all values in a column are distinct, with no duplicates. CHECK ensures data meets a specific condition before being inserted or updated, and NOT NULL ensures a column cannot hold NULL values. Constraints are essential for keeping the data in your database reliable and valid.

Now let's talk about what normalization in SQL is. Normalization is a process used to organize data in a database to make it more efficient and reliable. The goal is to reduce redundancy (duplicate data) and ensure data consistency. This is done by splitting a large table into smaller related tables and then linking them using relationships such as primary and foreign keys. For example, imagine a single table that stores customer details together with orders: if the same customer places multiple orders, their information, like name and address, is repeated for each order. Using normalization, you separate this into two tables: a customers table storing customer details (customer_id, name, address) and an orders table storing order details (order_id, customer_id, order date). By linking these tables through customer_id, you reduce duplication and ensure that any change to customer details is made in just one place.

Now question number 15: how do you use the WHERE clause? It's very easy.
Just answer: the WHERE clause within SQL queries serves to selectively filter rows according to a specified condition, so you fetch only the rows that match the criteria you define; for example, SELECT * FROM employees WHERE department = 'HR'.

Now question number 17: the difference between UNION and UNION ALL. UNION is used to merge the contents of two structurally compatible result sets into a single combined result. The difference is that UNION omits duplicate records, whereas UNION ALL includes them; it's that easy. The performance of UNION ALL is typically better than UNION, since UNION requires the server to do the additional work of removing duplicates, so when you're certain there are no duplicates, or duplicates aren't a problem, UNION ALL is recommended for performance.

Now question number 18: a table is given, and you have to work out the result of the query SELECT * FROM runners WHERE id NOT IN (SELECT winner_id FROM races). Given the sample data provided, the result of this query is an empty set. The reason is as follows: if the set evaluated by SQL's NOT IN condition contains any NULL value, the outer query returns an empty set, even if many runner IDs fail to match any winner ID in the races table.

Question number 19: what are indexes in SQL? Indexes in SQL are like shortcuts for finding data in a table quickly. Instead of scanning every row one by one, an index creates a sorted structure based on one or more columns, making data retrieval much faster. Think of the index in a book: if you're looking for a specific topic, you go to the index at the back and find the page number instead of flipping through every page.
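The NOT IN pitfall from question 18 is easy to reproduce. A minimal sketch on toy SQLite data; the runners and races values are assumptions chosen to trigger the NULL behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE runners (id INTEGER, name TEXT);
CREATE TABLE races (id INTEGER, winner_id INTEGER);
INSERT INTO runners VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO races VALUES (10, 2), (11, NULL);  -- one race has no winner yet
""")

# Because the subquery's set contains a NULL, "id NOT IN (...)" can never
# evaluate to true, so the outer query returns no rows at all.
empty = cur.execute("""
    SELECT * FROM runners
    WHERE id NOT IN (SELECT winner_id FROM races)""").fetchall()

# Filtering out the NULL restores the intuitive answer.
fixed = cur.execute("""
    SELECT * FROM runners
    WHERE id NOT IN (SELECT winner_id FROM races WHERE winner_id IS NOT NULL)
    ORDER BY id""").fetchall()
```

The first query comes back empty; the second correctly returns runners 1 and 3, the ones who never won a race.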
help the system quickly locate the rows you need so here’s how it works if you often search for customers by the name created an index will speed up those queries you can just write create index idx customer name on customer and then the customer name the database uses the index to find the row so you just have to run a query which is Select star from customers where name is John and then you can use the index to find a row with name is equals to John much faster let’s move on to question number 20 which is explain Group by in SQL the group by clause in SQL Will Group rows with the same values in a column allowing you to apply functions like sum count or average to each group for example in a sales table to find total sales by region you just simply have to write this query which is Select region some amount as total sales from Sales Group by region so the group the sales by region and calculates the total for each it’s a quick way to summarize data by categories so now let’s talk about question number 21 which is what is SQL Alias a SQL Alias is a temporary name you can give it to a table or a column in a query to make it easy to read or work with it’s like giving a nickname to something for clarity for example if you have a column named first name you can use an alias to rename it as first name in the query results you just simply have to write this query as select first name as first name in capital letter last name as last name from employe here the as keyword assign the Alias and the output will show The Columns as first name and last name aliases are also useful for tables so for this you can just write the code AS select e first name from Department table so this shortens table name for easier referencing alyses are not permanent they only exist while the query is running now let’s talk about the question number 22 which is explain orderby in SQL so you can answer this question like the order by clause in SQL is used to sort the result set of a query based 
on one or more columns you can specify each column sorting order ascending or descending for ascending you have to use ASC and for descending you have to use the ESC okay so just have to Simply write this query as select star from product order by Price DEC now let’s talk about question number 23 which is differences between where and having in SQL the where Clause is employed to restrict individual rows before they are grouped such as when filtering rows prior to a group by operation conversely the having Clause is utilized to filter groups of rows after they have been grouped like filtering groups based on aggregate values the having Clause it cannot be used without the group Clause whereas the where Clause specifies the criteria which individual records must mean the selected query it can be used with the group by Clause question number 24 is what is view in sec one more important question so and SQL view is essentially a virtual table that will derive its data from the outcome of a select query view serve multiple purposes including simplifying intricate queries enhancing data security through an added layer and enabling the presentation of targeted data subsets to users all while keeping the underlying table structur hidden now let’s move on to question number 25 which is what is a store procedure so if you asked this question just simply say a sequel stored procedure comprises of prec compiled SQL statements that can be executed together as a unified entity these procedures are commonly used to encapsulate business logic improve performance and also ensure consistent data manipulation practices that’s it now let’s move on to question number 26 which is one more important question which is what what is triggers in SQL a SQL trigger consists of a predefined sequence of actions that are executed automatically when a particular event occurs such as when an insert or delete operation is performed on a table triggers are employed to ensure data consistency conduct 
auditing and streamline various tasks so you can use insert trigger update Trigger or delete trigger accordingly now let’s talk about what are the aggregate functions and if you know them name a few it’s very easy to answer aggregate function and SQL perform calculations on a set of values and return a single result at first we have minimum which will get the minimum value from the resultant set then we have the max function which will give you the maximum value from the resultant set the sum will give you the sum of values from the resultant set average will give you the simple average of the resultant set and the count will count of numbers records from the resultant set now let’s talk about question number 28 which is how do you update a value in SQL the update statement serves the purpose of altering pre-existing records within a table it involves specifying the target from the update the specific columns to be modified and the desired new values to be applied for example if you want to update you can use Query like update employees set salary is equals to 6,000 where the department is ID now we’ll be moving on to some intermediate mediate SQL interview question and answers so one of the question is what is a self join and how would you use it I would like to repeat again these join types of question is very important these are often asked in interviews so talking about what is a self joint a self join and squ is a type of join where a table is joined with itself it’s useful for comparing rows with the same table or exploring hierarchal relationship such as finding employees and the managers in an organization so imagine if you have an employee table so you have employee ID name and the manager ID so if you want to find each employee and the manager you can use a self jooin you can just simply write a query as select e name as employee M name as manager from employee left join employees on manager so I’ve already discussed with you before what is the meaning of 
Left Right and self jooin so here the table joined with itself using manager ID to link each employee to the manager a self joint is helpful for comparing rows in the same table or working with hierarchial data so now let’s move on to question number 30 which is explain different types of joints with example at first we have inner joint the inner joint will gather rows that have matching values in both the tables then we have the right joint it will gather all the rows from the right table and any matching rows from the left table left join will gather all the rows from the left table and any matching rows from the right table and the full joint will gather all rows where there’s a match in either table including unmatched rows from both the tables very easy now let’s move on to question number 31 which is what is subquery and provide it using an example so subquery basically refers to a query that is embedded within another query serving the purpose of fetching information that will subsequently be employed as a condition or value within the encompassing out a query so you can just use this uh query which is Select name from employees where salary is greater than select average from salary from employees now the next question is how do you optimize SQL queries so basically the answer to this question would be something like SQL query optimization involves improving the performance of SQL queries by reducing resource usage and execution time strategies include using appropriate indexes optimizing very structured and avoiding cost operations like fully table scans now let’s talk about question number 33 which is what are correlated subqueries it’s a type of subquery that makes reference to columns from the surrounding outer query this subquery is executed repeatedly once for each row being processed by the outer query and its execution depends on the outcomes of the outer query now we’ll be talking about what is a transaction in SQL and it’s very important one of 
the most important question asked every time in SQL interview questions so basically a transaction in SQL is a group of one or more SQL commands that are treated as a single unit it ensures that all the operations in the group either succeed completely or fail entirely this guarantees the Integrity of the database imagine you’re transferring money from your bank account to a friend’s account that the bank first deducts the amount from your account account and then it adds the same amount to your friend’s account these two steps together form a transaction if one of these steps fails example the system crashes after deducting money from your account but before adding it to your friend’s account then the entire transaction is rolled back meaning no money is transferred and the database returns to its original state so you can also explain this question with the help of example that would be more you know clear to the interviewer now let’s talk about what are asset properties in SQL so basically asset stands for atomicity consistency isolation and durability and these are Key properties that ensures database transactions are reliable and maintain data Integrity atomicity you can think of it as All or Nothing a transaction is a single unit of work if any part of the transaction fails then the entire transaction is rolled back and no changes are made to the database for example if you’re transferring money between two accounts either both the debit and credit operations happen or neither does the second we’re going to talk about is consistency the database must always be in valid State a transaction takes the database from one valid state to another following all the rules and constraints for example if a transaction adds a record that violates a rule like a duplicate primary key then the transaction fails key keeping the database consistent isolation transactions don’t interfere with each other even if multiple transactions are running at the same time each transaction 
works as if it’s the only one happening example if you two people are updating the same record then one transaction will wait until the other is complete talking about durability once a transaction is committed it’s permanent even if there’s a power outage or system failure the data is saved and it won’t be lost after you complete an online purchase the transaction is stored securely even if the server crashes immediately after so this was for the asset properties and now we’ll be moving on to our next question which is how to you implement error handling in SQL error handling in SQL is a process to manage and respond to errors that occur during query execution different database system have specific ways to handle errors in SQL server the TR catch block is commonly used the tri block contains the main operation while the catch block handles errors if they occur for instance in a transaction you can use roll back in a catch block to undo changes if something goes wrong similarly in Oracle the exception block within PL SQL is used to handle errors if an error arises the exception block executes rolling back the transaction and the logging the error message by implementing error handling you ensure that operations fail gracefully without corrupting data making the database operations more reliable and secure next question which is describe the data types in SQL SQL supports various types of data types which Define the kind of data a column can hold these are broadly categorized into numeric character data type and binary types so we have numeric data types like integer float then we have character string like Car Bar we also have uni code character string like N N Text then we have binary which includes binary image date and time which includes date and date and time then we also have some miscellaneous data types which is XML and Json so the next question is explain normalization and denormalization often this question is asked in this way also or it could be asked 
something like explain the difference between normalization and denormalization so to answer this you have to just simply explain what normalization is which I have already discussed before once again I’m seeing you normalization and denormalization are ways to organize data in a database normalization is all about breaking big tables into smaller ones to remove duplicate data and improve accuracy for example instead of repeating customer details in every order you create one table for customers and another for orders linking them with a key denormalization on the other hand is when you combine or duplicate data to make it faster and retrieve for instance you might add customer details directly to the c table so that you don’t need to join tables during a query normalization help you space and maintain consistency while denormalization makes data retrieval quicker depending on what the database needs let’s move on to a next question which is what is a clustered index it’s very easy just simply answer by saying that a cluster index in SQL determines the phys physical order of the data rows in a table each table can have only one clustered index which impacts the table storage structure rows in a table are physically stored in the same order as the clustered index key now we have next question which is how do you prevent SQL injection so talking about this question SQL injection is a security risk where attackers insert harmful code into SQL queries potentially accessing or tampering it with your database to prevent this you can use parameterized queries or repair statements to handle the user input safely you can validate inputs to allow only expected values used store procedures to separate logic from data limit database permission non Escape special characters these steps help you secure that your database is free from SQL injection attacks the next question on the list is explain the concept of database schema in SQL a database schema functions as a conceptual 
container for housing various database elements such as tables views indexes and procedures its primary purpose is facilitate the organization and segregation of these databases elements while specifying their structure and interconnections next question is how we data Integrity insured in SQL just simply answer by saying that data Integrity in SQL is ensured through various means including constants example primary Keys foreign Keys check constants normalization trans actions and referential integrity constants as well these mechanism prevent invalid or inconsistent data from being stored in the database question number 42 which is what is an SQL injection we have already discussed about how we can protect our data from SQL injections so now let’s discuss what is basically a SQL injection so SQL injection in cyber security attack that involves insertion of malous SQL code into applications in input fields or parameters this unauthorized action enables attackers to illicitly access a database extract confidential information or manipulate the data the next question is how do we create a stored procedure you use the create procedure statement to create a stor procedure in SQL a stor procedure can contain SQL statements parameters and variables so here’s a very simple example you can just simply create by writing this query as create procedure get employ by ID add employee ID integer as begin select star from employees where employee ID is equals to add employee ID and then you have just have to write end that’s it so next question is what is a deadlock in SQL and how it can be prevented one more important question often asked an interview so you have to answer something by saying that a deadlock in SQL happens when two or more transactions are stuck because they are waiting for each other to release resources it’s just like two people trying to go through a narrow door at the same time each refusing to step back and let the other pass transaction a locks table one 
and weights to access table two transaction B locks table two and weights to access table one this is just a simple example so we can see that both the transactions are waiting for each other neither can proceed creating a deadlock so it’s very simple and how we can prevent this deadlock is by locking hierarchies always access resources in the same order so that transactions don’t block each other timeouts set a time limit for transaction to wait out for the resources you can also use deadlock detection and resolution system to detect Deadlocks and cancel one transaction and let the other proceed now let’s move on to a last question on the list which is difference between in and exist in basically works on list result set it doesn’t work on subqueries creating a virtual table with multiple columns Compares every value in the result list performance is comparatively slow of a large result set of subquery whereas the exist works on Virtual tables it is used with correlated queries exist comparison when matches found and the performance is comparatively fast for larger result set of subquery so guys that’s it for this video on the top 45 SQL interview question asked in SQL interviews ever wondered how seems to know exactly what you want before you do that’s the magic of data analytics imagine you’re shopping for a camera and suddenly Amazon suggests the perfect lens tripod and memory card all before you even think of them it’s not magic but the power of analyzing massive data sets to track what millions of Shoppers like you search for and buy together this helps Amazon create a personalized shopping experience that boosts sales and keeps your coming back from predicting Trends to fine-tuning their stock Talk data analyis is a secret Source behind their seamless shopping experience hey everyone welcome back to Simply n’s YouTube channel today we have got an exciting topic lined up the top 10 data analytics certifications I will be walking you through the expanding 
scope and financial growth of data analytics worldwide why pursuing a data analytic certification is essential and finally the top 10 data analytics certifications that can supercharge your carrier that can open doors to exciting opportunities let’s Dive In and explore the world of data analytics together now let us explore the expanding scope and the financial growth of data analytics the scope of data analytics is worst promising Financial growth and Rising salaries for data analytics scientists and Engineers as Industries digitalize demand surges and finance for fraud detection healthc care for predictive diagnosis retail for personalized marketing and Manufacturing for productive maintenance Innovations like augmented analytics and realtime processing enhances importance companies like Google Amazon Microsoft and IBM consistently higher analytics experts in India entry level salaries range from 4 to six lakhs with perom with experienced professionals earning 10 to 20 lakhs perom in USA entry level salaries are1 60,000 to1 80,000 with experience roles at do$ 100,000 to1 15,000 plus the future promises greater advancements making data analytics a lucrative field with work potential now let us see why is pursuing a data analytics certification essential pursuing a data analytics certification is crucial as it validates your expertise boost your credibility and lights up your resume in a competitive job market certifications provide you with in demand skills like data visualization statistical analysis and machine learning keeping you current with the industry Trends they can lead you to paying high paying job roles and career growth as employers favor certified Professionals for job data driven positions whether you’re starting or advancing your career or certification showcases your commitments and skills enhancing job prospects in fields like Finance Healthcare retail and Tech as well so all right guys the moment you have been waiting for is here it’s time to 
reveal the top data analytics certifications by simply learn buckle up and let’s dive into this carrier boosting programs that will set you on the path of success coming to the number one that is a post-graduate program in data analytics boost your career with simply Lars postgraduate program ineda analytics offered in partnership with bir University and in collaboration with IBM this comprehensive 8mon live online course is perfect for professionals from any background and covers crucial skills like data analysis visualization and supervised learning using python R SQL and powerbi the program features master classes by Purdue faculty and IBM experts Hands-On projects with real world data sets from Google Play Store lift and more and exclusive hackathons and AMA sessions receive joint certifications from Padu and simply learn IBM recognized certificates and benefit from carrier Support Services like resum building and job assistance through simply learns job assist no prior experience required just a bachelor’s degree with at least 50% marks is required enroll now to gain industry relevant experience and stand out to the top employers like Google and Amazon to check for the coast Link in the description box and pin comments below now moving on to the number two that is calch postgraduate progr in data science Advance your career with simply learns postgraduate program in data science in collaboration with calch ctme and IBM this comprehensive 11 month live online course covers essential skills and tools including python machine learning data visualization generative AI promt engineering chat juty and more with master classes by Caltech instructors and IBM experts you will G hands-on experience to 25 plus industy Rel projects Capstone projects across three romens and seamless access to integrated Labs on a tees program completion certificate and up to 14 counting education units from CTIC ctme along with the NY recognized IBM certificates enhance your career with 
job assistance master classes and exclusive hackathons with no prior work experience required this program is suitable for professionals from any background who hold a bachelor’s degree enroll now to become a data science expert and stand out to top employers to check for the course Link in the description box below and pin comments now moving on to the number third that is professional certificate programming data analytics and generative Advance your career with professional certificate program in data analytics and generative AI by simply learn in collaboration with E and ICT Academy IIT goti and IBM this comprehensive 11mon live online program is designed to equip you with cutting a skills in data analytics and generative AI covering essential tools like SQL Excel python W power VI and more learn from distinguished I faculty and IBM experts through interactive master classes Hands-On projects and Capstone experiences gain practical expertise with exposure to jni tools such as chaty and Gemini and earn industry recognized certifications from IVM along with the executive alumni status from I goti enhance your Professional Profile with simply learn job assess resume building and job placement support to get noticed by the top hiring companies enroll now to elevate your career and join network of Industry leaders do check for the co Link in the description box below and pin comments moving on to the number four that is professional certificate course in D s Master data science with a professional certificate course in data science by simply learn in collaboration with ICT Academy I kpur this comprehensive 11 month live online program equips with essential skills and tools such as python power BW chat jity and more benefit from the master classes delivered by distinguished IIT kpur faculty gain practical experience with 25 plus Hands-On projects and access integrated La for real world training with dedicated modules on generative AI prompt engineering and 
explainable AI you will stay ahead in the rapidly evolving AI landscape ear a prestigious program completion certificate from E and ICT Academy IIT kpur and take advantage of Simply Lars job asset to enhance your Professional Profile and stand out to recruiters apply now to enhance your career in data science and AI do check for the course Link in the description box below and pin comments now moving on to the fifth one that is the postgraduate program in data science supercharge your career with the postgraduate program in data science by simply learn in collaboration with bird University and IBM ranked as the number one data science program by Economic Times this 11 month live online program equips with with the in demand skills including python machine learning deep learning NLP data visualization generative Ai and chargeability benefit from the master classes led by Purdue faculty and IBM experts engageing Hands-On training with 25 plus projects and free Capstone projects and gain access to Industry leading tools such as T flow carers powerbi and more earn dual certificates from perue University online and IBM boosting your Professional Profile and carrier prospects the simply learns job assess receive guidance and resume support to stand out to the top employers applications close on November 8 2024 and enroll now to transform your career in data science and AI to check for the course Link in the description box below and pin comments now moving on to the sixth one that is applied Ai and data science Advance your career with applied Ai and data science program offered by Brown University’s School of Professional studies and collaboration with simply learn this 14 week CPL program empowers you with essential skills in AI generative Ai and data science including handson learning and Industry Rel projects learn from Steam Brown faculty through top not video content and monthly live master classes covering tools and Concepts such as python machine learning neural 
lent walking and jpt models benefit from a curriculum design to refine your expertise supported by integrated labs and exclusive content on generative AI andn a prestigious certificate of completion from Brown University and a credly badge upon program completion enhance your profile with simply L job asset resumee building support and exclusive I IM job membership to stand out in today’s competitive job market enroll now to gain The Cutting age knowledge and take your carrier in Ai and data science to the next level do check for the course Link in the description box below and pin comments now moving on to the next that is the data analyst elevate your career with simplys data analyst certification rank number one by carrier Karma this comprehensive 11 month program is designed to transform you into a data analytic expert with practical training and SQL R python data visualization and predictive analystics learn through live interactive classes Capstone projects and 20 plus Hands-On projects that ensure Real World Experience G industry recognized certifications from Simply learn and IBM access exclusive master classes and am sessions by IBM experts and receive dedicated job assistance to help you stand out to the top employers like Amazon Microsoft and Google start your journey to becoming a data analytics professional today with simply learns trusted and robust training program to check for the course Link in the description box below and pin comments now moving on to the next one that is data scientist Advance your career with simply learns industry leading data scientist certification program now ranked number one by carer Karma this 11 month course in collaboration with IBM equips you with the essential data science skills including python SQL machine learning generative Ai and W gain practical Real World experience to 25 plus Hands-On projects and a Capstone project benefit from master classes by IBM experts interactive live sessions led by industry 
professionals and lifetime access to the self placed learning content simply lears job assess program further boost your carer prospects helping you stand out to thep employers like Amazon Microsoft and Google to check for the course Link in the description box below and pin comments now moving on to the second last one that is the professional certificate program in data engineering launch your data engineering career with simply launch professional certificate program in data engineering offered in partnership with P University online this 32e program accuses with the indman skills covering python SQL nosql Big Data AWS Azure and snowflake fundamentals aligned with industry recognized certifications like AWS certified data engineer Microsoft 203 and snow Pro core this course ensures comprehensive learning through live online classes practical projects and a Capstone experience gain access to puru Alumni Association exclusive master classes and simplys job asset for carer support join now to become a certified data engineer and FASTT trck eradio to high impact roles in the field do check for the course Link in the description box below and pin comments now moving on to the last but not Le is Microsoft certified as your data engineer associate dp23 Advanced your carer will simply learns Microsoft certified Azure data engineer associate dp23 training aligned with official certification Master essential Azure skills like data integration transformation and storage while gaining hands-on experience with the key services such as Azure signups analytics data Factory and Azure data braas benefit from live online classes led by Microsoft certified trainers access to official Microsoft handbooks practice lab and comprehensive practice test to help you excellent dp23 exam this course designed for real world application ensures you develop job ready skills and earn a official course completion batch hosted on the Microsoft learn portal enroll now to elevate your data 
engineering expertise do check for the course Link in the description box below and pin comments so getting a data analytic certification can be a game changer for your growth however choosing the right certification is crucial it’s like finding the perfect key to unlock your potential select the one that best aligns with your career goals and S SK to maximize your journey in data analytics so that’s a WRA so that concludes our SQL full course if you have any doubts or question you can ask them in the comment section below our team of experts will reply you as soon as possible thank you and keep learning with simply staying ahead in your career requires continuous learning and upscaling whether you’re a student aiming to learn today’s top skills or a working professional looking to advance your career we’ve got you covered explore our impressive catalog of certification programs in cuttingedge domains including data science cloud computing cyber security AI machine learning or digital marketing designed in collaboration with leading universities and top corporations and delivered by industry experts choose any of our programs and set yourself on the path to Career Success click the link in the description to know more hi there if you like this video subscribe to the simply learn YouTube channel and click here to watch similar videos to nerd up and get certified click here

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • SQL Full Course for Beginners (30 Hours) – From Zero to Hero

    SQL Full Course for Beginners (30 Hours) – From Zero to Hero

    YouTube Video

    SQL Full Course for Beginners (30 Hours) – From Zero to Hero

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Modern SQL Data Warehouse Project: A Comprehensive Guide

    Modern SQL Data Warehouse Project: A Comprehensive Guide

    This source details the creation of a modern data warehouse project using SQL. It presents a practical guide to designing data architecture, writing code for data transformation and loading, and creating data models. The project emphasizes real-world implementation, focusing on organizing and preparing data for analysis. The resource covers the ETL process, data quality, and documentation while building bronze, silver, and gold layers. It provides a comprehensive approach to data warehousing, from understanding requirements to creating a professional portfolio project.

    Modern SQL Data Warehouse Project Study Guide

    Quiz:

    1. What is the primary purpose of data warehousing projects?
    2. Briefly explain the ETL/ELT process in SQL data warehousing.
    3. According to Bill Inmon’s definition, what are the four key characteristics of a data warehouse?
    4. Why is creating a project plan crucial for data warehouse projects, according to the source?
    5. What is the “separation of concerns” principle in data architecture, and why is it important?
    6. Explain the purpose of the bronze, silver, and gold layers in a data warehouse architecture.
    7. What are metadata columns, and why are they useful in a data warehouse?
    8. What is a surrogate key, and why is it used in data modeling?
    9. Describe the star schema data model, including the roles of fact and dimension tables.
    10. Explain the importance of clear documentation for end users of a data warehouse, as highlighted in the source.

    Quiz Answer Key:

    1. Data warehousing projects focus on organizing, structuring, and preparing data for data analysis, forming the foundation for any data analytics initiatives.
    2. ETL/ELT in SQL involves extracting data from various sources, transforming it to fit the data warehouse schema (cleaning, standardizing), and loading it into the data warehouse for analysis and reporting.
    3. According to Bill Inmon’s definition, the four key characteristics of a data warehouse are subject-oriented, integrated, time-variant, and non-volatile.
    4. Creating a project plan is crucial for data warehouse projects because they are complex, and a clear plan improves the chances of success by providing organization and direction, reducing the risk of failure.
    5. The “separation of concerns” principle involves breaking down a complex system into smaller, independent parts, each responsible for a specific task, to avoid mixing everything and to maintain a clear and efficient architecture.
    6. The bronze layer stores raw, unprocessed data directly from the source systems, the silver layer contains cleaned and standardized data, and the gold layer holds business-ready data transformed and aggregated for reporting and analysis.
    7. Metadata columns are additional columns added to tables by data engineers to provide extra information about each record, such as create date or source system, aiding in data tracking and troubleshooting.
    8. A surrogate key is a system-generated unique identifier assigned to each record to make the record unique. It provides more control over the data model without dependence on source system keys.
    9. The star schema is a data modeling approach with a central fact table surrounded by dimension tables. Fact tables contain events or transactions, while dimension tables hold descriptive attributes, related via foreign keys.
    10. Clear documentation is essential for end users to understand the data model and use the data warehouse effectively.

    Essay Questions:

    1. Discuss the importance of data quality in a modern SQL data warehouse project. Explain the role of the bronze and silver layers in ensuring high data quality, and provide examples of data transformations that might be performed in the silver layer.
    2. Describe the Medallion architecture and how it’s implemented using bronze, silver, and gold layers. Discuss the advantages of this architecture, including separation of concerns and data quality management, and explain how data flows through each layer.
    3. Explain the process of creating a detailed project plan for a data warehouse project using a tool like Notion. Describe the key phases and stages involved, the importance of defining epics and tasks, and how this plan contributes to project success.
    4. Explain the importance of source system analysis in a data warehouse project, and describe the key questions that should be asked when connecting to a new source system.
    5. Compare and contrast the star schema with other data modeling approaches, such as snowflake and data vault. Discuss the advantages and disadvantages of the star schema for reporting and analytics, and explain the roles of fact and dimension tables in this model.

    Glossary of Key Terms:

    • Data Warehouse: A subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support management’s decision-making process.
    • ETL (Extract, Transform, Load): A process in data warehousing where data is extracted from various sources, transformed into a suitable format, and loaded into the data warehouse.
    • ELT (Extract, Load, Transform): A process similar to ETL, but the transformation step occurs after the data has been loaded into the data warehouse.
    • Data Architecture: The overall structure and design of data systems, including databases, data warehouses, and data lakes.
    • Data Integration: The process of combining data from different sources into a unified view.
    • Data Modeling: The process of creating a visual representation of data structures and relationships.
    • Bronze Layer: The first layer in a data warehouse architecture, containing raw, unprocessed data from source systems.
    • Silver Layer: The second layer in a data warehouse architecture, containing cleaned and standardized data ready for transformation.
    • Gold Layer: The third layer in a data warehouse architecture, containing business-ready data transformed and aggregated for reporting and analysis.
    • Subject-Oriented: Focused on a specific business area, such as sales, customers, or finance.
    • Integrated: Combines data from multiple source systems into a unified view.
    • Time-Variant: Keeps historical data for analysis over time.
    • Non-Volatile: Data is not deleted or modified once it enters the data warehouse.
    • Project Epic: A large task or stage in a project that requires significant effort to complete.
    • Separation of Concerns: A design principle that breaks down complex systems into smaller, independent parts, each responsible for a specific task.
    • Data Cleansing: The process of correcting or removing inaccurate, incomplete, or irrelevant data.
    • Data Standardization: The process of converting data into a consistent format or standard.
    • Metadata Columns: Additional columns added to tables to provide extra information about each record, such as creation date or source system.
    • Surrogate Key: A system-generated unique identifier assigned to each record, used to connect data models and avoid dependence on source system keys.
    • Star Schema: A data modeling approach with a central fact table surrounded by dimension tables.
    • Fact Table: A table in a data warehouse that contains events or transactions, along with foreign keys to dimension tables.
    • Dimension Table: A table in a data warehouse that contains descriptive attributes or categories related to the data in fact tables.
    • Data Lineage: Tracking the origin and movement of data from its source to its final destination.
    • Stored Procedure: A precompiled collection of SQL statements stored under a name and executed as a single unit.
    • Data Normalization: The process of organizing data to reduce redundancy and improve data integrity.
    • Data Lookup: Joining tables to retrieve specific data, such as surrogate keys, from related dimensions.
    • Data Flow Diagram: A visual representation of how data moves through a system.

    Modern SQL Data Warehouse Project Guide


    Briefing Document: Modern SQL Data Warehouse Project

    Overview:

    This document summarizes the key concepts and practical steps outlined in a guide for building a modern SQL data warehouse. The guide, presented by Bar Zini, aims to equip data architects, data engineers, and data modelers with real-world skills by walking them through the creation of a data warehouse project using SQL Server (though adaptable to other SQL databases). The project emphasizes best practices and provides a professional portfolio piece upon completion.

    Main Themes and Key Ideas:

    1. Data Warehousing Fundamentals:
    • Definition: The project begins by defining a data warehouse using Bill Inmon’s classic definition: “A data warehouse is subject oriented, integrated, time variant, and nonvolatile collection of data designed to support the Management’s decision-making process.”
    • Subject Oriented: Focused on business areas (e.g., sales, customers, finance).
    • Integrated: Combines data from multiple source systems.
    • Time Variant: Stores historical data.
    • Nonvolatile: Data is not deleted or modified once entered.
    • Purpose: To address the inefficiencies of data analysts extracting and transforming data directly from operational systems, replacing it with an organized and structured data system as a foundation for data analytics projects.
    • SQL Data Warehousing in Relation to Other Types of Data Analytics Projects: The guide notes that SQL data warehousing is the foundation of any data analytics project and the first step before exploratory data analysis (EDA) and advanced analytics projects can be done.
    2. Project Structure and Skills Developed:
    • Roles: The project is designed to provide experience in three key roles: data architect, data engineer, and data modeler.
    • Skills: Participants will learn:
    • ETL/ELT processing using SQL.
    • Data architecture design.
    • Data integration (merging multiple sources).
    • Data loading and data modeling.
    • Portfolio Building: The guide emphasizes the project’s value as a portfolio piece for demonstrating skills on platforms like LinkedIn.
    3. Project Setup and Planning (Using Notion):
    • Importance of Planning: The guide stresses that “creating a project plan is the key to success.” This is particularly important for data warehouse projects, where a high failure rate (over 50%, according to Gartner reports) is attributed to complexity.
    • Iterative Planning: The planning process is described as iterative. An initial “rough project plan” is created, which is then refined as understanding of the data architecture evolves.
    • Project Epics (Main Phases): The initial project phases identified are:
    • Requirements analysis.
    • Designing the data architecture.
    • Project initialization.
    • Task Breakdown: The project uses Notion (a free tool) to organize the project into epics and subtasks, enabling a structured approach.
    • The guide also mentions using icons to give the project a personal style and keep it organized.
    • Project success: One important element of a successful project is being able to visualize the whole picture; closing small chunks of work and tasks gives a sense of motivation and accomplishment.
    4. Data Architecture Design (Using Draw.io):
    • Medallion Architecture: The guide advocates for a “Medallion architecture” (Bronze, Silver, Gold layers) within the data warehouse.
    • Separation of Concerns: A core architectural principle is “separation of concerns”: breaking the complex system down into independent parts, each responsible for a specific task, with no duplication of components. As the guide puts it, “a good data architect follows this principle.”
    • Layer Responsibilities:
    • Bronze Layer (Raw Data): Contains raw data, with no transformations (“in the bronze layer it’s going to be the raw data”).
    • Silver Layer (Cleaned and Standardized Data): Focuses on data cleansing and standardization (“in the silver you have clean, standardized data”).
    • Gold Layer (Business-Ready Data): Contains business-transformed data ready for analysis (“for the gold we can say business-ready data”).
    • Data Flow Diagram: The project utilizes Draw.io (a free diagramming tool) to visualize the data architecture and data lineage.
    • Naming Conventions: A naming convention is defined to ensure clarity and consistency, with specific naming rules for tables and columns. Examples include fact_sales for a fact table and dim_customers for a dimension. It is recommended to document each rule clearly and include examples so there is a general consensus about how to proceed.
    5. Project Initialization and Tools:
    • Software: The project uses SQL Server Express (database server) and SQL Server Management Studio (client for interacting with the database). Other tools include GitHub and Draw.io. Notion is used for project management.
    • Initial Database Setup: The guide outlines the creation of a new database and schemas (Bronze, Silver, Gold) within SQL Server.
    • Git Repository: The project emphasizes the importance of using Git for version control and collaboration. A repository structure is established with folders for data sets, documents, scripts, and tests.
    • ReadMe: it is important to create a read me file at the root of the repo where the main characteristics and goal of the repo are specified so that other developers can have a better understanding of the project when collaborating.
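
    The initial database setup described above can be sketched in T-SQL. The database name is illustrative; the three schemas follow the guide’s bronze/silver/gold layering.

    ```sql
    -- Create the project database (name is illustrative).
    CREATE DATABASE DataWarehouse;
    GO

    USE DataWarehouse;
    GO

    -- One schema per Medallion layer keeps each layer's objects separated.
    CREATE SCHEMA bronze;
    GO
    CREATE SCHEMA silver;
    GO
    CREATE SCHEMA gold;
    GO
    ```

    On SQL Server, `CREATE SCHEMA` must be the only statement in its batch, hence the `GO` separators.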
    6. Building the Bronze Layer
    • Building the bronze layer starts with analyzing the source data: interviewing source system experts, identifying where the data comes from, the size of the data to be processed, the performance of the source system (so extraction does not affect it), and authentication/authorization details such as access tokens, keys, and passwords.
    • The project then takes a step-by-step approach, from creating the required queries and stored procedures to loading the data efficiently. This includes testing that key columns have no nulls and that the separator used matches the data.
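
    A minimal T-SQL sketch of a bronze-layer load along these lines; the table, columns, file path, and procedure name are hypothetical examples, not the project’s actual objects.

    ```sql
    -- Bronze table mirrors the source file one-to-one; no transformations.
    CREATE TABLE bronze.crm_cust_info (
        cst_id        INT,
        cst_key       NVARCHAR(50),
        cst_firstname NVARCHAR(50),
        cst_lastname  NVARCHAR(50)
    );
    GO

    CREATE OR ALTER PROCEDURE bronze.load_bronze AS
    BEGIN
        -- Full load: empty the table, then reload everything from the file.
        TRUNCATE TABLE bronze.crm_cust_info;

        BULK INSERT bronze.crm_cust_info
        FROM 'C:\datasets\cust_info.csv'      -- hypothetical file location
        WITH (
            FIRSTROW = 2,                     -- skip the header row
            FIELDTERMINATOR = ',',            -- separator must match the file
            TABLOCK
        );
    END;
    GO
    ```

    The procedure name follows the `load_bronze` convention for stored procedures that load a layer.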
    7. Building the Silver Layer
    • The silver layer holds clean and standardized data. Tables are built inside the silver layer and loaded from the bronze layer using a full load (truncate, then insert), after which numerous data transformations are applied.
    • In the silver layer, we implement metadata columns that store information not coming directly from the source system, such as create and update dates, the source system, and the file location the data came from. These columns help track corrupted data and reveal gaps in the imported data.
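
    A sketch of the silver-layer pattern, truncate-and-insert plus a metadata column, with illustrative table and column names:

    ```sql
    -- Silver table: same grain as bronze, plus a metadata column
    -- added by the data engineer.
    CREATE TABLE silver.crm_cust_info (
        cst_id          INT,
        cst_key         NVARCHAR(50),
        cst_firstname   NVARCHAR(50),
        cst_lastname    NVARCHAR(50),
        dwh_create_date DATETIME2 DEFAULT GETDATE()  -- when the record entered the warehouse
    );
    GO

    -- Full load from bronze, with basic cleansing applied on the way in.
    TRUNCATE TABLE silver.crm_cust_info;

    INSERT INTO silver.crm_cust_info (cst_id, cst_key, cst_firstname, cst_lastname)
    SELECT
        cst_id,
        cst_key,
        TRIM(cst_firstname),   -- remove unwanted spaces
        TRIM(cst_lastname)
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL;  -- drop records with a missing key
    ```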
    8. Building the Gold Layer

    • The gold layer is very focused on business goals and should be easy to consume for business reports, which is why we create a data model for our business area.
    • A data model contains two types of tables: fact tables and dimension tables. Dimension tables are descriptive and give context to the data; one example is product information such as the product name, category, and subcategory. Fact tables are events, like transactions, that contain IDs from the dimensions.
    • The question that decides between the two: “how much” and “how many” point to a fact table; “who,” “what,” and “where” point to a dimension table.
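
    A sketch of how the gold layer can expose a dimension and a fact as views, following the `dim_`/`fact_` naming convention; the underlying silver tables and columns are assumed for illustration.

    ```sql
    -- Dimension: descriptive attributes, plus a system-generated surrogate key.
    CREATE VIEW gold.dim_customers AS
    SELECT
        ROW_NUMBER() OVER (ORDER BY cst_id) AS customer_key,  -- surrogate key
        cst_id        AS customer_id,
        cst_firstname AS first_name,
        cst_lastname  AS last_name
    FROM silver.crm_cust_info;
    GO

    -- Fact: events/transactions, looking up surrogate keys from the dimension.
    CREATE VIEW gold.fact_sales AS
    SELECT
        dc.customer_key,   -- who (dimension lookup)
        s.order_date,      -- when
        s.sales_amount,    -- how much (measure)
        s.quantity         -- how many (measure)
    FROM silver.crm_sales_details AS s
    LEFT JOIN gold.dim_customers AS dc
        ON s.cst_id = dc.customer_id;
    GO
    ```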

    9. General Data Cleaning
    • In the project we build data transformations and cleansing: insert statements with functions that transform and clean the data. This includes checks on primary keys, handling unwanted spaces, resolving inconsistencies in cardinality (the number of distinct values in a column), replacing null values, and fixing the dates and values of sales orders.
    • One tool for checking data quality during cleaning is the quality check: select the data that is incorrect, then apply a quick fix. Any numerical column is best validated against negative numbers and null values, and checked against its data type so it can be converted into the right format.
    • In the silver layer, data that is too old may have to be removed or flagged, and birthdates can be filtered to exclude dates in the future.
    • To find errors in SQL, you can wrap code blocks in try/catch and print error messages, numbers, and states so that errors are easier to locate and handle.
    • Missing values are common; the code includes techniques to fill them and to normalize the data.
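
    The quality-check idea can be expressed as queries that should each return zero rows; these examples use illustrative table and column names.

    ```sql
    -- 1) Primary key must be unique and not null.
    SELECT cst_id, COUNT(*) AS cnt
    FROM silver.crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL;

    -- 2) No unwanted spaces in string columns.
    SELECT cst_firstname
    FROM silver.crm_cust_info
    WHERE cst_firstname != TRIM(cst_firstname);

    -- 3) Numeric measures: no negatives or nulls.
    SELECT sales_amount
    FROM silver.crm_sales_details
    WHERE sales_amount <= 0 OR sales_amount IS NULL;

    -- 4) No birthdates in the future.
    SELECT cst_birthdate
    FROM silver.crm_cust_info
    WHERE cst_birthdate > GETDATE();
    ```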

    In summary, this guide provides a comprehensive, practical approach to building a modern SQL data warehouse, emphasizing structured planning, sound architectural principles, and hands-on coding experience. The emphasis on building a portfolio project makes it particularly valuable for those seeking to demonstrate their data warehousing skills.

    SQL Data Warehouse Fundamentals

    # What is a modern SQL data warehouse?

    A modern SQL data warehouse, according to the source, is a subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support management’s decision-making process. It consolidates data from multiple source systems, organizes it around business subjects (like sales, customers, or finance), retains historical data, and ensures that the data is not deleted or modified once loaded.

    # What are the key roles involved in building a data warehouse project?

    According to the source, building a data warehouse involves several roles, including:

    * **Data Architect:** Designs the overall data architecture following best practices.

    * **Data Engineer:** Writes code to clean, transform, load, and prepare data.

    * **Data Modeler:** Creates the data model for analysis.

    # What are the three types of data analytics projects that can be done using SQL?

    The three types of data analytics projects, according to the source, are:

    * **Data Warehousing:** Focuses on organizing, structuring, and preparing data for analysis, which is foundational for other analytics projects.

    * **Exploratory Data Analysis (EDA):** Involves understanding and uncovering insights from datasets by asking the right questions and finding answers using basic SQL skills.

    * **Advanced Analytics Projects:** Uses advanced SQL techniques to answer business questions, such as identifying trends, comparing performance, segmenting data, and generating reports.

    # What is the Medallion architecture and why is it relevant to designing a data warehouse?

    The Medallion architecture is a layered approach to data warehousing composed of:

    * **Bronze Layer:** Raw data “as is” from source systems.

    * **Silver Layer:** Cleaned and standardized data.

    * **Gold Layer:** Business-ready data with transformed and aggregated information.

    The Medallion architecture enables separation of concerns, allowing unique sets of tasks for each layer, and helps organize and manage the complexity of data warehousing. It provides a structured approach to data processing, ensuring data quality and consistency.

    # What tools are commonly used in data warehouse projects, and why is creating a project plan important?

    Common tools used in data warehouse projects include:

    * **SQL Server Express:** A local server for the database.

    * **SQL Server Management Studio (SSMS):** A client to interact with the database and run queries.

    * **GitHub:** For version control and collaboration.

    * **draw.io:** A tool for creating diagrams, data models, data architectures and data lineage.

    * **Notion:** A tool for project management, planning, and organizing resources.

    Creating a project plan is essential for success due to the complexity of data warehouse projects. A clear plan helps organize tasks, manage resources, and track progress.

    # What is data lineage, and why is it important in a data warehouse environment?

    Data lineage refers to the data’s journey from its origin in source systems, through various transformations, to its final destination in the data warehouse. It provides visibility into the data’s history, transformations, and dependencies. Data lineage is crucial for troubleshooting data quality issues, understanding data flows, ensuring compliance, and auditing data processes.

    # What are surrogate keys, and why are they used in data modeling?

    Surrogate keys are system-generated unique identifiers assigned to each record in a dimension table. They are used to ensure uniqueness, simplify data relationships, and insulate the data warehouse from changes in source system keys. Surrogate keys provide control over the data model and facilitate efficient data integration and querying.

    # What are some essential naming conventions for data warehouse projects, and why are they important?

    Essential naming conventions help ensure consistency and clarity across the data warehouse. Examples include:

    * Using prefixes to indicate the type of table (e.g., `dim_` for dimension, `fact_` for fact).

    * Consistent naming of columns (e.g., surrogate keys ending with `_key`, technical columns starting with `dw_`).

    * Standardized naming for stored procedures (e.g., `load_bronze` for bronze layer loading).

    These conventions improve collaboration, code readability, and maintenance, enabling efficient data management and analysis.

    Data Warehousing: Architectures, Models, and Key Concepts

    Data warehousing involves organizing, structuring, and preparing data for analysis and is the foundation for any data analytics project. It focuses on how to consolidate data from various sources into a centralized repository for reporting and analysis.

    Key aspects of data warehousing:

    • A data warehouse is subject-oriented, integrated, time-variant, and a nonvolatile collection of data designed to support management’s decision-making process.
    • Subject-oriented: Focuses on specific business areas like sales, customers, or finance.
    • Integrated: Integrates data from multiple source systems.
    • Time-variant: Keeps historical data.
    • Nonvolatile: Data is not deleted or modified once it’s in the warehouse.
    • ETL (Extract, Transform, Load): A process to extract data from sources, transform it, and load it into the data warehouse, which then becomes the single source of truth for analysis and reporting.
    • Benefits of a data warehouse:
    • Organized data: A data warehouse organizes data so that the data team is not constantly fighting with messy source data.
    • Single point of truth: Serves as a single point of truth for analyses and reporting.
    • Automation: Automates the data collection and transformation process, reducing manual errors and processing time.
    • Historical data: Enables access to historical data for trend analysis.
    • Data integration: Integrates data from various sources, making it easier to create integrated reports.
    • Improved decision-making: Provides fresh and reliable reports for making informed decisions.
    • Data Management: Sound data management is essential for making good, well-informed decisions.
    • Data Modeling: Creating a new, analysis-friendly data model.

    Different Approaches to Data Warehouse Architecture:

    • Inmon Model: Uses a three-layer approach (staging, enterprise data warehouse, and data marts) to organize and model data.
    • Kimball Model: Focuses on quickly building data marts, which may lead to inconsistencies over time.
    • Data Vault: Adds more standards and rules to the central data warehouse layer by splitting it into raw and business vaults.
    • Medallion Architecture: Uses three layers: bronze (raw data), silver (cleaned and standardized data), and gold (business-ready data).

    The Medallion architecture consists of the following:

    • Bronze Layer: Stores raw, unprocessed data directly from the sources for traceability and debugging.
    • Data is not transformed in this layer.
    • Typically uses tables as object types.
    • Full load method is applied.
    • Access restricted to data engineers only.
    • Silver Layer: Stores clean and standardized data with basic transformations.
    • Focuses on data cleansing, standardization, and normalization.
    • Uses tables as object types.
    • Full load method is applied.
    • Accessible to data engineers, data analysts, and data scientists.
    • Gold Layer: Contains business-ready data for consumption by business users and analysts.
    • Applies business rules, data integration, and aggregation.
    • Uses views as object types for dynamic access.
    • Suitable for data analysts and business users.

    The ETL Process: Extract, Transform, and Load

    The ETL (Extract, Transform, Load) process is a critical component of data warehousing used to extract data from various sources, transform it into a usable format, and load it into a data warehouse. The data warehouse then becomes the single point of truth for analyses and reporting.

    The ETL process consists of three key stages:

    • Extract: Involves identifying and extracting data from source systems without changing it. The goal is to pull out a subset of data from the source in order to prepare it and load it to the target. This step focuses solely on data retrieval, maintaining a one-to-one correspondence with the source system.
    • Transform: Manipulates and transforms the extracted data into a format suitable for analysis and reporting. This stage may include data cleansing, integration, formatting, and normalization to reshape the data into the required format.
    • Load: Inserts the transformed data into the target data warehouse. The prepared data from the transformation step is moved into its final destination, such as a data warehouse.

    In real-world projects, the data architecture may have multiple layers, and the ETL process can vary between these layers. Depending on the data architecture’s design, it is not always necessary to use the complete ETL process to move data from a source to a target. For example, data can be loaded directly to a layer without transformations or undergo only transformation or loading steps between layers.
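
    The three stages can be sketched as plain T-SQL steps; the source database, staging table, and target table names are illustrative.

    ```sql
    -- EXTRACT: pull a subset from the source, unchanged (one-to-one).
    SELECT order_id, customer_id, amount, order_date
    INTO #staging_orders                 -- temporary staging area
    FROM source_db.dbo.orders
    WHERE order_date >= '2024-01-01';

    -- TRANSFORM: cleanse and reshape the extracted data.
    UPDATE #staging_orders
    SET amount = 0
    WHERE amount IS NULL;                -- handle missing values

    -- LOAD: insert the prepared data into its final destination.
    INSERT INTO dwh.fact_orders (order_id, customer_id, amount, order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM #staging_orders;
    ```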

    Different techniques and methods exist within each stage of the ETL process:

    Extraction:

    • Methods:
    • Pull: The data warehouse pulls data from the source system.
    • Push: The source system pushes data to the data warehouse.
    • Types:
    • Full Extraction: All records from the source tables are extracted.
    • Incremental Extraction: Only new or changed data is extracted.
    • Techniques:
    • Manual extraction
    • Querying a database
    • Parsing a file
    • Connecting to an API
    • Event-based streaming
    • Change data capture (CDC)
    • Web scraping

    Transformation:

    • Data enrichment
    • Data integration
    • Deriving new columns
    • Data normalization
    • Applying business rules and logic
    • Data aggregation
    • Data cleansing:
    • Removing duplicates
    • Data filtering
    • Handling missing data
    • Handling invalid values
    • Removing unwanted spaces
    • Casting data types
    • Detecting outliers
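
    Several of the cleansing techniques above can be combined in a single query using a window function; the table and column names below are illustrative.

    ```sql
    WITH ranked AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY cst_id               -- duplicates share a key
                ORDER BY cst_create_date DESC     -- newest record first
            ) AS flag_last
        FROM bronze.crm_cust_info
    )
    SELECT
        cst_id,
        TRIM(cst_firstname)           AS cst_firstname,   -- remove unwanted spaces
        ISNULL(cst_gender, 'n/a')     AS cst_gender,      -- handle missing data
        CAST(cst_create_date AS DATE) AS cst_create_date  -- cast data types
    FROM ranked
    WHERE flag_last = 1;              -- remove duplicates, keep the newest
    ```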

    Load:

    • Processing Types:
    • Batch Processing: Loading the data warehouse in one large batch of data.
    • Stream Processing: Processing changes as soon as they occur in the source system.
    • Methods:
    • Full Load:
    • Truncate and insert
    • Upsert (update and insert)
    • Drop, create, and insert
    • Incremental Load:
    • Upsert
    • Insert (append data)
    • Merge (update, insert, delete)
    • Slowly Changing Dimensions (SCD):
    • SCD0: No historization; no changes are tracked.
    • SCD1: Overwrite; records are updated with new information, losing history.
    • SCD2: Add historization by inserting new records for each change and inactivating old records.
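
    A minimal T-SQL sketch contrasting SCD1 and SCD2, assuming a `dim_customers` dimension (with `start_date`, `end_date`, `is_current`) and a `staging_customers` snapshot table; all names are illustrative.

    ```sql
    -- SCD1 (overwrite): update in place, losing history.
    UPDATE d
    SET city = s.city
    FROM dim_customers AS d
    JOIN staging_customers AS s ON d.customer_id = s.customer_id
    WHERE d.city <> s.city;

    -- SCD2 (historize), step 1: inactivate the current record when it changed.
    UPDATE d
    SET end_date = GETDATE(),
        is_current = 0
    FROM dim_customers AS d
    JOIN staging_customers AS s ON d.customer_id = s.customer_id
    WHERE d.is_current = 1
      AND d.city <> s.city;

    -- SCD2, step 2: insert a fresh record carrying the new value.
    INSERT INTO dim_customers (customer_id, city, start_date, end_date, is_current)
    SELECT s.customer_id, s.city, GETDATE(), NULL, 1
    FROM staging_customers AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_customers AS d
        WHERE d.customer_id = s.customer_id AND d.is_current = 1
    );
    ```

    SCD1 and SCD2 are shown on the same column only for comparison; in practice a dimension applies one strategy per attribute.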

    Data Modeling for Warehousing and Business Intelligence

    Data modeling is the process of organizing and structuring raw data in a meaningful way that is easy to understand. In data modeling, data is put into new, friendly, easy-to-understand objects like customers, orders, and products. Each object focuses on specific information, and the relationships between those objects are described. The goal is to create a logical data model.

    For analytics, especially in data warehousing and business intelligence, data models should be optimized for reporting, flexible, scalable, and easy to understand.

    Different Stages of Data Modeling:

    • Conceptual Data Model: Focuses on identifying the main entities (e.g., customers, orders, products) and their relationships without specifying details like columns or attributes.
    • Logical Data Model: Specifies columns, attributes, and primary keys for each entity and defines the relationships between entities.
    • Physical Data Model: Includes technical details like data types, lengths, and database-specific configurations for implementing the data model in a database.

    Data Models for Data Warehousing and Business Intelligence:

    • Star Schema: Features a central fact table surrounded by dimension tables. The fact table contains events or transactions, while dimensions contain descriptive information. The relationship between fact and dimension tables forms a star shape.
    • Snowflake Schema: Similar to the star schema but breaks down dimensions into smaller sub-dimensions, creating a more complex, snowflake-like structure.

    Comparison of Star and Snowflake Schemas:

    • Star Schema:
    • Easier to understand and query.
    • Suitable for reporting and analytics.
    • May contain duplicate data in dimensions.
    • Snowflake Schema:
    • More complex and requires more knowledge to query.
    • Optimizes storage by reducing data redundancy through normalization.

    Of the two, the star schema is the more commonly used and is well suited for reporting.

    Types of Tables:

    • Fact Tables: Contain events or transactions and include IDs from multiple dimensions, dates, and measures. They answer questions about “how much” or “how many”.
    • Dimension Tables: Provide descriptive information and context about the data, answering questions about “who,” “what,” and “where”.
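A typical star-schema query joins the fact table to its dimensions and aggregates the measure. The sketch below uses SQLite from Python with hypothetical tables (`fact_sales`, `dim_customer`, `dim_product`); it is not the project's actual schema, only an illustration of the fact/dimension split.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        customer_key INTEGER,   -- IDs pointing to the dimensions
        product_key  INTEGER,
        order_date   TEXT,
        amount       REAL       -- the measure: answers "how much"
    );
    INSERT INTO dim_customer VALUES (1, 'Germany'), (2, 'France');
    INSERT INTO dim_product  VALUES (10, 'Bikes');
    INSERT INTO fact_sales   VALUES (1, 10, '2024-01-05', 500.0),
                                    (1, 10, '2024-02-10', 300.0),
                                    (2, 10, '2024-01-20', 200.0);
""")

# Join the fact (events) to a dimension (context) and aggregate the measure:
rows = conn.execute("""
    SELECT c.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.country
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Germany', 800.0), ('France', 200.0)]
```

The dimension supplies the "who/where" (country) while the fact supplies the "how much" (amount), which is the division of labor the bullet points describe.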

    In the gold layer, data modeling involves creating new structures that are easy to consume for business reporting and analyses.

    Data Transformation: ETL Process and Techniques

    Data transformation is a key stage in the ETL (Extract, Transform, Load) process where extracted data is manipulated and converted into a format that is suitable for analysis and reporting. It occurs after data has been extracted from its source and before it is loaded into the target data warehouse. This process is essential for ensuring data quality, consistency, and relevance in the data warehouse.

    Here’s a detailed breakdown of data transformation, drawing from the sources:

    Purpose and Importance

    • Data transformation changes the shape of the original data.
    • It is a labor-intensive process that can include data cleansing, data integration, and various formatting and normalization techniques.
    • The goal is to reshape and reformat original data to meet specific analytical and reporting needs.

    Types of Transformations

    There are various types of transformations that can be performed:

    • Data Cleansing:
    • Removing duplicates to ensure each primary key has only one record.
    • Filtering data to retain relevant information.
    • Handling missing data by filling in blanks with default values.
    • Handling invalid values to ensure data accuracy.
    • Removing unwanted spaces or characters to ensure consistency.
    • Casting data types to ensure compatibility and correctness.
    • Detecting outliers to identify and manage anomalous data points.
    • Data Enrichment: Adding value to data sets by including relevant information.
    • Data Integration: Bringing multiple sources together into a unified data model.
    • Deriving New Columns: Creating new columns based on calculations or transformations of existing ones.
    • Data Normalization: Mapping coded values to user-friendly descriptions.
    • Applying Business Rules and Logic: Implementing criteria to build new columns based on business requirements.
    • Data Aggregation: Aggregating data to different granularities.
    • Data Type Casting: Converting data from one data type to another.

    Data Transformation in the Medallion Architecture

    In the Medallion architecture, data transformation is strategically applied across different layers:

    • Bronze Layer: No transformations are applied. The data remains in its raw, unprocessed state.
    • Silver Layer: Focuses on basic transformations to clean and standardize data. This includes data cleansing, standardization, and normalization.
    • Gold Layer: Focuses on business-related transformations needed for the consumers, such as data integration, data aggregation, and the application of business logic and rules. The goal is to provide business-ready data that can be used for reporting and analytics.

    SQL Server for Data Warehousing

    The sources mention SQL Server as a tool used for building data warehouses. It is a platform that can run locally on a PC where a database can reside.

    Here’s what the sources indicate about using SQL Server in the context of data warehousing:

    • Building a data warehouse: SQL Server can be used to develop a modern data warehouse.
    • Project platform: In at least one of the projects described in the sources, the data warehouse was built completely in SQL Server.
    • Data loading: SQL Server is used to load data from source files, such as CSV files, into database tables. The BULK INSERT command is used to load data quickly from a file into a table.
    • Database and schema creation: SQL scripts are used to create a database and schemas within SQL Server to organize data.
    • SQL Server Management Studio: SQL Server Management Studio is a client tool used to interact with the database and run queries.
    • Three-layer architecture: The SQL Server database is organized into three schemas corresponding to the bronze, silver, and gold layers of a data warehouse.
    • DDL scripts: DDL (Data Definition Language) scripts are created and executed in SQL Server to define the structure of tables in each layer of the data warehouse.
    • Stored procedures: Stored procedures are created in SQL Server to encapsulate ETL processes, such as loading data from CSV files into the bronze layer.
    • Data quality checks: SQL queries are written and executed in SQL Server to validate data quality, such as checking for duplicates or null values.
    • Views in the gold layer: Views are created in the gold layer of the data warehouse within SQL Server to provide a business-ready, integrated view of the data.
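The data quality checks mentioned above (duplicates and nulls) are simple aggregate queries. Here is a minimal sketch using SQLite from Python; the table `silver_customers` and the sample data are invented for illustration, and on SQL Server the same queries would run unchanged in SSMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE silver_customers (customer_id INTEGER, name TEXT);
    INSERT INTO silver_customers VALUES (1, 'Alice'), (2, NULL), (2, 'Bob');
""")

# Check 1: primary keys that appear more than once (an empty result is a pass).
dupes = conn.execute("""
    SELECT customer_id, COUNT(*) AS cnt
    FROM silver_customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: NULLs in a column that must always be populated.
nulls = conn.execute(
    "SELECT COUNT(*) FROM silver_customers WHERE name IS NULL"
).fetchone()[0]

print(dupes, nulls)  # [(2, 2)] 1  -> both checks flag problems in this sample
```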

    SQL Data Warehouse from Scratch | Full Hands-On Data Engineering Project

    The Original Text

    hey friends so today we are diving into something very exciting Building Together modern SQL data warehouse projects but this one is not any project this one is a special one not only you will learn how to build a modern Data Warehouse from the scratch but also you will learn how I implement this kind of projects in Real World Companies I’m bar zini and I have built more than five successful data warehouse projects in different companies and right now I’m leading big data and Pi Projects at Mercedes-Benz so that’s me I’m sharing with you real skills real Knowledge from complex projects and here’s what you will get out of this project as a data architect we will be designing a modern data architecture following the best practices and as a data engineer you will be writing your codes to clean transform load and prepare the data for analyzis and as a data Modell you will learn the basics of data moding and we will be creating from the scratch a new data model for analyzes and my friends by the end of this project you will have a professional portfolio project to Showcase your new skills for example on LinkedIn so feel free to take the project modify it and as well share it with others but it going to mean the work for me if you share my content and guess what everything is for free so there are no hidden costs at all and in this project we will be using SQL server but if you prefer other databases like my SQL or bis don’t worry you can follow along just fine all right my friends so now if you want to do data analytics projects using SQL we have three different types the first type of projects you can do data warehousing it’s all about how to organize structure and prepare your data for data analysis it is the foundations of any data analytics projects and in The Next Step you can do exploratory data analyzes Eda and all what you have to do is to understand and cover insights about our data sets in this kind of project you can learn how to ask the right questions 
and how to find the answer using SQL by just using basic SQL skills now moving on to the last stage where you can do Advanced analytics projects where you going to use Advanced SQL techniques in order to answer business questions like finding Trends over time comparing the performance segmenting your data into different sections and as well generate reports for your stack holders so here you will be solving real business questions using Advanced SQL techniques now what we’re going to do we’re going to start with the first type of projects SQL data warehousing where you will gain the following skills so first you will learn how to do ETL elt processing using SQL in order to prepare the data you will learn as well how to build data architecture how to do data Integrations where we can merge multiple sources together and as well how to do data load and data modeling so if I got you interested grab your coffee and let’s jump to the projects all right my friends so now before we Deep dive into the tools and the cool stuff we have first to have good understanding about what is exactly a data warehouse why the companies try to build such a data management system so now the question is what is a data warehouse I will just use the definition of the father of the data warehouse Bill Inon a data warehouse is subject oriented integrated time variance and nonvolatile collection of data designed to support the Management’s decision-making process okay I I know that might be confusing subject oriented it means thata Warehouse is always focused on a business area like the sales customers finance and so on integrated because it goes and integrate multiple Source systems usually you build a warehouse not only for one source but for multiple sources time variance it means you can keep historical data inside the data warehouse nonvolatile it means once the data enter the data warehouse it is not deleted or modified so this is how build and mod defined data warehouse okay so now I’m 
going to show you the scenario where your company don’t have a real data management so now let’s say that you have one system and you have like one data analyst has to go to this system and start collecting and extracting the data and then he going to spend days and sometimes weeks transforming the row data into something meaningful then once they have the report they’re going to go and share it and this data analyst is sharing the report using an Excel and then you have like another source of data and you have another data analyst that she is doing maybe the same steps collecting the data spending a lot of time transforming the data and then share at the end like a report and this time she is sharing the data using PowerPoint and a third system and the same story but this time he is sharing the data using maybe powerbi so now if the company works like this then there is a lot of issues first this process it take too way long I saw a lot of scenarios where sometimes it takes weeks and even months until the employee manually generating those reports and of course what going to happen for the users they are consuming multiple reports with multiple state of the data one report is 40 days old another one 10 days and a third one is like 5 days so it’s going to be really hard to make a real decision based on this structure a manual process is always slow and stressful and the more employees you involved in the process the more you open the door for human errors and errors of course in reports leads to bad decisions and another issue of course is handling the Big Data if one of your sources generating like massive amount of data then the data analyst going to struggle collecting the data and maybe in some scenarios it will not be any more possible to get the data so the whole process can breaks and you cannot generate any more fresh data for specific reports and one last very big issue with that if one of your stack holders asks for an integrated report from multiple 
sources well good luck with that because merging all those data manually is very chaotic timec consuming and full of risk so this is just a picture if a company is working without a proper data management without a data leak data warehouse data leak houses so in order to make real and good decisions you need data management so now let’s talk about the scenario of a data warehouse so the first thing that can happen is that you will not have your data team collecting manually the data you’re going to have a very important component called ETL ETL stands for extract transform and load it is a process that you do in order to extract the data from the sources and then apply multiple Transformations on those sources and at the end it loads the data to the data warehouse and this one going to be the single point of Truth for analyzes and Reporting and it is called Data Warehouse so now what can happen all your reports going to be consuming this single point of Truth so with that you create your multiple reports and as well you can create integrated reports from multiple sources not only from one single source so now by looking to the right side it looks already organized right and the whole process is completely automated there is no more manual steps which of course it ru uses the human error and as well it is pretty fast so usually you can load the data from the sources until the reports in matter of hours or sometimes in minutes so there is no need to wait like weeks and months in order to refresh anything and of course the big Advantage is that the data warehouse itself it is completely integrated so that means it goes and bring all those sources together in one place which makes it really easier for reporting and not only integrate you can build in the data warehouse as well history so we have now the possibility to access historical data and what is also amazing that all those reports having the same data status so all those reports can have the same status maybe 
sometimes one day old or something and of course if you have a modern Data Warehouse in Cloud platforms you can really easily handle any big data sources so no need to panic if one of your sources is delivering massive amount of data and of course in order to build the data warehouse you need different types of Developers so usually the one that builds the ATL component and the data warehouse is the data engineer so they are the one that is accessing the sources scripting the atls and building the database for the data warehouse and now for the other part the one that is responsible for that is the data analyst they are the one that is consuming the data warehouse building different data models and reports and sharing it with the stack holders so they are usually contacting the stack holders understanding the requirements and building multiple reports based on the data warehouse so now if you have a look to those two scenarios this is exactly why we need data management your data team is not wasting time and fighting with the data they are now more organized and more focused and with like data warehouse and you are delivering professional and fresh reports that your company can count on in order to make good and fast decisions so this is why you need a data management like a data warehouse think about data warehouse as a busy restaurant every day different suppliers bring in fresh ingredients vegetables spices meat you name it they don’t just use it immediately and throw everything in one pot right they clean it shop it and organize everything and store each ingredients in the right place fridge or freezer so this is the preparing face and when the order comes in they quickly grab the prepared ingredients and create a perfect dish and then serve it to the customers of the restaurant and this process is exactly like the data warehouse process it is like the kitchen where the raw ingredients your data are cleaned sorted and stored and when you need a report or 
analyzes it is ready to serve up exactly like what you need okay so now we’re going to zoom in and focus on the component ETL if you are building such a project you’re going to spend almost 90% just building this component the ATL so it is the core element of the data warehouse and I want you to have a clear understanding what is exactly an ETL so our data exist in a source system and now what we want to do is is to get our data from the source and move it to the Target source and Target could be like database tables so now the first step that we have to do is to specify which data we have to load from the source of course we can say that we want to load everything but let’s say that we are doing incremental loads so we’re going to go and specify a subset of the data from The Source in order to prepare it and load it later to the Target so this step in the ATL process we call it extract we are just identifying the data that we need we pull it out and we don’t change anything it’s going to be like one to one like the source system so the extract has only one task to identify the data that you have to pull out from the source and to not change anything so we will not manipulate the data at all it can stay as it is so this is the first step in the ETL process the extracts now moving on to the stage number two we’re going to take this extract data and we will do some manipulations Transformations and we’re going to change the shape of those data and this process is really heavy working we can do a lot of stuff like data cleansing data integration and a lot of formatting and data normalizations so a lot of stuff we can do in this step so this is the second step in the ETL process the transformation we’re going to take the original data and reshape it transformat into exactly the format that we need into a new format and shapes that we need for anal and Reporting now finally we get to the last step in the ATL process we have the load so in this step we’re going to take 
this new data and we’re going to insert it into the targets so it is very simple we’re going to take this prepared data from the transformation step and we’re going to move it into its final destination the target like for example data warehouse so that’s ETL in the nutshell first extract the row data then transform it into something meaningful and finally load it to a Target where it’s going to make a difference so that’s that’s it this is what we mean with the ETL process now in real projects we don’t have like only source and targets our thata architecture going to have like multiple layers depend on your design whether you are building a warehouse or a data lake or a data warehouse and usually there are like different ways on how to load the data between all those layers and in order now to load the data from one layer to another one there are like multiple ways on how to use the ATL process so usually if you are loading the data from the source to the layer number one like only the data from the source and load it directly to the layer number one without doing any Transformations because I want to see the data as it is in the first layer and now between the layer number one and the layer number two you might go and use the full ETL so we’re going to extract from the layer one transform it and then load it to the layer number two so with that we are using the whole process the ATL and now between Layer Two and layer three we can do only transformation and then load so we don’t have to deal with how to extract the data because it is maybe using the same technology and we are taking all data from Layer Two to layer three so we transform the whole layer two and then load it to layer three and now between three and four you can use only the L so maybe it’s something like duplicating and replicating the data and then you are doing the transformation so you load to the new layer and then transform it of course this is not a real scenario I’m just showing you that in 
order to move from source to a Target you don’t have always to use a complete ETL depend on the design of your data architecture you might use only few components from the ETL okay so this is how ETL looks like in real projects okay so now I would like to show you an overview of the different techniques and methods in the etls we have wide range of possibilities where you have to make decisions on which one you want to apply to your projects so let’s start first with the extraction the first thing that I want to show you is we have different methods of extraction either you are going to The Source system and pulling the data from the source or the source system is pushing the data to the data warehouse so those are the two main methods on how to extract data and then we have in the extraction two types we have a full extraction everything all the records from tables and every day we load all the data to the data warehouse or we make more smarter one where we say we’re going to do an incremental extraction where every day we’re going to identify only the new changing data so we don’t have to load the whole thing only the new data we go extract it and then load it to the data warehouse and in data extraction we have different techniques the first one is like manually where someone has to access a source system and extract the data manually or we connect ourself to a database and we have then a query in order to extract the data or we have a file that we have to pass it to the data warehouse or another technique is to connect ourself to API and do their cods in order to extract the data or if the data is available in streaming like in kfka we can do event based streaming in order to extract the data another way is to use the change data capture CDC is as well something very similar to streaming or another way is by using web scrapping where you have a code that going to run and extract all the informations from the web so those are the different techniques and types 
that we have in the extraction now if you are talking on the transformation there are wide range of different Transformations that we can do on our data like for example doing data enrichment where we add values to our data sets or we do a data integration where we have multiple sources and we bring everything to one data model or we derive a new of columns based on already existing one another type of data Transformations we have the data normalization so the sources has values that are like a code and you go and map it to more friendly values for the analyzers which is more easier to understand and to use another Transformations we have the business rules and logic depend on the business you can Define different criterias in order to build like new columns and what belongs to Transformations is the data aggregation so here we aggregate the data to a different granularity and then we have type of transformation called Data cleansing there are many different ways on how to clean our data for example removing the duplicates doing data filtering handling the missing data handling invalid values or removing unwanted spaces casting the data types and detecting the outliers and many more so we have different types of data cleansing that we can do in our data warehouse and this is very important transformation so as you can see we have different types of Transformations that we can do in our data warehouse now moving on to the load so what do we have over here we have different processing types so either we are doing patch processing or stream processing patch processing means we are loading the data warehouse in one big patch of data that’s going to run and load the data warehouse so it is only one time job in order to refresh the content of the data warehouse and as well the reports so that means we are scheduling the data warehouse in order to load it in the day once or twice and the other type we have the stream processing so this means if there is like a change in 
the source system we going to process this change as soon as possible so we’re going to process it through all the layers of the data warehouse once something changes from The Source system so we are streaming the data in order to have real time data warehouse which is very challenging things to do in data warehousing and if you are talking about the loads we have two methods either we are doing a full load or incremental load it’s a same thing as extraction right so for the full load in databases there are like different methods on how to do it like for example we trate and then insert that means we make the table completely empty and then we insert everything from the scratch or another one you are doing an update insert we call it upsert so we can go and update all the records and then insert the new one and another way is to drop create an insert so that means we drop the whole table and then we create it from scratch and then we insert the data it is very similar to the truncate but here we are as well removing and drubbing the whole table so those are the different methods of full loads the incremental load we can use as well the upserts so update and inserts so we’re going to do an update or insert statements to our tables or if the source is something like a log we can do only inserts so we can go and Abend the data always to the table without having to update anything another way to do incremental load is to do a merge and here it is very similar to the upsert but as well with a delete so update insert delete so those are the different methods on how to load the data to your tables and one more thing in data warehousing we have something called slowly changing Dimensions so here it’s all about the hyz of your table and there are many different ways on how to handle the Hyer in your table the first type is sd0 we say there is no historization and nothing should be changed at all so that means you are not going to update anything the second one which is more 
famous it is the sd1 you are doing an override so that means you are updating the records with the new informations from The Source system by overwriting the old value so we are doing something like the upsert so update and insert but you are losing of course history another one we have the scd2 and here you want to add historization to your table so what we do so what we do each change that we get from The Source system that means we are inserting new records and we are not going to overwrite or delete the old data we are just going to make it inactive and the new record going to be active one so there are different methods on how to do historization as well while you are loading the data to the data warehouse all right so those are the different types and techniques that you might encounter in data management projects so now what I’m going to show you quickly which of those types we will be using in our projects so now if we are talking about the extraction over here we will be doing a pull extraction and about the full or incremental it’s going to be a full extraction and about the technique we are going to be passsing files to the data warehouse and now about the data transformation well this one we will cover everything all those types of Transformations that I’m showing you now is going to be part of the project because I believe in each data project you will be facing those Transformations now if we have a look to the load our project going to be patch processing and about the load methods we will be doing a full load since we have full extraction and it’s going to be trunk it and inserts and now about the historization we will be doing the sd1 so that means we will be updating the content of the thata Warehouse so those are the different techniques and types that we will be using in our ETL process for this project all right so with that we have now clear understanding what is a data warehouse and we are done with the theory parts so now the next step we’re 
going to start with the projects the first thing that you have to do is to prepare our environment to develop the projects so let’s start with that all right so now we go to the link in the description and from there we’re going to go to the downloads and and here you can find all the materials of all courses and projects but the one that we need now is the SQL data warehouse projects so let’s go to the link and here we have bunch of links that we need for the projects but the most important one to get all data and files is this one download all project files so let’s go and do that and after you do that you’re going to get a zip file where you have there a lot of stuff so let’s go and extract it and now inside it if you go over here you will find the reposter structure from git and the most important one here is the data ass sets so you have two sources the CRM and the Erp and in each one of them there are three CSV files so those are the data set for the project for the other stuffs don’t worry about it we will be explaining that during the project so go and get the data and put it somewhere at your PC where you don’t lose it okay so now what else do we have we have here a link to the get repository so this is the link to my repository that I have created through the projects so you can go and access it but don’t worry about it we’re going to explain the whole structure during the project and you will be creating your own repository and as well we have the link to the notion here we are doing the project management here you’re going to find the main steps the main phes of the SQL projects that we will do and as well all the task that we will be doing together during the projects and now we have links to the project tools so if you don’t have it already go and download the SQL Server Express so it’s like a server that going to run locally at your PC where your database going to live another one that you have to download is the SQL Server management Studio it is 
just a client in order to interact with the database and there we’re going to run all our queries and then link to the GitHub and as well link to the draw AO if you don’t have it already go and download it it is free and amazing tool in order to draw diagrams so through the project we will be drawing data models the data architecture a data lineage so a lot of stuff we’ll be doing using this tool so go and download it and the last thing it is nice to have you have a link to the notion where you can go and create of course free account accounts if you want to build the project plan and as well Follow Me by creating the project steps and the project tasks okay so that’s all those are all the links for the projects so go and download all those stuff create the accounts and once you are ready then we continue with the projects all right so now I hope that you have downloaded all the tools and created the accounts now it’s time to move to very important step that’s almost all people skip while doing projects and then that is by creating the project plan and for that we will be using the tool notion notion is of course free tool and it can help you to organize your ideas your plans and resources all in one place I use it very intensively for my private projects like for example creating this course and I can tell you creating a project plan is the key to success creating a data warehouse project is usually very complex and according to Gardner reports over 50% of data warehouse projects fail and my opinion about any complex project the key to success is to have a clear project plan so now at this phase of the project we’re going to go and create a rough project plan because at the moment we don’t have yet clear understanding about the data architecture so let’s go okay so now let’s create a new page and let’s call it data warehouse projects the first thing is that we have to go and create the main phases and stages of the projects and for that we need a table so in order 
to do that hit slash and then type database in line and then let’s go and call it something like data warehouse epic and we’re going to go and hide it because I don’t like it and then on the table we can go and rename it like for example project epics something like that and now what we’re going to do we’re going to go and list all the big task of the projects so an epic is usually like a large task that needs a lot of efforts in order to solve it so you can call it epics stages faces of the project whatever you want so we’re going to go and list our project steps so it start with the requirements analyzes and then designing data architecture and another one we have the project initialization so those are the three big task in the project first and now what do we need we need another table for the small chunks of the tasks the subtasks and we’re going to do the same thing so we’re going to go and hit slash and we’re going to search for the table in line and we’re going to do the same thing so first we’re going to call it data warehouse tasks and then we’re going to hide it and over here we’re going to rename it and say this is the project tasks so now what we’re going to do we’re going to go to the plus icon over here and then search for relation this one over here with the arrow and now we’re going to search for the name of the first table so we called it data warehouse iix so let’s go and click it and we’re going to say as well two-way relation so let’s go and add the relation so with that we got a fi in the new table called Data Warehouse iix this comes from this table and as well we have here data warehouse tasks that comes from from the below table so as you can see we have linked them together now what I’m going to do I’m going to take this to the left side and then what we’re going to do we’re going to go and select one of those epics like for example let’s take design the data architecture and now what we’re going to do we’re going to go and break down this 
epic into multiple tasks, for example "choose data management approach". For the next task we select the same epic; maybe the next step is "brainstorm and design the layers". Then let's go to another epic, the project initialization, and add for example "create git repo, prepare the structure", and another one in the same epic, "create the database and the schemas". As you can see, I'm just defining the subtasks of those epics. Next we're going to add a checkbox so we know whether a task is done or not: go to the plus, search for "checkbox", and make the column really small; each time we finish a task, we click it to mark it done. Now there is one more thing that isn't working nicely: here we're going to end up with a long list of tasks, which is really annoying. So we go to the plus and search for "rollup" and select it. Now we have to pick the relation, "data warehouse tasks", and then set the property to the checkbox. As you can see, the first table now shows how many tasks are closed, but I don't want to show it like this: go to the calculation, choose "percent", then "percent checked", and with that we can see the progress of our project, and instead of numbers we can have a really nice bar. We can also give it a name, like "progress". That's it; we can hide the "data warehouse tasks" column, and with that we have a really nice progress bar for each epic, and if we close all the tasks of this epic we can see that we have
reached 100%. So this is the main structure. Now we can add some cosmetics and rename things to make it look nicer. For example, I can go to the tasks table, call it "tasks", and change the icon to something like this. And if you'd like an icon for each epic, go to the epic, for example "design data architecture", hover over the title, click "add icon", and pick any icon you want, for example this one; as you can see it's now defined at the top, and the icon also shows up in the table below. One more thing we can do for the project tasks is group them by epic: go to the three dots, then "group", and group by the epics. As you can see, we now have a section for each epic, and you can sort the epics if you want (go to "sort", then "manual", and order them as you like), and you can expand and collapse each group so you don't always have to see all tasks at once. This is a really nice way to build project management for your projects. Of course, in companies we use professional tools for this, like Jira, but for private projects I always do it like this, and I really recommend it, not only for this project but for any project you're doing, because if you see the whole project in one go you see the big picture, and closing tasks like this, these small things, can make you really satisfied, keep you motivated to finish the whole project, and make you proud. Okay friends, I just went and added a few icons, renamed some things, and added more tasks for each epic, and this is going to be our starting point in the project; once we have more information we're going to add more details on how exactly
we're going to build the data warehouse. At the start we're going to analyze and understand the requirements, and only after that will we start designing the data architecture. Here we have three tasks: first we have to choose the data management approach, after that we do the brainstorming and design the layers of the data warehouse, and at the end we draw the data architecture, so that we have a clear understanding of how it looks. After that we go to the next epic, where we start preparing our project. Once we have a clear understanding of the data architecture, the first task is to create a detailed project plan: we'll add more epics and more tasks. Once we're done, we'll create the naming conventions for the project, to make sure we have rules and standards across the whole project. Next we'll create a repository in git and prepare the structure of the repository, so that we always commit our work there, and then we can start with the first script, where we create the database and schemas. So my friends, this is the initial plan for the project. Now let's start with the first epic: the requirements analysis. Analyzing the requirements is very important for understanding which type of data warehouse you're going to build, because there isn't just one standard way to build it, and if you go in blindly implementing the data warehouse, you might do a lot of work that is totally unnecessary and burn a lot of time. That's why you have to sit with the stakeholders, with the departments, and understand what exactly we have to build; depending on the requirements, you design the shape of the data warehouse. So now let's analyze the requirements of this project. The whole project is split into two main sections: in the first section we
have to build the data warehouse, so this is a data engineering task where we will develop ETLs and the data warehouse, and once that's done we have to build analytics and reporting, the business intelligence part, where we do data analysis. For now we will focus on the first part, building the data warehouse. So what do we have here? The statement is very simple: develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision-making. That's the main statement, and then we have specifications. The first one is about the data sources: import data from two source systems, ERP and CRM, provided as CSV files. The second is about data quality: we have to clean and fix data quality issues before we do the data analysis, because let's be real, there is no raw data that is perfect; something is always missing and we have to clean it up. The next one is about integration: we have to combine both sources into one single, user-friendly data model designed for analytics and reporting, which means we have to merge those two sources into one data model. Then we have another specification: focus on the latest dataset, so there is no need for historization, meaning we don't have to build histories in the database. The final requirement is about documentation: provide clear documentation of the data model to support the business users and the analytics teams, so the last deliverable of the data warehouse is a manual that helps the consumers of our data and makes their lives easier. As you can see, these may be very generic requirements, but they already carry a lot of information for you: we have to use the platform SQL Server, we have two source systems using
CSV files, it sounds like we really have bad data quality in the sources, we're asked to build a completely new data model designed for reporting, we don't have to do historization, and we're expected to produce documentation of the system. These are the requirements for the data engineering part, where we're going to build a data warehouse that fulfills them. All right, with that we have analyzed the requirements and closed the first, easiest epic. Let's mark it done and open the next one: design the data architecture, where the first task is to choose the data management approach. Designing a data architecture is exactly like building a house. Before construction starts, an architect designs a plan, a blueprint for the house: how the rooms will be connected, how to make the house functional, safe, and wonderful. Without this blueprint from the architect, the builders might create something unstable, inefficient, or maybe unlivable. The same goes for data projects: a data architect is like a house architect; they design how your data will flow, integrate, and be accessed. As data architects, we make sure the data warehouse is not only functional but also scalable and easy to maintain, and this is exactly what we will do now. We will play the role of the data architect and start brainstorming and designing the architecture of the data warehouse. I'm going to show you a sketch to explain the different approaches to designing a data architecture. This phase of a project is usually very exciting for me, because this is my main role in data projects: I am a data architect, and I discuss a lot of different projects where we try to find the best design. All right, let's go. The first step of building a data architecture is to make
a very important decision: choosing between four major types. The first approach is to build a data warehouse. It is very suitable if you have only structured data and your business wants a solid foundation for reporting and business intelligence. Another approach is to build a data lake. This one is way more flexible than a data warehouse: you can store not only structured data but also semi-structured and unstructured data. We usually use this approach when you have mixed types of data, like database tables, logs, images, and videos, and your business wants to focus not only on reporting but also on advanced analytics or machine learning. But it isn't as organized as a data warehouse, and a data lake that becomes too disorganized can turn into a data swamp, which is where the next approach comes in. The next option is to build a data lakehouse. It is like a mix between a data warehouse and a data lake: you get the flexibility of having different types of data from the data lake, but you still structure and organize your data like we do in a data warehouse. You mix those two worlds into one, and this is a very modern way to build data architectures and currently my favorite way of building a data management system. The last and most recent approach is to build a data mesh. This one is a little different: instead of having a centralized data management system, the idea in a data mesh is to make it decentralized, because centralized usually means bottleneck. Instead, you have multiple departments and multiple domains, where each one builds a data product and shares it with the others. So you have to pick one of those approaches, and in this project we will be focusing on the data warehouse. Now the question is how to build the data warehouse; there are likewise four different approaches. The first one is the Inmon approach, so again
you have your sources, and in the first layer you start with staging, where the raw data lands. In the next layer you organize your data into something called the enterprise data warehouse, where you model the data using the third normal form; it's about how to structure and normalize your tables, so you build a new, integrated data model from the multiple sources. Then we go to the third layer, the data marts, where you take a small subset of the data warehouse and design it in a way that is ready to be consumed for reporting, focusing on only one topic, for example customers, sales, or products. After that you connect your BI tool, like Power BI or Tableau, to the data marts. So with that you have three layers to prepare the data before reporting. Moving on, we have the Kimball approach. Kimball says: building this enterprise data warehouse wastes a lot of time, so we can jump immediately from the staging layer to the final data marts, because building the enterprise data warehouse is a big struggle. He wants you to focus on building the data marts as quickly as possible. It is a faster approach than Inmon, but over time you might get chaos in the data marts, because you're not always looking at the big picture and you might repeat the same transformations and integrations in different data marts, so there's a trade-off between speed and a consistent data warehouse. Moving on to the third approach, we have the Data Vault. We still have the staging and the data marts, and we still need a central data warehouse in the middle, but Data Vault brings more standards and rules to this middle layer: it tells you to split it into two layers, the raw vault and the business vault. In the raw vault you have the original data, and in the business vault you have all the business rules and transformations
that prepare the data for the data marts. So Data Vault is very similar to Inmon, but it brings more standards and rules to the middle layer. Now I'm going to add a fourth one, called the Medallion architecture, and this one is my favorite because it is very easy to understand and to build. It says you build three layers: bronze, silver, and gold. The bronze layer is very similar to staging, but we have understood over time that the staging layer is very important, because having the original data as it is helps a lot with traceability and finding issues. The next layer is the silver layer, where we do transformations and data cleansing, but we don't apply any business rules yet. Moving on to the last layer, the gold layer: it is again very similar to the data marts, but there we can build different types of objects, not only for reporting but also for machine learning, for AI, and for many other purposes; they are business-ready objects that you want to share as a data product. So those are the four approaches you can use to build a data warehouse. Again, if you are building a data architecture, you have to specify which approach you want to follow: at the start we said we want to build a data warehouse, and then we have to decide between those four approaches for building it. In this project we will be using the Medallion architecture. This is a very important question to answer as the first step of building a data architecture. All right, with that we have decided on the approach, so we can mark it as done. The next step is to design the layers of the data warehouse. There is no 100% standard set of rules for each layer; what you have to do as a data architect is define exactly what the purpose of each layer is. We start with the bronze layer: we say it's going to store raw and
unprocessed data, as it is from the sources. Why are we doing that? For traceability and debugging. If you have a layer where you keep the raw data exactly as it came from the sources, you can always go back to the bronze layer and investigate the data of a specific source if something goes wrong, so the main objective is to have raw, untouched data that helps you as a data engineer analyze the root cause of issues. Moving on to the silver layer: this is where we store clean and standardized data, and this is the place where we do basic transformations to prepare the data for the final layer. The gold layer contains business-ready data; the main goal here is to provide data that can be consumed by business users and analysts to build reporting and analytics. With that we have defined the main goal of each layer. Next I'd like to define the object types. Since we are talking about a data warehouse in a database, we generally have two types here: either a table or a view. For the bronze layer and the silver layer we go with tables, but for the gold layer we go with views. Best practice says: make the last layer of your data warehouse virtual, using views; it gives you a lot of flexibility and, of course, speed in building it, since we don't need a load process for it. The next step is to define the load method. In this project I have decided to go with a full load using the truncate-and-insert method; it is just faster and way easier. So for the bronze layer we go with a full load, and we specify the same for the silver layer: also a full load. For the views, of course, we don't need any load process. Each time you decide to go with tables, you have to define the load method: full load, incremental load, and so on. Now we come to the very interesting part: the data transformations. For the bronze layer this is the easiest topic, because we don't do any transformations; we have to commit ourselves to not touch the data, not manipulate it, not change anything. It stays as it is: if it comes in bad, it stays bad in the bronze layer. Then we come to the silver layer, where we do the heavy lifting. As we committed in the objective, we have to produce clean and standardized data, and for that we have different types of transformations: data cleansing, data standardization, data normalization, deriving new columns, and data enrichment. There's a whole bunch of transformations to do to prepare the data. Our focus here is to transform the data to make it clean and follow standards, and to push all business transformations to the next layer. That means in the gold layer we will focus on the business transformations needed by the consumers for their use cases: we do data integration between source systems, data aggregations, we apply a lot of business logic and rules, and we build a data model that is ready for, for example, business intelligence. So in the gold layer we do a lot of business transformations, and in the silver layer we do basic data transformations. It is really important to make fine-grained decisions about which types of transformations happen in each layer, and to commit to those rules. The next aspect is data modeling: in the bronze layer and the silver layer we will not break the data model that comes from the source systems. If a source system delivers five tables, we're going to have five tables here, and in the silver layer as well we will not denormalize or normalize or make something new; we leave it exactly as it comes from the source systems.
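To make these layer decisions concrete, here is a minimal T-SQL sketch. All object names (the database, the schemas, the `crm_cust_info` table, and the CSV path) are illustrative assumptions for this walkthrough, not the project's final names:

```sql
-- Sketch only: names and the CSV path are placeholders.
CREATE DATABASE DataWarehouse;
GO
USE DataWarehouse;
GO
-- One schema per Medallion layer.
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO

-- Bronze and silver objects are materialized as tables.
CREATE TABLE bronze.crm_cust_info (
    cst_id        INT,
    cst_firstname NVARCHAR(50),
    cst_lastname  NVARCHAR(50)
);
GO

-- Full load via truncate and insert: wipe the table, then reload everything.
TRUNCATE TABLE bronze.crm_cust_info;
BULK INSERT bronze.crm_cust_info
FROM 'C:\sources\crm\cust_info.csv'   -- hypothetical source file location
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
GO

-- Gold stays virtual: a view needs no load process at all.
CREATE VIEW gold.dim_customers AS
SELECT cst_id, cst_firstname, cst_lastname
FROM silver.crm_cust_info;  -- the silver table would mirror the bronze one
GO
```

The same truncate-and-insert pattern would repeat for every bronze and silver table; only tables need a load step, which is exactly why keeping the gold layer as views makes it dynamic and cheap to rebuild.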
The data model itself is what we build in the gold layer, and here you have to define which data model you want to follow: are you following the star schema, the snowflake schema, or are you just building aggregated objects? You should make a list of all the data model types you're going to follow in the gold layer. Finally, what you can specify for each layer is the target audience, and this is of course a very important decision. In the bronze layer you don't want to give access to any end user; it is really important that only data engineers can access the bronze layer. It makes no sense for data analysts or data scientists to go to the bad data, because you have a better version of it in the silver layer. In the silver layer, the data engineers of course have access, and also the data analysts, the data scientists, and so on, but you still don't give it to business users, who can't deal with the raw data model from the sources; for the business users you're going to build a better layer, and that is the gold layer. The gold layer is suitable for the data analysts as well as the business users, because business users usually don't have deep knowledge of the technicalities of the silver layer. So if you are designing multiple layers, you have to discuss all these topics and make a clear decision for each layer. All right my friends, before we proceed with the design I want to tell you a secret principle, a concept every data architect must know, and that is the separation of concerns. What is that? As you design an architecture, you have to break down the complex system into smaller, independent parts, each responsible for a specific task. And here comes the magic: the components of your architecture must not be duplicated; you cannot have two parts doing the same thing. The idea is to not mix everything, and mixing everything is one of the biggest mistakes in any
big project; I have seen it almost everywhere. A good data architect follows this concept, this principle. For example, if you look at our data architecture, we have already done that: we have defined a unique set of tasks for each layer. We said that in the silver layer we do data cleansing, but in the gold layer we do business transformations, and with that you are not allowed to do any business transformations in the silver layer, and likewise you don't do any data cleansing in the gold layer. Each layer has its own unique tasks. The same goes for the bronze layer and the silver layer: you are not allowed to load data from the source systems directly into the silver layer, because we have decided the landing layer, the first layer, is the bronze layer. Otherwise you would have one set of source systems loaded first into the bronze layer and another set skipping it and going straight to silver, and with that you have overlap: you are doing data ingestion in two different layers. My friends, if you have this mindset, separation of concerns, I promise you you're going to be a data architect, so think about it. All right, with that we have designed the layers of the data warehouse and we can close that task. The next step: we go to draw.io and start drawing the data architecture. There is no single standard for how to draw a data architecture; you can add your own style and do it the way you want. The first thing we have to show in the data architecture is the different layers. The first layer is the source system layer, so let's take a box like this and make it a little bigger, and I'm just going to do the styling: remove the fill, make the line dotted, and change the color to something like this gray. Now we have a container for the first layer, and then we add a text on
top of it. I'll take another box, type "sources" inside it, and style it: set the text to maybe 24, remove the lines, make it a little smaller, and put it on top. This is the first layer, where the data comes from. Then the data goes into a data warehouse, so I'll duplicate this box: this one is the data warehouse. Now, what is the third layer going to be? The consumers who will be consuming this data warehouse, so I'll put in another box and say this is the consume layer. Those are the three containers. Inside the data warehouse we have decided to use the Medallion architecture, so we'll have three layers inside the warehouse. I'll take another box and call this one the bronze layer, and give it a design: I'll go with this color here, set the text to maybe 20, make it a little smaller, and put it here; beneath it we'll have the components, so this is just the title of the container. The container itself I'll keep like this, removing the text inside it and the fill. This container is for the bronze layer; let's duplicate it for the next one. This one is the silver layer, and of course we can change the coloring to gray, because it is silver, and the lines as well, and remove the fill. Great. Maybe I'll also make the font bold. The third layer is the gold layer, and we have to pick a color for it: in the style we have something like yellow, and the same for the container, removing the fill. With that we are now showing the different layers inside our data warehouse. Those containers are still empty, so we're going to go inside each one of them and
start adding content. In the sources it is very important to make clear which types of source systems are connected to the data warehouse, because in a real project there are multiple types: you might have a database, an API, files, Kafka, and it's important to show those different types. In our project we have folders, and inside those folders we have CSV files, so we have to make clear in this layer that the input for our project is CSV files. It really depends on how you want to show that: I'll search for "folder", take the folder icon and put it inside, then search for "file", click "more results", and pick one of those icons, for example this one; I'll make it smaller and put it on top of the folder. With that we make it clear to everyone seeing the architecture that the source is not a database and not an API; it is a file inside a folder. It's also very important to show the source systems, the sources involved in the project. So we'll give each a name: we have one source called CRM, and we'll add the icon, and we have another source called ERP, so we duplicate the first, put it over here, and rename it ERP. Now it is clear to everyone that we have two sources for this project, and the technology used is simply files. We can also add some descriptions inside this box to make it clearer: I'll take a line, because I want to separate the description from the icons, something like this, and make it gray, and below it we'll add some text saying these are CSV files, and as the next point, that the interface is simply files in a folder. Of course you can add any specification and explanation about the sources; if it is a
database, you can state the type of the database, and so on. With that we have made clear in the data architecture what the sources of our data warehouse are. The next step is to design the content of the bronze, silver, and gold containers. I'll start by adding an icon in each container to show that we are talking about a database: search for "database", then "more results", and I'll go with this icon; let's make it bigger, something like this, and maybe change its color, so we have one for the bronze, and likewise for the silver and the gold. Next we'll add some arrows between those layers: search for "arrow", pick one of those, put it here, pick a color, maybe something like this, and adjust it. Now we have a nice arrow between all the layers to show the direction of the architecture, so we can read it from left to right, and we add one between the gold layer and the consume layer as well. Next I'm going to add one statement about each layer, the main objective: grab a text box, put it beneath the database icon, and for the bronze layer write "raw data" (maybe make the text bigger), then for the silver "cleansed, standardized data", and for the gold "business-ready data". With that the objective of each layer is clear. Below all those icons we'll add a separator again, like this, colored, and beneath it the most important specifications of the layer; let's add those separators in each layer. Now we need a text below it; let's take this one here. So
what is the object type of the bronze layer? It's going to be a table. We can also add the load method: this is batch processing, since we are not doing streaming, and it is a full load, not an incremental load, so we can write "truncate and insert". Then we add one more section about the transformations, where we say "no transformations", and one more about the data model, where we say "none (as is)". Next I'll add those specifications for the silver and gold as well: everything we discussed, the object type, the load process, the transformations, and whether we are changing the data model or not, and the same for the gold layer. With that we have a really nice layering of the data warehouse. What is left is the consumers: here you can add the different use cases and tools that can access your data warehouse. For example, I'm adding business intelligence and reporting, maybe using Power BI or Tableau; you can also say: you can access my data warehouse to do ad-hoc analysis using SQL queries, which is what we're going to focus on in the project after we build the data warehouse; and you can offer it for machine learning purposes as well. It is really nice to add some icons to your architecture; I usually use this nice website called Flaticon, which has really amazing icons you can use in your diagrams. Of course we can keep adding icons and details to explain the data architecture and the system; for example, it is very important to state which tools you are using to build this data warehouse: is it in the cloud, are you using Azure Databricks, or maybe Snowflake? For our project we add the icon of SQL Server, since we are building this data warehouse completely in SQL Server. For now I'm really happy with it; as you can see, we have a plan. All right guys, so with
that we have designed the data architecture in draw.io and finished the last step in this epic; we now have a design for the data architecture and can close this epic. Let's go to the next one, where we start preparing our project, and the first task is to create a detailed project plan. All right my friends, it's now clear that we have three layers and we have to build them, which means our big epics are going to follow the layers. I have added three more epics, build bronze layer, build silver layer, and build gold layer, and then defined all the different tasks we have to follow in the project: at the start we'll be analyzing, then coding, after that testing, and once everything is ready we'll document things, and at the end we commit our work to the git repo. All those epics follow the same pattern of tasks, so as you can see we now have a very detailed project structure, and it is much clearer how we're going to build the data warehouse. With that we are done with this task, and the next one is to define the naming conventions of the project. At this phase of a project we usually define the naming conventions. What is that? It's a set of rules you define for naming everything in the project, whether it's a database, schema, tables, stored procedures, folders, anything. If you don't do it in the early phase of the project, I promise you chaos can happen, because you will have different developers in your project, and each of those developers has their own style. One developer might name a table "dimension_customers", where everything is lowercase with underscores between the words, and another developer creates a table called "DimensionProducts" using camel case, so there is
no separation between the words and the first character of each word is capitalized. Maybe another developer uses prefixes, like dim_categories, where we have a shortcut for "dimension". As you can see, there are different designs and styles, and if you leave the door open, in the middle of the project you will notice that everything looks inconsistent, and you will have to define a big task to rename everything following a specific rule. So instead of wasting all that time, at this phase you define the naming conventions; let's do that. We start with a very important decision: which naming convention we're going to follow in the whole project. You have different cases, like camel case, Pascal case, kebab case, and snake case, and for this project we're going with snake case, where all the letters of a word are lowercase and the separator between words is an underscore. For example, a table called customer info becomes customer_info: "customer" is lowercase, "info" is lowercase, and between them is an underscore. This is always the first thing you have to decide for your data project. The second thing is to decide the language: for example, I work in Germany, and there is always a decision to make about whether we use German or English, so we have to decide which language we're going to use for our project. A very important general rule is to avoid reserved words: don't use a SQL reserved word as an object name, like naming a table "table". Those are the general rules that you have to follow across the whole project, and they apply to everything: tables, columns, stored procedures, any names you use in your scripts. Moving on, we have specifications for the table names, and here we have a different set of rules for each layer. For the bronze layer the rule is source system, underscore, entity, so we are saying all the
tables in the bronze layer should start with the source system name, like CRM or ERP, then an underscore, and at the end the entity name, the table name. For example, for a table named crm_cust_info, the prefix tells us this table comes from the source system CRM, and then we have the entity name, customer info. This is the rule we're going to follow for naming all tables in the bronze layer. Moving on to the silver layer, it is exactly like the bronze, because we are not going to rename anything and we are not going to build any new data model; the naming is one-to-one with the bronze, so the rules are exactly the same. But if we go to the gold layer, since we are building a new data model, we have to rename things, and since we are also integrating multiple sources together, we will not use the source system name in the table names, because inside one table you could have multiple sources. The rule says all names must be meaningful, business-aligned names for the tables, starting with a category prefix: category, underscore, entity. Now what is the category? In the gold layer we have different types of tables: one table could be a fact table, another could be a dimension, and a third type could be an aggregation or report. We can specify those types as a prefix at the start. For example, fact_sales: the category is "fact" and the table name is "sales". I made a small table with the different patterns: a dimension starts with dim_, for example dim_customers or dim_products; a fact table starts with fact_; and an aggregated table uses the first three characters, agg_, like agg_customers or agg_sales_monthly. So as you can see, as you are creating a naming
convention, you first have to make the rule clear, describe each part of it, and give examples, so the whole team knows which names to follow. We talked here about the table naming convention; you can also create naming conventions for the columns. For example, in the gold layer we're going to have surrogate keys, and we can define the rule like this: a surrogate key should be the table name, then an underscore, then "key". For example, customer_key is the surrogate key in the dimension dim_customers. The same goes for technical columns: as data engineers we might add our own columns to the tables that don't come from the source system, and those are the technical columns, sometimes called metadata columns. In order to separate them from the original columns coming from the source system, we can use a prefix. The rule says: if you are adding any technical or metadata column, the column name should start with dwh_ and then the column name. For example, for the metadata load date we can have dwh_load_date, and with that, anyone who sees a column starting with dwh understands this data comes from a data engineer. We can keep adding rules, for example for the stored procedures: if you are writing an ETL script, it should start with the prefix load_ and then the layer. The stored procedure responsible for loading the bronze layer is called load_bronze, and for the silver, load_silver. Those are the current rules for the stored procedures, and this is how I usually do it in my projects. All right my friends, with that we have solid naming conventions for our project. This task is done, and for the next one we're going to go to Git, create a brand new repository, and prepare its structure. So let's go.
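To make the conventions above concrete, here is a minimal sketch with illustrative names only (these exact tables are created later in the project; the column names are assumptions for the example):

```sql
-- Bronze and silver: <sourcesystem>_<entity>, names kept one-to-one with the source
SELECT * FROM bronze.crm_cust_info;

-- Gold: <category>_<entity> with a category prefix (dim_, fact_, agg_)
SELECT customer_key,      -- surrogate key: <table>_key
       dwh_load_date      -- technical column: dwh_ prefix
FROM gold.dim_customers;

-- Stored procedures: load_<layer>
EXEC bronze.load_bronze;
```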
All right, so now we come to another important step in any project, and that is creating the git repository. If you are new to Git, don't worry, it is simpler than it sounds. It's all about having a safe place to put the code you are developing: you can track everything that happens to the code, use it to collaborate with your team, and if something goes wrong you can always roll back. And the best part: once you are done with the project, you can share your repository as part of your portfolio, which is a really amazing thing when applying for a job, showcasing that you have built a data warehouse in a well-documented git repository. So let's go and create the repository for the project. We are at the overview of our account, so the first thing to do is go to the repositories tab, then click the green "New" button. The first thing we have to do is give the repository a name, so let's call it sql-data-warehouse-project, and then we can give it a description; for example, I'm writing "Building a modern data warehouse with SQL Server". The next option is whether you want to make it public or private; I'm going to leave it public. Then let's add a README file, and for the license we can select the MIT license; MIT gives everyone the freedom to use and modify your code. Okay, I think I'm happy with the setup; let's create the repository, and with that we have our brand new repository. The next step I usually take is to create the structure of the repository, and I always follow the same pattern in any project: we need a few folders to put our files in. What I usually do is go to "Add file", "Create new file", and start creating the structure. The first thing we need is "datasets" followed by a slash; with the slash, the repository can
understand this is a folder, not a file, and then you can add anything, like a "placeholder" file, just an empty file that helps us create the folder. Let's commit the changes, and if you go back to the main page of the project you can see we now have a folder called datasets. I'll keep creating things the same way: I create the docs folder with a placeholder and commit the changes, then create the scripts folder with a placeholder, and the final one I usually add is the tests folder. With that, as you can see, we now have the main folders of our repository. The next thing I usually do is edit the main README, which you can see over here as well. We go inside the README, click the edit button, and start writing the main information about our project. This really depends on your style, so you can add whatever you want; this is the main page of your repository. As you can see, the file name ends with .md, which stands for Markdown; it is just an easy, friendly format for writing text, so if you have documentation to write, it is a really nice format for organizing and structuring it. At the start I'm going to give a short description of the project: we have the main title, then a welcome message explaining what this repository is about; in the next section we can add the project requirements; and at the end you can say a few words about the licensing and a few words about yourself. As you can see, it's like the homepage of the project and the repository. Once you are done, commit the changes, and now if you go to the main page of the repository, you always see the folders and files at the top, and below them the information from the README. So
again, here we have the welcome statement, then the project requirements, and at the end the licensing and about me. So my friends, that's it: we now have a repository and its main structure, and throughout the project, as we build the data warehouse, we're going to commit all our work to this repository. Nice, right? All right, so with that your repository is ready, and as we go through the project we will keep adding things to it. This step is done, and now, for the last step, we finally go to SQL Server, where we write our first script to create a database and schemas. The first step is to create a brand new database. To do that, we first have to switch to the database master, like this: USE master; with a semicolon. If you execute it, we are switched to the master database; it is a system database in SQL Server from which you can create other databases, and you can see in the toolbar that we are now logged into the master database. Next, we create our new database: we say CREATE DATABASE, and you can call it whatever you want; I'm going with DataWarehouse, then a semicolon. Let's execute it, and with that we have created our database. Let's check it in the Object Explorer: refresh, and you can see DataWarehouse, our new database. Awesome, right? For the next step we switch to the new database: USE DataWarehouse; with a semicolon. Let's switch to it, and you can see we are now logged into the DataWarehouse database and can start building things inside it. The first thing I usually do is start creating the schemas. What is a schema? Think of it like a folder or a container that helps you keep things organized. So now,
as we decided in the architecture, we have three layers, bronze, silver, and gold, and we're going to create a schema for each layer. Let's do that, starting with the first one: CREATE SCHEMA bronze; with a semicolon. Let's create the first schema. Nice, we have a new schema. To check the schemas, go to our database, then to Security, then to Schemas, and as you can see we have bronze; if you don't find it, refresh the schemas and you will see the new one. Great, we have the first schema. Now we create the other two, so I'll just duplicate the statement: the next one is silver, and the third one is gold. If we execute those two together, we get an error, and that's because we don't have a GO in between, so let's put a GO after each command. Now if I highlight the silver and gold statements and execute, it works. GO in SQL Server is a batch separator: it tells SQL to completely execute the first command before going on to the next one. Let's go to our schemas, refresh, and now we can see the gold and the silver as well. With this we have a database and the three layers, and we can start developing each layer individually. Okay, now let's commit our work to git. Since it is a script, we go to the scripts folder, add a new file, call it init_database.sql, and paste our code there. I have made a few modifications: for example, before we create the database, we check whether the database already exists. This is an important step if you are recreating the database; otherwise you will get an error saying that the database already
exists. So first the script checks whether the database exists, and then it drops it. I have also added a few comments, like "create the data warehouse" and "create the schemas", and now we have a very important step: we have to add a header comment at the start of each script. To be honest, three months from now you will not remember all the details of these scripts, and a comment like this is like a sticky note for you when you revisit the script later. It is also very important for the other developers on the team, because each time you or anyone else opens a script, the first question is going to be: what is the purpose of this script, why are we doing these things? As you can see, here we have a comment saying that this script creates a new data warehouse after checking whether it already exists; if the database exists, it drops and recreates it, and additionally it creates three schemas: bronze, silver, and gold. That gives clarity about what the script does and makes everyone's life easier. The second reason this is very important is that you can add warnings, and especially for this script it is crucial, because running it will destroy the whole database. Imagine someone, maybe an admin, opens the script and runs it on your database: everything will be destroyed and all the data will be lost, which would be a disaster if you don't have any backup. So with that, we have a nice header comment and a few comments in our code, and we are ready to commit. Let's commit it, and now we have our script in git as well; of course, if you make any modifications, make sure to update the changes in git. Okay my friends, so with that we have an empty database and schemas.
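A sketch of what such an init script could look like in T-SQL; the header wording is up to you, and the SINGLE_USER step (which disconnects active sessions so the drop can proceed) is my addition, not something dictated in the walkthrough:

```sql
/*
===============================================================
Create Database and Schemas
===============================================================
Purpose:
    Creates the 'DataWarehouse' database after checking whether
    it already exists; if so, it is dropped and recreated.
    Additionally creates three schemas: bronze, silver, gold.

WARNING:
    Running this script drops the entire 'DataWarehouse' database.
    All data will be permanently deleted. Make sure you have
    backups before running it.
===============================================================
*/
USE master;
GO

-- Drop the database if it already exists
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

-- Create the database
CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- One schema per layer; GO separates the batches, since
-- CREATE SCHEMA must be the first statement in its batch
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO
```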
We are done with this task, and with it the whole epic: we have completed the project initialization. Now we get to the interesting part: building the bronze layer. The first task is to analyze the source systems, so let's go. All right, so the big question is how to build the bronze layer. First things first, we analyze: as with developing anything, you don't immediately start writing code. Before we start coding the bronze layer, we have to understand the source system, so what I usually do is interview the source system experts and ask them many, many questions in order to understand the nature of the source system I'm connecting to the data warehouse. Once you know the source systems, we can start coding, and the main focus here is data ingestion: we have to find a way to load the data from the source into the data warehouse. It's like building a bridge between the source and our target system, the data warehouse. Once the code is ready, the next step is data validation; here comes the quality control. In the bronze layer it is very important to check data completeness, which means comparing the number of records between the source system and the bronze layer, just to make sure we are not losing any data in between. Another check we will be doing is the schema check, to make sure the data lands in the right position. And finally, we must not forget documentation and committing our work to git. This is the process we're going to follow to build the bronze layer. All right my friends, before connecting any source system to our data warehouse, there is one very important step: understanding the sources. The way I usually do it is to set up a meeting with the source system experts and interview them, asking a lot of questions about
the source. Gaining this knowledge is very important, because asking the right questions will help you design the correct scripts to extract the data and avoid a lot of mistakes and challenges. Now I'm going to show you the most common questions I usually ask before connecting anything. We start by understanding the business context and the ownership: I'd like to understand the story behind the data, who is responsible for it, which IT department, and so on. It is also good to understand what business process it supports: customer transactions, supply chain logistics, or maybe finance reporting? With that you understand the importance of your data. Then I ask about the system and data documentation: documentation from the source is your learning material about the data, and it saves you a lot of time later when you are working on and designing new data models. I also always want to understand the data model of the source system, and if they have descriptions of the columns and the tables, it's nice to have a data catalog; this helps me a lot in the data warehouse when deciding how to join the tables together. With that you get a solid foundation on the business context, the processes, and the ownership of the data. In the next step we start talking about the technical side: I'd like to understand the architecture and the technology stack. The first question I usually ask is how the source system stores the data: is it on-prem, like a SQL Server or Oracle, or in the cloud, like Azure or AWS? Once we understand that, we can discuss the integration capabilities: how am I going to get the data? Does the source system offer APIs, maybe Kafka, or do they only have file extractions, or are they going
to give you a direct connection to the database? Once you understand the technology you'll use to extract the data, you can dive into more technical questions, where we figure out how to extract the data from the source system and load it into the data warehouse. The first thing to discuss with the experts is whether we can do an incremental load or a full load. After that we discuss the data scope and historization: do we need all the data, or maybe only ten years of it? Is the history already kept in the source system, or should we build it in the data warehouse? Then we discuss the expected size of the extracts: are we talking about megabytes, gigabytes, terabytes? This is very important for understanding whether we have the right tools and platform to connect the source system. Then I try to understand whether there are any data volume limitations: some old source systems can struggle a lot with performance, so if you have an ETL that extracts a large amount of data, you might bring the source system's performance down. That's why you have to understand whether there are limitations on your extracts, as well as other aspects that might impact the performance of the source system; this is very important if they give you access to the database, because you are responsible for not dragging that database's performance down. And of course, a very important question is about authentication and authorization: how are you going to access the data in the source system? Do you need any tokens, keys, passwords, and so on? Those are the questions you have to ask when connecting a new source system to the data warehouse, and once you have the answers, you can proceed with the next steps to connect the sources to the data warehouse.
All right my friends, with that you have learned how to analyze a new source system that you want to connect to your data warehouse. This step is done, and now we go back to coding, where we write the scripts for the data ingestion from the CSV files into the bronze layer. Let's have a quick look at our bronze layer specifications again: we just have to load the data from the sources into the data warehouse; we're going to build tables in the bronze layer; we are doing a full load, which means truncating and then inserting the data; there will be no data transformations at all in the bronze layer; and we will not be creating any data model. Those are the specifications of the bronze layer. Now, in order to create the DDL script for the bronze layer, creating the bronze tables, we have to understand the metadata, the structure, the schema of the incoming data. Either you ask the technical experts from the source system for this information, or you explore the incoming data and derive the structure of your tables from it. So we start with the first source system, the CRM. Let's go inside it, and we'll begin with the first table, the customer info. If you open the file and check the data inside it, you see we have a header row, and that is very good, because now we have the names of the columns coming from the source, and from the content you can of course derive the data types. Let's do that: first we say CREATE TABLE, then we define the layer, which is bronze, and now, very importantly, we follow the naming convention: we start with the name of the source system, CRM, then an underscore, then the table name from the source system, so it's going to be cust_info. That is the name of our first table in the bronze layer. For the next step we have to define, of
course, the columns, and again the column names in the bronze layer are one-to-one, exactly like the source system. The first one is the ID, and I'll go with the data type INT; the next one is the key, NVARCHAR, with a length of 50; and the last one is the create date, which is going to be DATE. With that we have covered all the columns available from the source system; let's check, and yes, the last one is the create date. That's it for the first table; semicolon at the end, of course. Let's execute it, go to the Object Explorer, refresh, and we can see the first table inside our data warehouse. Amazing, right? Next, you have to create a DDL statement for each file from both systems: for the CRM we need three DDLs, and for the other system, the ERP, we also need three DDLs for its three files. So in the end we're going to have six tables, six DDLs, in the bronze layer. Now pause the video and go create those DDLs; I'll be doing the same, and we'll see you soon. All right, I hope you have created all of them; I'm going to show you what I've just created. The second table in the source CRM holds the product information, and the third one the sales details. Then we go to the second system, and here we make sure we follow the naming convention: first the source system, ERP, then the table name. The second system was really easy; you can see one table has only two columns, the customers table only three, and the categories table only four. All right, after defining all of this, of course we have to execute it, so let's do that, then go to the Object Explorer and refresh the tables. You can see we have six empty tables in the bronze layer: all the tables from the two source systems are now inside our database.
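A sketch of such a bronze DDL, showing only the three columns mentioned above; the exact column names are illustrative assumptions, since in practice the CSV header determines the full list:

```sql
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,           -- customer id from the source file
    cst_key         NVARCHAR(50),  -- customer key from the source file
    cst_create_date DATE           -- record creation date from the source file
);
```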
But we still don't have any data. You can see our naming convention works really nicely: the first three tables come from the CRM source system and the other three from the ERP, so in the bronze layer things are split cleanly and you can quickly identify which table belongs to which source system. Now, there is something else I usually add to a DDL script: a check for whether the table exists before creating it. For example, say you are renaming a column or want to change the data type of a specific field; if you just run this script again, you will get an error, because the database will say this table already exists. In other databases you can say CREATE OR REPLACE TABLE, but in SQL Server you have to build this logic in T-SQL. It is very simple: first we check whether the object exists in the database, so we say IF OBJECT_ID and then specify the table name; copy the whole name over, and make sure you get exactly the same name as the table (there is a stray space, which I'll remove). Then we define the object type, 'U', which stands for user-defined table. If this is not null, the database did find the object, so we tell it to drop the table: DROP TABLE, the full name again, and a semicolon. So again: if the table exists in the database (the OBJECT_ID is not null), drop the table, and after that create it. Now if you highlight the whole thing and execute it, it works: first drop the table if it exists, then create the table from scratch. What you have to do now is add this check before creating every table in our database; it's the same pattern for the next table and so on. I went and added those checks for each table, and if I execute the whole thing, it works.
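The drop-if-exists pattern described above looks like this (columns shortened to the ones mentioned earlier):

```sql
-- Drop the table if it already exists ('U' = user-defined table)
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;
GO

-- Then create it from scratch
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,
    cst_key         NVARCHAR(50),
    cst_create_date DATE
);
GO
```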
So with that, I'm recreating all the tables in the bronze layer from scratch. Now, the method we're going to use to load the data from the source into the data warehouse is BULK INSERT. Bulk insert is a method for loading massive amounts of data very quickly from files, like CSV or text files, directly into a database. It is not like the classic insert, where the data goes in row by row; instead, a bulk insert is one operation that loads all the data into the database in one go, and that's what makes it very fast. So let's use this method. Now let's start writing the script to load the first table in the source CRM: we're going to load the customer info table from the CSV file into the database table. The syntax is very simple: we start with BULK INSERT, so SQL understands we are not doing a normal insert but a bulk insert, and then we specify the table name, bronze.crm_cust_info. Next we have to specify the full location of the file we are loading into this table, so we get the path where the file is stored; I'm going to copy the whole path and add it to the BULK INSERT, pointing exactly at where the data lives. For me it is in my sql-data-warehouse-project folder, under datasets, in the source CRM folder, and then I specify the file name, cust_info.
csv. You have to get the path of your files exactly right, otherwise it will not work. After the path we come to the WITH clause, where we tell SQL Server how to handle our file; here come the specifications, and there is a lot we can define. Let's start with a very important one, the header row. If you check the content of our files, you can see that the first row always contains the header information of the file; that row is not data, it's just the column names, and the actual data starts from the second row. We have to tell the database about this, so we say the first row is actually the second row: FIRSTROW = 2. With that we are telling SQL to skip the first row in the file; we don't need to load that information, because we have already defined the structure of our table. That is the first specification. The next one, just as important when loading any CSV file, is the separator between fields, the delimiter. It really depends on the file structure you are getting from the source; as you can see, all these values are split by a comma, and we call that comma the field separator or delimiter. I have seen a lot of different CSVs: sometimes they use a semicolon, a pipe, or a special character like a hash, so you have to understand how the values are split. In this file it's the comma, and we have to tell SQL about it; it's very important: FIELDTERMINATOR = ','. Basically, those two pieces of information are essential for SQL to be able to read your CSV file. There are many other options you can add, for example TABLOCK, an option to improve performance by locking the entire table during the load: as SQL loads the data into this table, it locks the whole table. So that's it for now; I'll just add a semicolon at the end.
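Putting the pieces together, the load statement sketched so far looks like this; the file path below is a placeholder for wherever your CSVs actually live:

```sql
BULK INSERT bronze.crm_cust_info
FROM 'C:\datasets\source_crm\cust_info.csv'  -- placeholder path, adjust to your machine
WITH (
    FIRSTROW = 2,           -- skip the header row
    FIELDTERMINATOR = ',',  -- fields are comma-separated
    TABLOCK                 -- lock the table during the load for performance
);
```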
Let's execute it and insert the data from the file into our bronze table. You can see SQL inserted around 18,000 rows into our table, so it is working: we just loaded the file into our database. But it is not enough to just write the script; you have to test the quality of your bronze table, especially when you are working with files. Let's do a simple SELECT from our new table and run it. The first thing I check is: do we have data in each column? Yes, as you can see, we do. The second thing is: is the data in the correct columns? This is very critical when loading data from a file into a database. For example, here we have the first name, which of course makes sense, and here we have the last name. But what could happen, and this mistake happens a lot, is that you find the first-name information inside the key column, the last name inside the first-name column, and the status inside the last-name column: the data is shifted. This data engineering mistake is very common when working with CSV files, and there are different reasons for it: maybe the definition of your table is wrong, or the field separator is wrong (maybe it's not a comma but something else), or the separator itself is a bad choice, because sometimes the keys or first names contain a comma and SQL cannot split the data correctly, meaning the quality of the CSV file is not really good. There are many reasons why the data might not land in the correct columns, but for now everything looks fine for us. The next step is to count the rows in this table: we can see we have 18,490. Now we can go to our CSV file, check how many rows it has, and we are almost there.
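The quality checks described above can be sketched as a few simple queries:

```sql
-- 1. Did data land in every column, and in the right column?
SELECT TOP (10) * FROM bronze.crm_cust_info;

-- 2. Completeness: compare this count with the file's row count
--    (expect one row less than the file, because of the header)
SELECT COUNT(*) AS row_count FROM bronze.crm_cust_info;
```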
There is one extra row in the file, and that's because of the header: the header row is not loaded into our table, which is why our tables will always have one row less than the original files. So everything looks good, and we have done this step correctly. Now, if I run it again, what happens? We get duplicates inside the bronze layer: we have loaded the file twice into the same table, which is not correct. The method we discussed is to first make the table empty, truncate, and then insert. To do that, before the BULK INSERT we say TRUNCATE TABLE, then our table name, and close with a semicolon. So now we first empty the table and then load from scratch, loading the whole content of the file into the table; this is what we call a full load. Let's mark everything together and execute, and if you check the content of the table again, you can see we still have only around 18,000 rows; run the count on the bronze table again, and it is unchanged. Each time you run this script now, we refresh the table customer info from the file into the database table, refreshing the bronze layer table. That means that if there are any changes in the file, they will be loaded into the table. This is how you do a full load in the bronze layer: truncate the table, then do the inserts. Of course, what you should do now is pause the video and write the same script for all six files, so let's do that.
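The full-load pattern for one file, truncate then bulk insert, looks roughly like this (table name and path are illustrative):

```sql
-- Full load: empty the table first, then reload the entire file.
TRUNCATE TABLE bronze.crm_cust_info;

BULK INSERT bronze.crm_cust_info
FROM 'C:\datasets\source_crm\cust_info.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    TABLOCK
);
-- Running this script repeatedly refreshes the table without creating duplicates.
```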
Okay, back. I hope you have also written all those scripts. I have three tables to load the first source system and then three sections to load the second source system. As you write these scripts, make sure you have the correct path: for the second source system you have to change the path to the other folder. Also don't forget that the table name in the bronze layer is different from the file name, because we always start with the source system name, which the files don't have. Now I think everything is ready, so let's execute the whole thing. Perfect, awesome, everything is working. Let me check the messages: we can see from them how many rows were inserted into each table, and of course the task now is to go through each table and check the content.

So we have a really nice script to load the bronze layer, and we will use it on a daily basis: every day we have to run it to get new content into the data warehouse. As you learned before, if you have a SQL script that is frequently used, you can create a stored procedure from it. Let's do that; it's going to be very simple. We say CREATE OR ALTER PROCEDURE, and now we have to define the name. I'm going to put it in the schema bronze, because it belongs to the bronze layer, and then follow the naming convention: the stored procedure starts with load_ and then the layer name. That's it for the name. Then, very importantly, we define the BEGIN and the END around our SQL statements: here is the beginning, and at the end we say END. Let's highlight everything in between and give it one push with Tab so it is easier to read. Next, we execute it to create the stored procedure. If you want to check it, go to the database, then the folder called Programmability, and inside it Stored Procedures; refresh and you will see our new stored procedure. Let's test it: open a new query and say EXECUTE bronze.load_bronze, execute it, and with that we have just loaded the complete bronze layer. As you can see, SQL inserted all the data from the files into the bronze layer; it is way easier than running those scripts each time.

All right, now the next step. As you can see, the output message really doesn't carry much information; the messages of your ETL stored procedure will not be clear by default. That's why, if you are writing an ETL script, you should always take care of the messaging in your code. Let me show you a nice design. Back in our stored procedure, we can divide the messages based on our code. We start with a message at the top: PRINT what this stored procedure is doing, loading the bronze layer. This is the main message, the most important one, and we can play with separators: PRINT a line of equals signs at the start and at the end, just to create a section. This is just a nice message at the start. Now, looking at our code, we can see it is split into two sections: the first section loads all the tables from the source system CRM, and the second section loads the tables from the ERP, so we can split the prints by source system. We say PRINT 'Loading CRM Tables' for the first section, add some nice separators, this time minus signs, and don't forget to add a semicolon for each print. Same thing over here: I'll copy the whole block, because we want it for the second section as well, the one for the ERP.
For the second section we call it 'Loading ERP Tables', and with that the output shows a nice separation between the source systems. The next step is to add a print for each action. For example, where we truncate a table, we say PRINT, add two arrows, and state what we are doing: truncating the table, followed by the table name in the message. That is the first action, and we add another print for inserting the data: 'Inserting Data Into:' followed by the table name. With that, we can understand from the output what SQL is doing. Let's repeat this for all the other tables. Okay, I have added all those prints, and don't forget the semicolon at the end; let's execute and check the output. Things are far more organized than before: at the start we read that we are loading the bronze layer, first the source system CRM and then, in the second section, the ERP, and we can see the actions: truncating, inserting, truncating, inserting for each table, and the same for the second source. It is nice and cosmetic, but it is very important when you are debugging any errors.

And speaking of errors, we have to handle them in our stored procedure. It's the first thing we add: we say BEGIN TRY, then go to the end of our script and, before the last END, say END TRY. Next we add the catch: BEGIN CATCH and END CATCH. First, let's organize the code: take the whole block and give it one more push with Tab, along with the BEGIN TRY, so it is more organized. As you know, TRY...CATCH executes the try block, and if any error occurs while executing that script, the second section runs; the catch is executed only if SQL failed to run the try. Now we have to define for SQL what to do if there is an error in the code. We can do multiple things here, like creating logging tables and writing the messages into them, or adding some clear messaging to the output. For example, we can add a section again, equals signs at the top and bottom, with content in between: something like 'Error occurred during loading bronze layer'. Then we can add more details, for example the error message by calling the function ERROR_MESSAGE(), and also the error number with ERROR_NUMBER(). The output of that is a number, but the error message is text, so we have to change the data type with a CAST AS NVARCHAR. There are many more functions you can add to the output, like ERROR_STATE() and so on; you can design exactly what happens when there is an error in the ETL.

What else is very important in any ETL process is the duration of each step. For example, I would like to understand how long it takes to load this table here, but looking at the output I have no information about how long my tables take to load. This matters because, as you build a big data warehouse, the ETL process will take a long time, and you want to understand where the issue is, where the bottleneck is, which table is consuming the most time to load. That's why we have to add this information to the output as well, or even log it in a table. So let's add this step.
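As a minimal sketch, the error handling described above could look like this (the message text is up to you):

```sql
BEGIN TRY
    -- ... truncate + bulk insert statements for all bronze tables ...
    PRINT 'Loading Bronze Layer';
END TRY
BEGIN CATCH
    PRINT '==========================================';
    PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    PRINT 'Error Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
    PRINT 'Error State  : ' + CAST(ERROR_STATE() AS NVARCHAR);
    PRINT '==========================================';
END CATCH
```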
Go to the start. To calculate a duration you need the start time and the end time: we have to capture when we started loading and when we finished loading the table. First we declare the variables: DECLARE a variable called start_time with the data type DATETIME, since I need the exact second it started, and another variable, end_time, also DATETIME. With the variables declared, the next step is to use them. Go to the first table, the customer info, and at the start say SET @start_time = GETDATE(), so we capture the exact time loading begins. Then copy that line to the end of the load and say SET @end_time = GETDATE(). Now we have the values of when loading started and when it completed, and the next step is to print the duration. We say PRINT, use the same design with two arrows, and write simply 'Load Duration:' with a colon and a space. To calculate the duration we use the date and time function DATEDIFF, which finds the interval between two dates. We add a plus, then DATEDIFF with three arguments: first the unit (you can use second, minute, hour, and so on; we'll go with second), then the start of the interval, the start time, and finally the end of the boundary, the end time. The output of this is a number, so we have to cast it: CAST(... AS NVARCHAR), close it, and at the end add + ' seconds' to get a nice message.

Again, what we have done: we declared two variables, captured the current date and time at the start and at the end of loading the table, and took the difference between them to get the load duration; in this case we are simply printing that information. We can also add a nice separator between tables, just a few minus signs, nothing much. Now we have to add this mechanism for each table, in order to measure the speed of the ETL for each one of them.

Okay, I have added this configuration for every table, so let's run the whole thing: alter the stored procedure and execute it. As you can see, we now have one more piece of information, the load duration, and everywhere it says 0 seconds. That's because loading is super fast here: we are doing everything locally on one PC, so loading data from files into the database is mega fast. In real projects, of course, you have different servers with networking between them and millions of rows in the tables, so the durations will not be 0 seconds; things will be slower, and then you can easily see how long each of your tables takes to load. What is also very interesting is to understand how long it takes to load the whole bronze layer, so your next task is to print, at the end, information about the whole batch: how long it took to load the bronze layer. Okay, I hope you are done; I did it like this: we define two new variables, the start time of the batch and the end time of the batch.
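The per-table timing mechanism we just added can be sketched like this (the table name is illustrative):

```sql
DECLARE @start_time DATETIME, @end_time DATETIME;

SET @start_time = GETDATE();
-- TRUNCATE TABLE + BULK INSERT for bronze.crm_cust_info goes here
SET @end_time = GETDATE();

-- Print the elapsed time in seconds for this table's load.
PRINT '>> Load Duration: '
    + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR)
    + ' seconds';
PRINT '-------------------------';
```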
The first step in the stored procedure is to capture the date and time for the batch start variable, and the very last thing we do in the stored procedure is to capture the date and time for the batch end: again SET ... = GETDATE() for the batch end time. Then all you have to do is print a message: 'Loading Bronze Layer is Completed', followed by the total load duration, the same DATEDIFF between the batch start time and the batch end time, calculating the seconds and so on. Now execute the whole thing: refresh the definition of the stored procedure and run it. In the output, go to the last message and you can see 'Loading Bronze Layer is Completed', and the total load duration is also 0 seconds, because the execution takes less than one second.

With that you are getting a feeling for how to build an ETL process. As you can see, data engineering is not all about how to load the data; it is about how to engineer the whole pipeline: how to measure the speed of loading the data, what happens if there is an error, how to print each step of your ETL process and keep everything organized and clear in the output, and maybe in logging, just to make debugging and performance optimization much easier. There is a lot more we could add, like quality measures and so on; we can add many things to our ETL scripts to make the data warehouse professional.

All right my friends, with that we have developed the code to load the bronze layer and tested it as well. In the next step we go back to draw.io, because we want to draw a diagram about the data flow. So what is a data flow diagram? We draw a simple visual to map the flow of our data: where it comes from and where it ends up. We just want to make clear how the data flows through the different layers of the project, and that helps us create something called data lineage. This is really nice, especially if you are analyzing an issue: if you have multiple layers and no real data lineage or flow documented, it is going to be really hard to read through the scripts to understand the origin of the data, and having this diagram improves the process of finding issues. So let's create one.

Okay, back in draw.io, we build the flow diagram. We start with the source systems: build the layer, remove the fill, make it dotted, then add a box saying 'Sources', place it at the top, increase the font size to 24, and remove the lines. What do we have inside the sources? Folders and files, so search for a folder icon; I'll take this one and label it CRM, increase the size, and add the other source, the ERP. That's the first layer. Now the bronze layer: grab another box, adjust the coloring, maybe use the hatch style instead of auto, make it rounded, and put a title on top saying 'Bronze Layer', increasing the font size as well. Next we add boxes for each table we have in the bronze layer: for example the sales details (make it a little smaller, maybe 16 and not bold), plus the other two tables from the CRM, the customer info and the product info. Those are the three tables that come from the CRM, and now we connect the source CRM with all three tables.
Go to the folder and start drawing arrows from it to the bronze layer tables, and then do the same for the ERP source. As you can see, the data flow diagram shows the data lineage between the two layers in one picture: we can easily see that these three tables come from the CRM and those three tables in the bronze layer come from the ERP. I understand that with a lot of tables it would become a huge mess, but for a small or medium data warehouse, building these diagrams makes it much easier to understand how everything flows from the sources into the different layers of your data warehouse. All right, with that we have the first version of the data flow, so this step is done, and the final step is to commit our code to the Git repo.

Okay, let's commit our work. Since these are scripts, we go to the folder 'scripts', and because we will have scripts for the bronze, silver, and gold layers, it makes sense to create a folder for each layer. Start by creating the bronze folder: create a new file, type 'bronze/' and then the name of the DDL script for the bronze layer with the .sql extension. Paste the DDL code we created for those six tables, and as usual add a comment at the start explaining the purpose of the script: these scripts create tables in the bronze schema, and by running them you redefine the DDL structure of the bronze tables. Commit the changes. Now, as you can see, inside 'scripts' we have a folder called 'bronze' and inside it the DDL script for the bronze layer. In the same folder we also put our stored procedure: create a new file, call it 'proc_load_bronze.sql', paste the script, and as usual add an explanation at the start: this stored procedure loads data from the CSV files into the bronze schema; it first truncates the tables and then does bulk inserts; regarding parameters, it does not accept any parameters or return any values; and here is a quick example of how to execute it. All right, I'm happy with that, so let's commit it.

All right my friends, with that we have committed our code into the Git repo, and we are done building the bronze layer. The whole thing is done, and we move to the next layer, which is going to be more advanced than the bronze layer, because there will be a lot of struggle with cleaning the data and so on. We start with the first task, analyzing and exploring the data in the source systems. So now the big question: how do we build the silver layer, what is the process? As usual, first things first, we analyze. The task, before building anything in the silver layer, is to explore the data in order to understand the content of our sources. Once we have that, we start coding, and the transformation we do here is data cleansing. This is usually a process that takes a really long time, and I usually do it in three steps. The first step is to check the data quality issues we have in the bronze layer: before writing any data transformations, we first have to understand what the issues are. Only then do I start writing data transformations to fix all the quality issues we found in the bronze. And the last step, once we have clean results, is to insert them into the silver layer. Those are the three phases we go through as we write the code for the silver layer.
The third step, once we have all the data in the silver layer, is to make sure the data is now correct and that we don't have any quality issues anymore. If you do find issues, of course, you go back to coding, do the data cleansing, and check again; it is a cycle between validating and coding. Once the quality of the silver layer is good, we cannot skip the last phase, where we document and commit our work in Git. Here we will have two new documents: the data flow diagram and the data integration diagram, built after we have understood the relationships between the sources in the first step. That is the process, and that is how we will build the silver layer.

So now, exploring the data in the bronze layer. Why is it so important? Because understanding the data is the key to making smart decisions in the silver layer. In the bronze layer, understanding the content of the data was not the focus at all; we focused only on getting the data into the data warehouse. That's why we now take a moment to explore and understand the tables, how to connect them, and what the relationships between these tables are. And as you learn about a new source system, it is very important to create some kind of documentation. So let's explore the sources one by one. We start with the first one from the CRM, the customer info: right-click on it and select the top 1000 rows. This is of course important if you have a lot of data: don't explore millions of rows; always limit your queries, for example with TOP 1000 here, just to make sure you are not impacting the system with your queries. Now let's look at the content of this table. We can see customer information: an ID, a key for the customer, first name, last name, marital status, gender, and the creation date of the customer. Simply put, this is a table of customer information with a lot of details about the customers, and we have two identifiers: one is a technical ID and the other is more like the customer number, so maybe we can use either the ID or the key to join it with other tables.

What I usually do now is draw a data model, or let's say an integration model, to document and visualize what I am learning, because if you don't, you will forget it after a while. Search for a shape: search for 'table' and pick this one. You can change the style, make it rounded or sketched and so on, and change the color; I'll make it blue. Then go to the text, select the whole thing and make it bigger, 26, and for the rows select them, go to Arrange, and set the height to maybe 40, something like that. Now we put in the table name, since this is the one we are learning about, and the primary key, which was the ID; I won't list all the columns, so I remove the rest. As you can see, the table name is not really friendly, so I bring in a text label, put it on top, and say this is the customer information, just to keep it friendly and not forget about it, and increase its size to maybe 20, something like that. Okay, with that we have our first table, and we keep exploring. Let's move to the second one, the product information: right-click on it and select the top 1000 rows; I'll just put it below the previous query and run it.
Looking at this table, we can see product information: a primary key for the product, then a key, or let's say a product number, and after that the full name of the product, the product cost, the product line, and then a start and an end date. Well, this is interesting; why do we have a start and an end? Look, for example, at these three rows: all three have the same key but different IDs, so it is the same product but with different costs. For 2011 the cost is 12, for 2012 it is 14, and for the last year, 2013, it is 13. So it's like a history of the changes: this table holds not only the current information about the product but also historical information, and that's why we have those two dates, start and end. Let's go back and capture this in the drawing: I'll duplicate the shape, name this table the product info, and give it a short description like 'current and historical product information', just so we don't forget that this table contains history. Here we also have the product ID, but there is nothing we can use to join these two tables: we don't have a customer ID here, and the other table has no product ID.

Okay, that's it for this table; let's jump to the third and last table in the CRM. I've made the other queries short as well, so let's execute. What do we have here? A lot of information about the orders and sales, and a lot of measures: the order number; the product key, which is something we can use to join with the product table; and the customer ID (we don't have the customer key here, so there are two different ways of joining the tables, one by ID and one by key). Then we have dates: the order date, the shipping date, and the due date, and then the sales amount, the quantity, and the price. So this is an event table, a transactional table about the orders and sales, and it is a great table to connect the customers with the products and the orders. Let's document this new information: the table name is the sales details, which we can describe as 'transactional records about sales and orders'. Now we describe how to connect this table to the other two. We are not using the product ID but the product key, and we need a new column here: hold Ctrl+Enter, or add a new row from the menu, and the new row is the customer ID. For the customer ID it is easy: we grab an arrow and connect those two tables. But for the product key we are not using the ID, so I remove that one, write 'product key', and double-check: this is a product key, not a product ID, and if we check the products table, you can see we are joining on this key and not on the primary key. So we link it like this, and maybe switch the two tables, putting the customer below; perfect, it looks nice.

Okay, let's keep moving to the other source system, the ERP. The first one has a cryptic name; let's select the data. It is a small table with only three pieces of information: something called CID, then what I think is the birthday, and the gender information (male, female, and so on). So it again looks like customer information, but with extra data about the birthday. Now, if you compare it to the customer table we have from the other source system, query it and you can see the new table from the ERP has no IDs; it actually has the customer number, the key.
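The CRM relationships documented so far could later be expressed as a join like the one below (the table and column names here are illustrative, following the source-system naming convention; your actual bronze column names may differ):

```sql
-- Sales details connect to customers by ID and to products by key.
SELECT
    s.sls_ord_num,
    c.cst_firstname,
    p.prd_nm,
    s.sls_sales
FROM bronze.crm_sales_details AS s
LEFT JOIN bronze.crm_cust_info AS c
    ON s.sls_cust_id = c.cst_id     -- join on the customer ID
LEFT JOIN bronze.crm_prd_info AS p
    ON s.sls_prd_key = p.prd_key;   -- join on the product key, not the primary key
```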
So we can join those two tables using the customer key. Let's document this: I'll copy and paste the shape and put it on the right side, and change the color, since we are now talking about a different source system. The table name goes here, and the key is called CID. To join this table with the customer info we cannot use the customer ID; we need the customer key, so add a new row with Ctrl+Enter, call it the customer key, and draw a nice arrow between those two keys. Give it the description 'customer information', and note that here we also have the birth date.

Okay, let's keep going to the next one, the ERP location. Query the table: what do we have here? The CID again, and as you can see, country information; this is of course again the customer number, and the country is the only other information. Let's document this: this is the customer location; the table name goes in like this, we still have the same customer ID, and we can join it using the customer key. Give it the description 'location of customers', with the country.

Now let's go to the last table and explore it: the ERP PX catalog. Query it: what do we have here? An ID, a category, a subcategory, and the maintenance, which is either yes or no. Looking at this table, we have all the categories and subcategories of the products, with a special identifier for this information. Now the question is how to join it. I would actually like to join it with the product information, so let's check those two tables together. Okay, in the products we don't have any ID for the categories, but we actually have this information inside the product key: the first five characters of the product key are the category ID, so we can use that to join with the categories. We describe this relationship in the diagram, give the table a name, and note that the ID can be joined using the product key. That means that for the product information we don't need the product ID, the primary key, at all; all we need is the product key, or the product number.

What I would also like to do is group this information in boxes. Grab a box on the left side, make it bigger, round the edges a little, remove the fill and the line, make it a dotted line, then grab another box saying 'CRM', increase the size to maybe 35, bold, change the color to blue, and place it on top of the box. With that we can see that all those tables belong to the source system CRM, and we do the same for the right side. Of course, we also have to add the description here: the product categories.

All right, with that we now have a clear understanding of how the tables are connected to each other; we understand the content of each table, and of course this is going to help us clean up the data in the silver layer in order to prepare it. As you can see, it is very important to take the time to understand the structure of the tables and the relationships between them before writing any code. So we now have a clear understanding of the sources, and we have created a data integration diagram in draw.io that gives us more understanding of how to connect them. In the next two tasks we go back to SQL, where we start checking the quality and doing a lot of data transformations. Let's go.
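The category relationship noted above, joining on the first five characters of the product key, might be sketched like this (the table and column names are illustrative):

```sql
-- The first five characters of the product key encode the category ID.
SELECT
    p.prd_key,
    p.prd_nm,
    c.cat,
    c.subcat
FROM bronze.crm_prd_info AS p
LEFT JOIN bronze.erp_px_cat AS c
    ON SUBSTRING(p.prd_key, 1, 5) = c.id;  -- match the category ID embedded in the key
```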
Now let's have a quick look at the specifications of the silver layer. The main objective is clean, standardized data: we prepare the data before it goes to the gold layer. We will be building tables inside the silver layer, and the way of loading from bronze to silver is a full load, meaning truncate and then insert. There will be a lot of data transformations here: cleansing, normalization, standardization, deriving new columns, and data enrichment. But we will not be building any new data model. Those are the specifications, and we have to commit ourselves to this scope.

Building the DDL script for this layer is much easier than for the bronze, because the definition and structure of each table in the silver is identical to the bronze; we are not doing anything new. All you have to do is take the DDL script from the bronze layer and search and replace the schema. I'm using Notepad++ for the scripts, so I replace "bronze." with "silver." and hit replace all. With that, the whole DDL targets the silver schema, which is exactly what we need.

Before we execute the new DDL script for the silver layer, we have to talk about something called metadata columns. These are additional columns, or fields, that data engineers add to each table. They don't come from the source systems; the data engineers use them to attach extra information to each record. For example: a create date (when the record was loaded), an update date (when the record was last updated), the source system (to understand the origin of the data), or sometimes the file location (to understand the lineage, i.e. which file the data came from). They are a great tool when you have data issues in your warehouse: if there is corrupt data, they help you track exactly where and when the issue happened. They also help you spot gaps in your data, especially with incremental loads. It is like putting labels on everything; you will thank yourself later when you need them in hard times, when there is an issue in your data warehouse.

Back to our DDL scripts. For the first table, I add one extra column at the end. It starts with the prefix "dwh", as we defined in the naming convention, then an underscore: the create date, with the data type DATETIME2. We also give it a default value, because I want the database to generate this information automatically, without specifying it in any ETL script. Which value? GETDATE(), so every record inserted into this table automatically receives the current date and time. Here the naming convention really matters: all the other columns come from the source system, and only this one comes from the data engineers of the data warehouse.

We repeat the same thing for all the other tables, adding this column to each DDL, and then execute the whole DDL script for the silver layer. Perfect, no errors. Refreshing the tables in the object explorer, we now have six tables in the silver layer, identical to the bronze layer but with one extra metadata column.
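As a sketch, the metadata-column pattern just described looks like this in DDL. The table and column names here are assumptions modeled on the customer table discussed earlier; the actual identifiers in the project may differ:

```sql
-- Silver table: identical structure to its bronze counterpart,
-- plus one metadata column added by the data engineers.
CREATE TABLE silver.crm_cust_info (
    cst_id             INT,
    cst_key            NVARCHAR(50),
    cst_firstname      NVARCHAR(50),
    cst_lastname       NVARCHAR(50),
    cst_marital_status NVARCHAR(50),
    cst_gndr           NVARCHAR(50),
    cst_create_date    DATE,
    -- dwh_ prefix = column owned by the warehouse, not the source.
    -- The default lets the database stamp each row on insert,
    -- so no ETL script ever has to set it.
    dwh_create_date    DATETIME2 DEFAULT GETDATE()
);
```

The same one-line addition is repeated at the end of every silver table's DDL.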
Now, in the silver layer, before we start writing any data transformations and cleansing, we first have to detect the quality issues in the bronze. Without knowing the issues, we cannot find solutions, right? We explore the quality issues first; only then do we write the transformation scripts. Let's go.

We are going to go through all the tables of the bronze layer, clean up the data, and insert it into the silver layer. Let's start with the first bronze table from the source CRM: the bronze CRM customer info. Query the data; of course, before writing any transformations, we have to detect and identify the quality issues of this table.

I usually start with the first check, the primary key: we check whether it contains nulls and whether it contains duplicates. To detect duplicates, we aggregate the primary key; if any value exists more than once, the key is not unique and the table has duplicates. So the query selects the customer ID with a COUNT, groups the data by the primary key, and keeps only the rows with an issue: HAVING COUNT(*) > 1. Executing it shows this table has problems. Several IDs exist more than once, which is completely wrong, since a primary key must be unique, and there are also three records where the primary key is empty, which is just as bad. One subtlety: if there were only a single null, it would not appear in this result, because one occurrence does not satisfy the count condition. So I go back and add a condition: OR the primary key IS NULL.
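The whole primary-key check, as one query. The column and table names (cst_id, bronze.crm_cust_info) are assumptions; the expectation for a healthy table is an empty result:

```sql
-- Duplicates: any key whose grouped count is above one.
-- Nulls: added explicitly, since a single NULL would not
-- satisfy the COUNT condition on its own.
SELECT
    cst_id,
    COUNT(*) AS occurrences
FROM bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1
    OR cst_id IS NULL;
```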
Just in case there is only one null, I still want to see it in the results; running the query again gives the same output. This is a quality check you can run on any table, and this table is clearly not meeting expectations, so we have to do something about it.

Let's create a new query, where we will write the data transformation and cleansing. Start again by selecting the data and executing it. What I usually do is focus on the issue: take one of the duplicated values and filter on it before writing the transformation, WHERE customer ID equals that value. Now we see the issue clearly: the ID exists three times, but we are interested in only one of them. How do we pick? Usually we look for a timestamp or date column to help us. Checking the create date, this record here is the newest and the previous two are older. If I have to pick one, I take the latest, because it holds the freshest information. So we rank the rows by create date and pick only the highest: we need a ranking function, and for that SQL has the amazing window functions. We use ROW_NUMBER() OVER, then PARTITION BY to divide the table by the customer ID, and then, to rank the rows, we have to sort by something: ORDER BY the create date, descending, so the highest comes first. Finally we give the new column a name.
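Assembled end to end, the deduplication being built here might look as follows. Identifiers are assumed; the rank column is named flag_last in the next step, and only rank 1 survives:

```sql
-- Keep exactly one row per customer: the one with the newest
-- create date. Everything with flag_last > 1 is an older duplicate.
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY cst_id              -- one window per customer
            ORDER BY cst_create_date DESC    -- newest record ranks 1
        ) AS flag_last
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL                 -- drop the empty keys
) t
WHERE flag_last = 1;
```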
We name it flag_last and execute. The data is sorted by the creation date within each customer: the newest record gets rank one, the older one two, and the oldest three. Of course we are interested in rank number one. Removing the single-ID filter and checking the whole table, the flag is 1 almost everywhere, because those primary keys exist only once; where there are duplicates we also get 2, 3, and so on.

We can double-check this by wrapping the query: SELECT * FROM the query WHERE flag_last is not equal to 1. Querying that shows exactly the data we don't need: the rows causing duplicates in the primary key, carrying an old status. So we flip the condition to flag_last = 1, which guarantees the primary key is unique and each value exists exactly once. Querying it like this, no duplicates remain, and checking the earlier problem ID confirms it now exists only once, with the freshest data for that key. With that we have defined a transformation that removes any duplicates.

Moving on: this table has a lot of string columns, and for string values we have to check for unwanted spaces. Let's write a query that detects them: select the first name from bronze customer info and run it. Just by looking at the data it is really hard to spot unwanted spaces, especially at the end of a word, but there is a very easy way to detect those issues.
The way to detect them is a filter: the first name is not equal to the first name after trimming. The TRIM function removes all leading and trailing spaces, so if a value is not equal to its trimmed version, there is an issue. It is very simple; executing it, the result lists all first names with spaces either at the start or at the end. Again, the expectation for a clean column is no results.

We can check the other string columns the same way. The last name also returns customers with spaces, which is not good. The gender returns no results, so that column's quality is better: no unwanted spaces.

Now we write the transformation to clean those two columns. First I list all the columns in the query explicitly instead of using the star. Then we wrap the first name and the last name in TRIM, which is very simple, keeping the same column names as aliases. Querying this, both columns are now cleaned of any unwanted spaces.

Moving on, we have two more pieces of information: the marital status and the gender. Looking at their values, these columns have low cardinality: only a limited number of possible values is used in them.
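The space check and its fix from above, sketched with assumed column names:

```sql
-- Detection: a value that changes after TRIM has leading or
-- trailing spaces. Expectation: no rows.
SELECT cst_firstname
FROM bronze.crm_cust_info
WHERE cst_firstname != TRIM(cst_firstname);

-- Transformation: trim in the cleansing SELECT, keeping the
-- original column names as aliases.
SELECT
    TRIM(cst_firstname) AS cst_firstname,
    TRIM(cst_lastname)  AS cst_lastname
FROM bronze.crm_cust_info;
```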
For such low-cardinality columns we usually check the data consistency, which is very simple: SELECT DISTINCT on the column. For the gender there are only three possible values: NULL, F, or M. We could leave it like that, but let's make a rule for the whole project: no data abbreviations, only friendly full names. Instead of F we store the full word Female, instead of M Male, and every time we encounter gender information we apply the same rule.

So we map the values with a CASE WHEN: when the gender equals 'F' then 'Female', and when it equals 'M' then 'Male'. Then we have to make a decision about the nulls, which we clearly have: leave them as NULL, or always use a default value? Let's say in our project we replace all missing values with a standard default, so ELSE 'n/a' (not available; you could go with 'unknown' of course). That handles the gender, and we remove the old column.

One more thing I usually do in this case: currently we receive capital F and capital M, but over time the source might start sending lowercase letters. To make sure we can still map those values correctly, we wrap the column in UPPER. And since we already saw unwanted spaces in the first and last name, you might not trust this column either, so you can also add TRIM, just to make sure you catch all those cases. Executing: no more M and F, we have the full words Male and Female, and where there is no value we get 'n/a' instead of NULL.

Now the same for the marital status. It also has only three possibilities: S, NULL, and M. Copy the same construct: S becomes Single, M becomes Married, and the NULLs become 'n/a'. With that we are doing data standardization for this column as well. Executing it, both columns now show full, friendly values, and at the same time the nulls are handled.

That leaves the last column, the create date. For this type of information we make sure the column is a real date, not a string or varchar; we defined it as a date in the data type, which is completely correct, so nothing to do with this column. The next step is to write the insert statement: we go to the top of the query and say INSERT INTO silver CRM customer info, and then we have to specify all the columns that should be inserted.
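The standardization built above, as one hedged sketch before it goes into the insert (column names assumed):

```sql
-- Map coded values to friendly names. UPPER/TRIM guard against
-- future lowercase values or stray spaces; ELSE handles the nulls
-- with a standard default.
SELECT
    CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
         WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
         ELSE 'n/a'
    END AS cst_gndr,
    CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
         WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
         ELSE 'n/a'
    END AS cst_marital_status
FROM bronze.crm_cust_info;
```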
So we type out the column list, with our cleansing query below it, and execute. With that we have inserted clean data into the silver table.

Now we take all the queries we used to check the quality of the bronze table, move them to another query window, and change bronze to silver. The primary-key check: no results, so no duplicates. The first-name trim check: no results, perfect. The last name: also clean. The low-cardinality columns such as the gender: we now see 'n/a', Male, and Female, perfect. And a final look at silver customer info shows all the columns in order. You can also see the metadata column we added to the table definition doing its job: it records when each record was inserted into the table, which is a really valuable piece of information for tracking and auditing.

Now, looking back at the script, we have done several types of data transformations. First, on the first name and last name, trimming: this is one type of data cleansing, where we remove unnecessary spaces or unwanted characters to ensure data consistency. Next, the CASE WHEN: this is data normalization, sometimes called data standardization, another type of cleansing where we map coded values to meaningful, user-friendly descriptions; we applied it to the gender as well. Inside the same CASE WHEN we also handled the missing values: handling missing data is likewise a type of cleansing, filling the blanks with, for example, a default value, so that instead of an empty string or a NULL we have 'n/a' or 'unknown'. Another transformation in this script is removing duplicates: also a type of cleansing, where we ensure only one record per primary key by identifying and retaining only the most relevant row. And since we are removing duplicates, we are of course also doing data filtering. Those are the different types of data transformations in this script.

Moving on to the second table in the bronze layer from the CRM: the product info. As usual, before writing any transformations we search for data quality issues, starting with the first check, the primary key: group the data by the key and check for duplicates or nulls. Executing it, everything is safe; no duplicates or nulls in the primary key.

Next, the product key. This column packs a lot of information into one string, so we will split it and derive two new columns. The first is the category ID: the first five characters are actually the category ID, and we can use the SUBSTRING function to extract part of a string.
SUBSTRING needs three arguments: the column to extract from, the position to start at, and the length, i.e. how many characters to extract. Since the first part sits on the left, we start at position 1 and take five characters (1 2 3 4 5). That's it for the category ID; executing, we have a new column containing the first part of the string.

The other source system in our database also has a category ID, so let's double-check that we will be able to join the data. Querying the ID from the bronze ERP category table, we see the IDs of the categories, and in the silver layer we will have to join those two tables. But there is still an issue: that table has an underscore between the category and the subcategory, while our value has a minus. We have to replace it with an underscore so the information matches between the two tables; otherwise we will not be able to join them. So we use the REPLACE function, replacing the minus with an underscore. Executing now gives an underscore, exactly like the other table.

Of course, we can check whether everything matches with a very simple query: our new derived value NOT IN a subquery over the ERP category IDs, trying to find any category ID that is not available in the second table. Executing it, only one category has no match; looking at the ERP table (let me make it a little bigger), we really don't find it there, which may simply be correct.
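Deriving and validating the category ID might look like this; the ERP table and column names here are assumptions for illustration:

```sql
-- First five characters of the product key, with '-' swapped
-- for '_' so the value matches the ERP category table's format.
SELECT
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    prd_key
FROM bronze.crm_prd_info;

-- Validation: derived category IDs with no counterpart on the
-- ERP side. Expectation: few or no rows.
SELECT DISTINCT
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id
FROM bronze.crm_prd_info
WHERE REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_')
      NOT IN (SELECT id FROM bronze.erp_px_cat);
```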
So that one missing category is fine, and our check is okay. With the first part done, we extract the second part the same way, with SUBSTRING and its three arguments on the product key, but this time we don't cut from the first position: counting the characters (1 2 3 4 5 6 7), we start at position 7. Then we have to define the length, how many characters to extract, and here is the catch: the product keys have different lengths, unlike the fixed-width category ID, so we cannot use a fixed number; we have to make it dynamic. There is a trick for that: use the length of the whole column as the length argument. That way we always take enough characters and never lose any information. Executing it, we are now extracting the second part of the string: the product key.

Why do we need this product key? In order to join with another table, the sales details. Let's check that table; the column name is the SLS product key, in bronze CRM sales details, and the data looks wonderful, so we should actually be able to join this information together. Of course we verify: WHERE our new column NOT IN the subquery over the sales keys, just to make sure we are not missing anything. Executing it, quite a lot of products have no order, and I don't have a good feeling about it, so let's try something: take one of those keys, say WHERE the sales product key LIKE this value, cutting off the last three characters to search more broadly inside the table.
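The dynamic-length extraction and the join check against the sales table, with assumed identifiers:

```sql
-- Start at position 7; passing LEN() as the length argument
-- always reaches the end of the string, whatever the key's length.
SELECT
    SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key
FROM bronze.crm_prd_info;

-- Keys with no match in the sales table turn out to be products
-- that simply have no orders, not a join problem.
SELECT SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key
FROM bronze.crm_prd_info
WHERE SUBSTRING(prd_key, 7, LEN(prd_key))
      NOT IN (SELECT sls_prd_key FROM bronze.crm_sales_details);
```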
We really don't have such keys; cutting the key down further and searching again still finds nothing. So there is no order for any product whose key starts with that prefix. Let's remove the filter; we are still able to join the tables, right? Flipping the check from NOT IN to IN matches all the remaining products, so everything is actually fine: these are just products that don't have any orders. With that I'm happy with this transformation.

Moving on, the next column is the name of the product. We check it for unwanted spaces: back to our quality checks, making sure to use this table, with the product name and the trim comparison. Running it: it looks really fine, nothing to trim, this column is safe.

Next, the cost. Here we have numbers, and we have to check the quality of the numbers: are there nulls or negative values? Negative costs or negative prices are not realistic, depending on the business of course; let's say in our business there are no negative costs. So we check for values less than zero or costs that are null. As the result shows, there are no negative values, but there are nulls. We can handle that by replacing the null with a zero, if the business allows it. In SQL Server there is a very nice function for this called ISNULL: we are saying, if the value is null, replace it with a zero. It is very simple; give it a name of course, execute, and there are no more nulls, only zeros, which is better for calculations if you later use aggregate functions like the average.
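The cost check and the null fix, as a sketch (column names assumed):

```sql
-- Quality check: costs must not be negative or null.
-- Expectation: no rows.
SELECT prd_cost
FROM bronze.crm_prd_info
WHERE prd_cost < 0 OR prd_cost IS NULL;

-- Fix: ISNULL substitutes 0 for NULL, so later aggregates
-- such as AVG behave predictably.
SELECT ISNULL(prd_cost, 0) AS prd_cost
FROM bronze.crm_prd_info;
```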
Moving on to the next column, the product line. This is again an abbreviation of something, and the cardinality is low, so let's check all possible values with SELECT DISTINCT on the product line. Executing it, the possible values are NULL, M, R, S, and T. Those are abbreviations, but in our data warehouse we have decided to use full, friendly names, so we have to replace those codes with friendly values. To learn what they stand for, I usually go and ask an expert on the source system or the business process.

Let's build the CASE WHEN, using UPPER and TRIM again just to make sure we cover all the cases: if the product line equals 'M' the friendly value is 'Mountain', 'R' is 'Road', 'S' stands for 'Other Sales', and 'T' stands for 'Touring', with an ELSE of 'n/a' at the end, since we don't want any nulls. We name it product line as before, remove the old column, and execute: no more shortcuts and abbreviations, we now have full friendly values (I'll capitalize the O in Other Sales, it looks nicer).

Now, looking at this CASE WHEN, notice that it is always a one-to-one mapping from one value to another, and we repeat UPPER and TRIM every single time. For a simple mapping like this, CASE has a quick form: the syntax is very simple, we write CASE followed by the expression once, so that value is evaluated up front, and then each WHEN lists just the value, without the equals sign.
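The two CASE forms side by side, with assumed identifiers; both produce the same mapping:

```sql
-- Searched form: each WHEN repeats the full condition.
SELECT
    CASE WHEN UPPER(TRIM(prd_line)) = 'M' THEN 'Mountain'
         WHEN UPPER(TRIM(prd_line)) = 'R' THEN 'Road'
         WHEN UPPER(TRIM(prd_line)) = 'S' THEN 'Other Sales'
         WHEN UPPER(TRIM(prd_line)) = 'T' THEN 'Touring'
         ELSE 'n/a'
    END AS prd_line
FROM bronze.crm_prd_info;

-- Simple (quick) form: the expression is evaluated once after
-- CASE, and each WHEN lists only the value to match.
SELECT
    CASE UPPER(TRIM(prd_line))
         WHEN 'M' THEN 'Mountain'
         WHEN 'R' THEN 'Road'
         WHEN 'S' THEN 'Other Sales'
         WHEN 'T' THEN 'Touring'
         ELSE 'n/a'
    END AS prd_line
FROM bronze.crm_prd_info;
```

The simple form only works for plain equality mapping; complex conditions still need the searched form.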
So if it is an M, make it Mountain, and the same for the rest. The functions appear only once, and we don't have to keep repeating them over and over. This quick form only applies when you are mapping values; if you have complex conditions, you still do it the long way. For now I'll stay with the quick form of the CASE WHEN, it looks nicer and shorter, and executing it gives the same results.

Okay, back to our table and the last two columns: the start and end date, which together define an interval. Let's check their quality: SELECT * FROM our bronze table WHERE the end date is smaller than the start date. Querying this, we see the start is sometimes after the end, which makes no sense at all, so we have a data issue with these two dates.

For this kind of data transformation, what I usually do is grab a few examples, put them in Excel, and think about how to fix it. Here I took two products, with three rows each, showing exactly this situation. So how are we going to fix it? Let me make a copy for one candidate solution, which is very simple: switch the start date with the end date. If I grab the end dates and put them at the starts, things look much nicer: the start is always earlier than the end. But my friends, the data now makes no sense. One row says the price was 12 from 2007 to 2011, while an overlapping row says it was 14 over a range that intersects it; take the year 2010, and the price was 12 and 14 at the same time. It is really bad to have an overlap between two records like that.
So there should be no overlap between the records: one record should end in 2011, say, and the next should start in 2012. It is not enough that each start is before its end; the end of one history record must also be before the start of the next record. That is the additional rule that prevents overlapping. There is another problem in the example: one record has no start but already has an end, which is not okay, because every new record in a historization has to have a start; so that record is wrong as well. Having a start without an end, on the other hand, is fine: it indicates the current information about the cost. So this swap solution does not work at all.

Now for the solution that does work: ignore the end date from the source completely and take only the start dates, then rebuild the end date entirely from the start dates, following the rules we defined. The rule says: the end date of the current record comes from the start date of the next record. So we take the next record's start date and put it as the end date of the previous record. With that it works: the end date is after the start date, and the record no longer overlaps with the next one. To make it even nicer, we subtract one, taking the previous day, so the end date is strictly smaller than the next start. The same applies to the following record: its end date comes from the next start date, minus one. Comparing the rows, each end is still after its start and still before the next record's start, so there is no overlap. And for the last record, since there is no further information, the end date is a null, which is totally fine.

As you can see, I'm really happy with this scenario. Of course you would validate it with an expert from the source system; let's say I've done that and they approved it, so now we can clean up the data using this new logic. This is how I usually brainstorm a fix: if it is something complex, I use Excel and then discuss a concrete example with the expert. That is much better than showing database queries; it just makes things easier to explain and to discuss. I also usually focus on only the columns I need and one or two scenarios while building the logic, and once everything is ready I integrate it into the query. So now I'm focusing only on these columns, and only on these products.

Let's build the logic in SQL. If you are at a specific record and want to access information from another record, there are two amazing window functions for that: LEAD and LAG. In this scenario we want to access the next record, so we go with LEAD. Let's build it: LEAD of the start date, since we want the start date of the next record, then OVER, then PARTITION BY, because the window should cover only one product; and we divide the data by the product key, not the product ID. Then of course we sort the data: ORDER BY the start date, ascending, from the lowest to the highest. Give it a name, say "test", just to test the data. Executing it, I notice I missed something in the PARTITION BY.
PARTITION BY is wrong, so let's fix it and execute again, then check the results. For the first partition the start is 2011 and the end is 2012; this value came from the next record, moved back to the previous one, and the same holds for the following record, so our logic is working. The last record is NULL because we are at the end of the window and there is no next row, which is exactly what we want. What's still missing is taking the previous day, which we do simply by subtracting one, so there is no overlapping between consecutive dates. With that we have built a perfect end date, far better than the original data from the source system. Now let's put this expression into our query: we don't need the original end date, we use our new one, remove the "test" alias, and execute — now it looks perfect. One more thing about these two dates: they are datetime values, but the time component is always zero, so it makes no sense to keep it. We can do a simple CAST of both columns to DATE instead of DATETIME. Trying that out, it looks much nicer without the time information. Of course we could report all these issues to the source system, but since they don't provide times, datetime makes no sense here. It was a long run, but we now have clean product information, much nicer than the original product data from the source CRM. If you grab the DDL of the silver table, you can see we don't have a category ID yet; we have only the product ID and the product key.
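The end-date rebuild described above — take the next record's start date within each product's window and subtract one day — can be sketched in a few lines. This is a minimal illustration using SQLite through Python's sqlite3 module; the course uses SQL Server, where you would write `DATEADD(DAY, -1, LEAD(prd_start_dt) OVER (...))` instead of SQLite's `DATE(..., '-1 day')`. The table and column names here are illustrative, not the course's exact DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prd_info (prd_key TEXT, prd_start_dt TEXT);
INSERT INTO prd_info VALUES
  ('AC-HE-HL', '2011-07-01'),
  ('AC-HE-HL', '2012-07-01'),
  ('AC-HE-HL', '2013-07-01');
""")

# End date = next start date within the same product, minus one day.
# The last record in each window gets NULL (no next record = current version).
rows = conn.execute("""
SELECT prd_key,
       prd_start_dt,
       DATE(LEAD(prd_start_dt) OVER (
              PARTITION BY prd_key ORDER BY prd_start_dt
            ), '-1 day') AS prd_end_dt
FROM prd_info
ORDER BY prd_start_dt
""").fetchall()
for r in rows:
    print(r)
```

Note that the rebuilt end dates never overlap the next start date by construction, and the open-ended NULL marks the current record.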
We also changed the data type of those two date columns from DATETIME to DATE, which means a few modifications to the DDL: add the category ID with the same data type, and make the start and end columns DATE instead of DATETIME. Let's execute that to repair the DDL. This is something that happens in the silver layer: sometimes we have to adjust the metadata, either because the data types are not good or because we are building new derived columns in order to integrate the data later. It stays very close to the bronze layer, but with a few modifications, so make sure to update your DDL scripts. The next step is to insert the result of the query that cleans the bronze table into the silver table. As before: INSERT INTO the silver product info table, then list all the columns (I've prepared them already), and run the query to insert the data. SQL inserted the data, and the very important next step is to check the quality of the silver table. We go back to our data quality checks and switch them to the silver layer: the primary key has no issues, the trim checks find nothing, the costs are neither negative nor NULL, the standardized values are friendly with no NULLs, and, very interestingly, the order of the dates has no issues either. Finally I take a last look at the silver table, and everything is inserted correctly into the correct columns. All those columns come from the source system, and the last one is generated automatically by the DDL and indicates when we loaded the table. Now let's sit back and look at the script: what different types of data transformations have we done? For the category ID and the product key we derived new columns, i.e., we created new columns based on calculations or transformations of existing ones. Sometimes we need columns only for analytics, and we can't always go to the source system and ask them to create them, so we derive our own. Another transformation is the ISNULL: we are handling missing information, replacing NULL with zero. For the product line we did data normalization, replacing code values with friendly values, and we also handled missing data, replacing NULL with "not available". Next, we did data type casting, converting from one type to another, which also counts as a transformation. And finally, besides casting, we did data enrichment: adding new, relevant data to the data set. Those are the transformation types for this table. Let's keep going: the sales details, the last table in the CRM. The order number is a string, so we can check for unwanted spaces using TRIM and a comparison. Executing it,
we can see there are no unwanted spaces, so this column needs no transformation and can stay as it is. The next two columns are keys whose purpose is to connect this table to others: as we learned before, the product key connects to the product info, and the customer ID connects to the customer ID in the customer info. So we have to check the referential integrity of those columns: WHERE the product key is NOT IN a subquery, and this time we can work against the silver layer, selecting the product key from the silver product info. Querying it returns no issues, which means every product key in the sales details can be connected to the product info. The same check for the customer ID, this time against the customer info (the column there was cst_id), also returns no issues, so the sales can be joined to the customers without any transformations. Things look really nice for those three columns. Now we come to the challenging part: the dates. These dates are not actual dates, they are integers, and we don't want that; we have to change the data type from integer to date. When converting an integer to a date, we have to be careful with the values inside each column. So let's check the quality of the order date: are there values less than zero? No negative values, which is good. Are there zeros? Yes, there are a lot of zeros, which is bad.
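The referential-integrity check described above — find any key in the fact table that has no match in the dimension — can be sketched with a `NOT IN` subquery. A minimal illustration using SQLite via Python's sqlite3 (the course runs this against SQL Server silver tables; the table and column names below are hypothetical stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_details (sls_prd_key TEXT);
CREATE TABLE prd_info (prd_key TEXT);
INSERT INTO sales_details VALUES ('BK-M68B'), ('BK-XYZ9');
INSERT INTO prd_info VALUES ('BK-M68B');
""")

# Any row returned here is an orphan: a sales record whose product key
# cannot be joined to the product dimension.
orphans = conn.execute("""
SELECT sls_prd_key
FROM sales_details
WHERE sls_prd_key NOT IN (SELECT prd_key FROM prd_info)
""").fetchall()
print(orphans)
```

An empty result means every key can be joined, which is the outcome the transcript reports for both the product key and the customer ID. (Beware that `NOT IN` against a subquery containing NULLs matches nothing, so dimension keys should be NOT NULL for this check.)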
What we can do is replace those values with NULL using the NULLIF function: NULLIF(order date, 0) turns a zero into NULL. Executing it, all those values are now NULL. Looking at the data again: the integer has the year at the start, then the month, then the day, so the length of each number must be exactly eight digits; if the length is less than or greater than eight, we have an issue. So we add: OR LENGTH(order date) is not equal to 8, covering both shorter and longer. Checking the results, a couple of values don't look like dates at all; we cannot turn them into real dates, they are simply bad data. You can also check the boundaries of a date: for example, it should not be higher than, say, 2050-01-01, and not lower than some date depending on when your business started. We are of course getting the too-short values again because they have fewer than eight digits, but if there were values outside those boundaries, the query would catch them as well, so we add the remaining checks. All of these validate a column that holds date information in an integer data type. To summarize the issues: we have zeros, and we have strange numbers that cannot be converted to dates. Let's fix that in the query: CASE WHEN the order date equals zero OR its length is not equal to 8, THEN NULL — we don't want to deal with those values, they are simply wrong and
they are not real dates. Otherwise, ELSE, we keep the order date, but we convert it to a date rather than leave it as an integer. How? In SQL Server you cannot cast directly from integer to date; you first cast to VARCHAR, and then from VARCHAR to DATE. So we cast twice, close with END, and keep the same column name. That is how we transform an integer into a date. Querying it, the order date is now a real date, not a number, so we can get rid of the old column. We do the same for the shipping date: replace the column name everywhere and query. The shipping date turns out to be perfect, with no issues at all, but since we found so many problems in the order date, I'll apply the same rules to the shipping date anyway, just in case the same issues appear in the future. If you prefer not to apply them now, you should at least build quality checks that run every day to detect such issues, and transform once they appear; for now I'll apply the rules right away. Then we move to the due date and run the same test: also perfect, and still I apply the same rules, making sure to replace the column name everywhere in the query. Executing it: perfect. We now have the order date, shipping date, and due date, all typed as DATE and free of wrong data. There is still one more check we can do.
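The integer-to-date conversion above (treat zeros and non-eight-digit values as NULL, otherwise convert the YYYYMMDD integer) can be sketched as follows. This is a minimal SQLite illustration via Python's sqlite3; in SQL Server, as the transcript says, the ELSE branch would be `CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)`, while SQLite has no such cast, so the sketch splices the digits into an ISO string instead. Column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (sls_order_dt INTEGER)")
conn.executemany("INSERT INTO s VALUES (?)",
                 [(20101229,), (0,), (5489,)])  # good, zero, too short

# Zeros and values whose length is not 8 digits cannot be real
# YYYYMMDD dates, so they become NULL; the rest are converted.
rows = conn.execute("""
SELECT CASE
         WHEN sls_order_dt = 0 OR LENGTH(sls_order_dt) != 8 THEN NULL
         ELSE SUBSTR(sls_order_dt, 1, 4) || '-' ||
              SUBSTR(sls_order_dt, 5, 2) || '-' ||
              SUBSTR(sls_order_dt, 7, 2)
       END AS order_dt
FROM s
ORDER BY rowid
""").fetchall()
print(rows)
```

The same expression is then reused verbatim for the shipping date and the due date, which is why applying it defensively to all three columns costs little.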
The order date should always be smaller than the shipping date and the due date, because it makes no sense to deliver an item before it is ordered: first the order happens, then we ship. So there is a natural order among those dates, and we can check it: search for invalid records WHERE the order date is higher than the shipping date, OR the order date is higher than the due date. Running the check: very good, there are no such mistakes, the order date is always smaller than the shipping and due dates, so no transformation or cleanup is needed. Okay friends, moving on to the last three columns: the sales, the quantity, and the price. These three are connected to each other by a business rule: the sales must equal quantity multiplied by price, and all sales, quantity, and price values must be positive numbers — negative, zero, or NULL are not allowed. Those are the business rules, and we have to check the data consistency in our table: do all three columns follow them? We start with the calculation rule: search WHERE the sales is not equal to quantity multiplied by price, i.e., where the result does not match our expectation. We also check the NULLs: OR sales IS NULL OR quantity IS NULL OR price IS NULL. And we check for negatives and zeros: each column less than or equal to zero. With that we are checking the calculation as well as NULL, zero, and negative values. Let's check the data; I'm
going to add a DISTINCT and query it. Of course there is bad data, and we can sort the results by sales, quantity, and price to read them. Looking at the data: in the sales column we have NULLs, negative numbers, and zeros — all the bad combinations — plus bad calculations: here the price is 50 and the quantity is one, but the sales is two, which is not correct, and there are more wrong calculations where we should have 10 or 9, or maybe the price is wrong. The quantity column, by contrast, has no NULLs, zeros, or negatives, so it looks better than the sales. The price column has NULLs and negatives, though no zeros. So the quality of the sales and price is bad: the calculation doesn't hold and we have these scenarios. How do I handle this? I don't try to transform everything on my own; I talk to an expert, someone from the business or from the source system, show these scenarios, and discuss. There are usually two answers. Either they tell me "I will fix it in my source", and then I have to live with incoming bad data being visible in the warehouse until the source system cleans it up. Or they say "we don't have the budget and this data is really old; we won't do anything". Then you have to decide: leave it as it is, or improve the quality of the data in the warehouse — but ask the experts to support you in solving these issues, because the fix really depends on their rules, and different rules mean different transformations. So let's say we are given the following rules: if the sales value is NULL, negative, or zero, derive it from the formula by multiplying
the quantity by the price. If the price is wrong, for example NULL or zero, calculate it from the sales and the quantity. And if the price is negative, like -21, convert it to positive 21 — from negative to positive, without any calculation. Those are the rules; now let's build the transformations step by step. We start with the new sales: CASE WHEN the sales IS NULL, OR the sales is a negative number or equal to zero, OR — the other scenario — we have a sales value that does not follow the calculation, i.e. the sales is not equal to the quantity multiplied by the price. But we won't use the price as-is: we wrap it in the ABS function, the absolute value, which converts negatives to positives. THEN we use the calculation, the quantity multiplied by the price — meaning we do not use the value from the source system, we recalculate it. If the sales is correct and matches none of those scenarios, ELSE we keep the sales as it comes from the source, because it is correct. Close with END and keep the same column name; I'll rename the original column as the old value, and the same for the price. The quantity I won't touch, because it is correct. Now let's transform the price. Again, CASE WHEN: the price IS NULL, OR the price is less than or equal to zero, THEN we do the calculation, the sales divided by the quantity. But here we have to make sure we are not dividing by zero: currently we don't
have any zeros in the quantity, but in the future we might get one, and the whole query would break. So if we ever get a zero, we replace it with NULL: NULLIF(quantity, 0). Now, if the price is not NULL and not negative or zero, everything is fine, and the ELSE keeps the price as it comes from the source. Close with END AS price. I'm happy with that; let's execute and check. Comparing the old values to the new, transformed ones: previously the sales was NULL, now it is 2, since 2 multiplied by 1 gives 2, so the sales is correct. In the next one the old sales is 40 but the price is 2 and the quantity is 1, so the new sales is correctly 2, not 40. Next, the old sales is zero, but price times quantity gives 4, so the new sales is correctly 4. Then a negative sales: the price multiplied by 1 should give 9, and the new sales is correct. Now a scenario where the price is NULL: we had no price, but we calculated it from the sales and the quantity, 10 divided by 2 is 5, so the new price is better. And the same for the negatives: -21 in the source becomes 21 in the output, which is correct. I don't see any scenario where the new data is wrong; everything looks better than before. With that we have applied the business rules from the experts and cleaned up the data in the warehouse, and we are now presenting better data for analysis and reporting.
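The two CASE expressions built above — recalculate the sales when it is NULL, non-positive, or inconsistent with quantity × ABS(price), and derive the price from sales ÷ NULLIF(quantity, 0) when it is NULL or non-positive — can be sketched as below. A minimal SQLite illustration via Python's sqlite3; the transcript's sample values are reused, the table name is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (sls_sales INTEGER, sls_quantity INTEGER,"
             " sls_price INTEGER)")
conn.executemany("INSERT INTO s VALUES (?,?,?)", [
    (None, 1, 2),   # NULL sales       -> recalculated as 1*2
    (40,   1, 2),   # wrong calculation -> recalculated as 1*2
    (10,   2, None),# NULL price       -> derived as 10/2
    (9,    1, -9),  # negative price   -> ABS gives 9
])

rows = conn.execute("""
SELECT
  CASE WHEN sls_sales IS NULL OR sls_sales <= 0
         OR sls_sales != sls_quantity * ABS(sls_price)
       THEN sls_quantity * ABS(sls_price)
       ELSE sls_sales END AS sales,
  sls_quantity AS quantity,
  CASE WHEN sls_price IS NULL OR sls_price <= 0
       THEN sls_sales / NULLIF(sls_quantity, 0)  -- avoid divide-by-zero
       ELSE sls_price END AS price
FROM s
ORDER BY rowid
""").fetchall()
print(rows)
```

One subtlety worth noting: when the price is NULL, the comparison `sls_sales != sls_quantity * ABS(sls_price)` evaluates to NULL rather than true, so the sales branch falls through to ELSE and the original sales is kept — which is exactly what the rules intend, since the price (not the sales) is the broken value in that row.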
It is challenging, though, and you have to understand the business exactly. Now we copy these expressions and integrate them into the query: instead of the raw sales we use our new calculation, and instead of the raw price our corrected calculation — I was missing an END here, so add it and run the whole thing again. With that, the sales, quantity, and price are cleaned and follow our business rules, and we are done cleaning the sales details. The next step is to insert the result into the silver sales details table, but first we have to compare these results against the DDL. The order number is fine, as are the product key and the customer ID, but there is an issue: the three date columns are now DATE, not integer, so we have to change their data types — giving us better types than before. The sales, quantity, and price are correct. So we drop the table and create it from scratch, and don't forget to update the DDL script. Then we insert the results into the silver sales details table, listing all the columns (I have the list prepared; make sure the column order is correct) and run the insert. SQL inserted the data, and now it is very important to check the health of the silver table. In the quality checks we switch from bronze to silver: the order date is always smaller than the shipping and due dates, which is really nice. I'm especially interested in the calculations, so I switch that check to silver as well and get rid of the scratch calculations we no longer need. Now let's see
whether there are any issues. Perfect: our data follows the business rules, with no NULLs, negative values, or zeros. As usual, the final step is a last look at the table: the order number, the product key, the customer ID, the three dates, the sales, quantity, and price, and of course our metadata column. Everything is perfect. Looking back at the code, what types of data transformations are we doing? In the three date columns we are handling invalid data — which is itself a type of transformation — and at the same time doing data type casting, changing to a more correct type. For the sales we are handling missing and invalid data by deriving the column from an existing one, and it is very similar for the price, which we handle by deriving it from a specific calculation. Those are the transformation types in this script. All right, let's keep moving to the next source system: the ERP customer table, AZ12. Here we have only three columns; let's start with the ID. This table holds customer information, and if we check our integration model, we can connect it to the CRM customer info table using the customer key, so we have to make sure those two tables can actually be joined. Let's check the other table — we can of course use the silver layer — and query both tables. We can see there are extra characters in the ID that are not included in the customer key from the CRM. Let's search for one of these customers, WHERE cid LIKE a similar pattern. Now, as you can
see, we find the customer, but the issue is the three extra characters, "NAS". There is no specification or explanation for why they are there, so we have to remove them. Checking the data again, it looks like the old records have "NAS" at the start while the newer ones don't, so we must clean up these IDs to be able to connect the table with the others. We do it with a CASE WHEN, since there are two scenarios in the data: if the cid is LIKE 'NAS%', i.e. the ID starts with those three characters, we apply a transformation function; otherwise it stays as it is. Now we build the transformation using SUBSTRING: first the string, which is the cid; then the position where extraction starts — counting 1, 2, 3, the prefix ends, so we start at position four; then how many characters to extract. I'll make that dynamic with LEN(cid) rather than counting by hand. So: if it starts with "NAS", extract from the cid, starting at position four, the rest of the characters. Executing it (after adding a missing comma): records without the "NAS" prefix, and those further down, are unaffected. With that we have a clean ID we can join to the other table. Of course we can test it: WHERE the whole transformation (with the alias removed, since we don't need it) is NOT IN a simple subquery, SELECT DISTINCT cst_key — the customer key — from the silver CRM customer info. Let's check:
as you can see, it is working fine; after the transformation there is no unmatching data between the ERP customer table and the CRM. If we remove the transformation, we find a lot of unmatching data, which means our transformation is working perfectly, and we can remove the original column. That's it for the first column. Moving on to the next field: the birth date of the customers. The first thing to do is check the data type: it is already a DATE, not an integer or a string, so there is nothing to convert. Still, there are things to check with birth dates, such as out-of-range values. For example, we can check for really old dates, say before 1924-01-01, taking the first day of the month. It turns out we have customers older than 100 years — maybe that's correct, but it sounds strange, so it's something to confirm with the business. Then we check the other boundary, where it is effectively impossible to have a customer whose birth date is in the future: WHERE the birth date is higher than the current date. (Running it requires an OR between the two conditions.) Checking the list, we have birth dates in the future, which is totally unacceptable — an indicator of bad data quality. You can report it to the source system to have it corrected; otherwise it's up to you what to do with those dates: leave them as bad data, clean them up by replacing all of them with NULL, or maybe
replacing only the extreme ones that are 100% incorrect. Let's write the transformation for that. As usual, CASE WHEN the birth date is larger than the current date and time, THEN NULL; otherwise, ELSE the birth date as it is; then END AS bdate. Executing it, we should no longer get any customer with a birth date in the future. That's it for the birth dates; now the gender. The gender column is low cardinality, so we have to check all the possible values inside it: SELECT DISTINCT gen FROM our table. Executing it, the data does not look good: we have a NULL, an "F", an empty string, "Male", "Female", and an "M". We're going to clean all of that up so there are only three values: "Male", "Female", and "not available". CASE WHEN, and we TRIM the values to make sure there are no surrounding spaces, and also apply the UPPER function so that any lowercase values we might get in the future are covered as well. So: when the upper-trimmed value is "F" or "FEMALE", make it "Female"; the same for the male — when it is "M" or "MALE" (in capital letters, because we are using UPPER), make it "Male"; otherwise, for all other scenarios, whether empty strings or NULLs, it should be "not available". Close with END AS gen. Now let's test it and check that everything is covered: the "M" is now "Male", the "F" is "Female", the empty string (or spaces) is "not available",
"Female" stays as it is, and the same for "Male". With that we cover all the scenarios and follow the standards of the project. I cut this and put it into the original query, execute the whole thing, and all three columns are cleaned. Now, did we change anything in the DDL? No: we introduced no new column and changed no data type, so the next step is to insert into the silver layer. As usual: INSERT INTO the silver ERP customer table, then list the column names — cid, bdate, and gen — and execute. It inserted all the data, and the very important next step is to check the data quality: go back to the check queries and switch from bronze to silver. We still see the very old customers, but we didn't change those; we only replaced the future birth dates, and those no longer appear in the results, so everything is clean. Next, checking the distinct genders, we have only the three values. Finally, a last look at the table: the cid, the birth date, the gender, and then our metadata column — everything looks great. So, what types of data transformations have we done? For the ID, we handled invalid values by removing the unneeded prefix. The same goes for the birth dates: we handled invalid values there as well. And for the gender we did data normalization, mapping the codes to friendlier values, and also handled the missing values. Those are the transformation types in this code.
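The three cleanups for this ERP customer table — stripping the "NAS" prefix from the ID, nulling out future birth dates, and normalizing the gender codes — can be sketched together. This is a minimal SQLite illustration via Python's sqlite3; the course uses SQL Server, where you would write `SUBSTRING(cid, 4, LEN(cid))` and `GETDATE()` instead of the two-argument `SUBSTR` and `DATE('now')` below, and the table name and sample values here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE erp_cust (cid TEXT, bdate TEXT, gen TEXT)")
conn.executemany("INSERT INTO erp_cust VALUES (?,?,?)", [
    ("NASAW00011000", "9999-01-01", "F"),      # prefix + future birth date
    ("AW00011001",    "1971-10-06", " male "), # spaces + lowercase
    ("AW00011002",    "1976-05-10", None),     # missing gender
])

rows = conn.execute("""
SELECT
  -- drop the unexplained 3-character 'NAS' prefix on old records
  CASE WHEN cid LIKE 'NAS%' THEN SUBSTR(cid, 4) ELSE cid END AS cid,
  -- birth dates in the future are invalid, so replace them with NULL
  CASE WHEN bdate > DATE('now') THEN NULL ELSE bdate END AS bdate,
  -- normalize coded values; TRIM + UPPER covers spaces and casing
  CASE WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
       WHEN UPPER(TRIM(gen)) IN ('M', 'MALE')   THEN 'Male'
       ELSE 'not available' END AS gen
FROM erp_cust
ORDER BY rowid
""").fetchall()
print(rows)
```

Note that the NULL gender lands in the ELSE branch automatically, because `NULL IN (...)` is never true, so one CASE covers nulls, empty strings, and unknown codes.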
Okay, moving on to the second ERP table: the location information, ERP location A101. The task here is easy, because there are only two columns. Checking the integration model, we can connect this table to the customer info from the other system using the cid and the customer key; those two values must match for the join to work, so let's check the data. We select the cst_key from the silver customer info and compare. Looking at the result, there is an issue with the cid: it has a minus sign between the characters and the numbers, while the customer key has nothing splitting the characters from the numbers. If we joined these two columns as they are, it would not work, so we have to get rid of the minus sign — it is completely unnecessary. The fix is very simple: take the cid and REPLACE the minus with nothing, an empty string. Querying it again, the values now look very similar to each other. We can also verify it: WHERE our transformation is NOT IN a subquery against the customer info. Executing it, we find no unmatching data, which means the transformation works and the two tables can be connected. If I take the transformation away, we find a lot of unmatching data, so the transformation is correct and stays. Now let's talk about the countries. We have multiple values here, but the column is low cardinality, so we have to check all the possible values inside it; that means we are
checking whether the data is consistent so we can do it like this distinct the country from our table I’m just going to go and copy it like this and as well I’m going to go s the data by the country so let’s go and check the informations now you can see we have a null we have an empty string which is really bad and then we have a full name of country and then we have as well an abbreviation of the countries well this is a mix this is not really good because sometimes we have the E and sometimes we have Germany and then we have the United Kingdom and then for the United States we have like three versions of the same information which is as well not really good so the quality of the is not really good so let’s go and work on the transformation as usual we’re going to start with the case win if trim country is equal to D then we’re going to transform it to Germany and the next one it’s going to be about the USA so if trim country is in so now let’s go and get those two values the US and the USA so us and USA then it’s going to be the United States States states so with that we have covered as well those three cases now we have to talk about the null and the empty string so we’re going to say when trim country is equal to empty string or country is null then it’s going to be not available otherwise I would like to get the country as it is so trim country just to make sure that we don’t have any leading or trailing spaces so that’s it let’s go and say this is the country so it is working and the country information is transformed and now what I’m going to do I’m going to take the whole new transformation and compare it to the old one let me just call this as old country and let’s go and query it so now we can check those value State as before so nothing did change the de is now Germany the empty string is not available the null the same thing and the United Kingdom State as like it’s like before and now we have one value for all those information so it’s only the United 
States so it looks perfect and with that we have cleaned as well the second column so with that we have now clean results and now the question did we change anything in the ddl well we haven’t changed anything both of them are varar so we can go now immediately and insert it into our table so insert into silver customer location and here we have to specify the columns it’s very simple the ID and the country so let’s go and execute it and as you can see we got now inserted all those values of course as a next we go and double check those informations I would just go and remove all those stuff as well here and instead of bronze let’s go with the silver so as you can see all the values of the country looks good and let’s have a final look to the table so like this so we have the IDS without the separator we have the countries and as well our metadata information so with that we have cleaned up the data for the location okay so now what are the different types of data transformation that we have done here is first we have handled invalid values so we have removed the minus with an empty string and for the country we have done data normalization so we have replaced codes with friendly values and as well at the same time we have handled missing values by replacing the empty string and null with not available and one more thing of course we have removed the unwanted spaces so those are the different types of transformation that we have done for this table okay guys now keep the energy up keep the spirit up we have to go and clean up the last table in the bronze layer and of course we cannot go and Skip anything we have to check the quality and to detect all the errors so now we have a table about the categories for the products and here we have like four columns let’s go and start with the first one the ID as you can see in our integration model we can connect this table together with the product info from the CRM using the product key and as as you remember in the silver 
layer we have created an extra column for that in the product info. So if you go and select that data, you can see we have a column called category ID, and this one is exactly matching the ID that we have in this table, and we have done the testing, so this ID is ready to be used together with the other table — there is nothing to do over here. Now for the next columns: they are strings, and of course we can go and check whether there are any unwanted spaces. So we are checking for the unwanted spaces. Let's go and check: select star from, and we're going to go and get the same table like this here, and first we are checking the category — so the category is not equal to the category after trimming the unwanted spaces. Let's go and execute it, and as you can see we don't have any results, so there are no unwanted spaces. Let's go and check the other column, for example the subcategory, the next one. So let's get the subcategory and run the query — as well we don't have anything, so that means we don't have unwanted spaces for the subcategory. Let's go now and check the last column — I will just copy and paste — get the maintenance, and let's go and execute. As well, no results, perfect: we don't have any unwanted spaces inside this table. Now the next step is that we're going to go and check the data standardization, because all those columns have low cardinality. So what we're going to do, we're going to say select distinct, let's get the category from our table — I'll just copy and paste it — and check all values. As you can see we have the accessories, bikes, clothing and components; everything looks perfect, we don't have to change anything in this column. Let's go and check the subcategory, and if you scroll down, all values are friendly and nice as well — nothing to change here. And let's go and check the last column, the maintenance. Perfect, we have only two values, yes and no, and we don't have any nulls. So my friends, that means this table has really nice data quality and we don't have to clean up anything, but still we have to follow our process: we have to go and load it from the bronze to the silver, even if we didn't transform anything. Our job is really easy here: we're going to go and say insert into silver dot ERP PX and so on, and we're going to go and define the columns — it's going to be the ID, the category, subcategory, maintenance. That's it, let's go and insert the data. Now, as usual, what we're going to do is go and check the data, so silver ERP PX, let's have a look. All right, we can see the IDs are here, the categories, the subcategories, the maintenance, and we have our meta columns, so everything is inserted correctly. All right, so now I have all those queries and the insert statements for all six tables, and now what is important: before inserting any data, we have to make sure that we are truncating and emptying the table, because if you run this query twice, what's going to happen? You will be inserting duplicates. So first truncate the data, and then do a full load — insert all data. We're going to have one step before, like in the bronze layer: we're going to say truncate table, and then we will be truncating the silver customer info, and only after that do we go and insert the data. And of course we can go and print this nice message at the start: first we are truncating the table, and then inserting. So if I go and run the whole thing — let's go and do it — it will be working, and if I run it again, we will not have any duplicates. We have to go and add this step before each insert, so let's go and do that. All right, I'm done with all the tables, so now let's go and run everything. Let's go and execute it, and we can see in the messages that everything is working perfectly: with that we made all the tables empty, and then we inserted the data. Perfect — with that we have a nice script that loads the silver layer, but of course, like the bronze layer, we're going to put everything in one stored procedure. So let's go and do that. We'll go to
the beginning over here and say create or alter procedure, and we're going to put it in the schema silver, and, using the naming convention, call it load silver. We're going to go over here and say begin, take the whole code — it is a long one — give it one push with a tab, and then at the end we're going to say end. Perfect, so we have our stored procedure — but we forgot the AS here; with that we will not have any error. Let's go and execute it, and the stored procedure is created. If you go to the programmability, you will find two procedures: load bronze and load silver. So now let's go and try it out. All you have to do now is execute silver dot load silver. Let's execute the stored procedure, and with that we will get the same results — this stored procedure is now responsible for loading the whole silver layer. Now of course the messaging here is not really good, because as we learned in the bronze layer, we can go and add many things, like handling errors, doing nice messaging, and catching the duration time. So now your task is to pause the video, take this stored procedure, and go and transform it to be very similar to the bronze layer, with the same messaging and all the add-ons that we have added. So pause the video now — I will do it as well offline, and I will see you soon. Okay, so I hope you are done and I can show you the results. It's like the bronze layer: we have defined a few variables at the start in order to catch the duration, so we have the start time, the end time, the batch start time and the batch end time, and then we are printing a lot of stuff in order to have nice messaging in the output. At the start we are saying loading the silver layer, and then we start splitting by the source system, so loading the CRM tables. I'm going to show you only one table for now. We are setting the timer, so we are saying start time, get the current date and time into it, then we are doing the usual — we are truncating the table and then inserting the new information after cleaning it up — and we have this nice load duration message, where we find the difference between the start time and the end time using the function DATEDIFF, and we want to show the result in seconds. So we are just printing how long it took to load this table, and we're going to go and repeat this process for all the tables. And of course we are putting everything in try and catch: SQL is going to try to execute the try part, and if there are any issues, SQL is going to execute the catch, and here we are just printing a few pieces of information, like the error message, the error number and the error state. We are following exactly the same standard as the bronze layer. So let's go and execute the whole thing, and with that we have updated the definition of the stored procedure. Let's go now and execute it: execute silver dot load silver. Let's go and do that — it went very fast, less than one second, again because we are working on a local machine. Loading the silver layer, loading the CRM tables, and we can see this nice messaging: it starts with truncating the table, inserting the data, and we are getting the load duration for each table. You will see that everything is below one second, and that's because we're local — in a real project you will of course get more than one second. At the end we have the load duration of the whole silver layer. And now I have one more thing for you. Let's say that you are changing the design of this stored procedure for the silver layer: you are adding different types of messaging, or maybe you are creating logs and so on. All those new ideas and redesigns that you are doing for the silver layer — you always have to think about bringing the same changes into the other stored procedure for the bronze layer as well. So always try to keep your code following the same standards: don't have one idea in one stored procedure and an old idea in another one. Always try to maintain those scripts and keep them all up to date, following the same standards; otherwise it can be really hard for other developers to understand the code. I know that needs a lot of work and commitment, but this is your job: to make everything follow the best practices and the naming conventions and standards that you have put in place for your project. So guys, now we have two very nice ETL scripts: one that loads the bronze layer and another one for the silver layer. Running our data warehouse is now very simple. All you have to do is run the bronze layer first — with that we are taking all the data from the CSV files from the source and putting it inside our data warehouse in the bronze layer, refreshing the whole bronze layer. Once it's done, the next step is to run the stored procedure of the silver layer: once you execute it, you are taking all the data from the bronze layer, transforming it, cleaning it up, and then loading it into the silver layer. As you can see, the concept is very simple: we are just moving the data from one layer to another, with different tasks. All right guys, so as you can see, in the silver layer we have done a lot of data transformations, and we have covered all the types that we have in data cleansing: we removed duplicates, did data filtering, handled missing data, invalid data, unwanted spaces, cast data types and so on, and as well we have derived new columns, done data enrichment, and normalized a lot of data. What we have not done yet, of course: business rules and logic, data aggregations and data integration — this is for the next layer. All right my friends, so finally we are done cleaning up the data and checking the quality of our data, so we can go and close those two steps, and now, on to the next step: we have to go and extend the data flow diagram. So let's go. Okay, so now let's go and extend our data flow for the silver layer. What I'm going to do is just go and copy the whole thing and put it side by side with the bronze layer, and let's call it silver layer. The table names are going to stay as before, because we have a one-to-one mapping with the bronze layer, but what we're going to do is change the coloring: I'm going to go and mark everything and make it gray, like silver. And of course what is very important is to draw the lineage, so I'm going to go now from the bronze and take an arrow and put it to the silver table, and with that we have a lineage between three layers: if you are checking this table, the customer info, you can understand — aha, this comes from the bronze layer, from the customer info, and as well this comes from the source system CRM. So now you can see the lineage between the different layers, and without looking at any scripts, in one picture you can understand the whole project. I don't have to explain a lot of stuff: by just looking at this picture you can understand how the data is flowing between the sources, the bronze layer, the silver layer, and, of course, later the gold layer. As you can see, it looks really nice and clean. All right, so with that we have updated the data flow; next we're going to go and commit our work to the git repo, so let's go. Okay, so now let's go and commit our scripts. We're going to go to the folder scripts, and here we have a silver layer folder — if you don't have it, of course, you can go and create it. First we're going to put in the DDL scripts for the silver layer, so I will paste the code over here, and as usual we have this comment at the header explaining the purpose of the script. Let's go and commit our work, and we're going to do the same thing for the stored procedure that loads the silver layer. I'm going to go over here — I already have a file for that — so let's go and paste it. We have here our stored procedure, and as usual at the start we have the header comment: this script is doing the ETL process where we load the data from bronze into silver, and the action is to truncate the table first and then insert the transformed, cleaned data from bronze to silver.
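The truncate-then-insert pattern with timing and try/catch described above can be sketched roughly as follows. This is only an illustration, not the author's exact script: the object and column names (silver.load_silver, bronze.erp_loc_a101, silver.erp_loc_a101, cid, cntry) are assumptions following the project's naming style, and the other five tables are elided.

```sql
-- Hedged sketch of the silver load procedure; object/column names are assumed.
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME;
    BEGIN TRY
        PRINT '=== Loading the Silver Layer ===';

        -- One truncate-then-insert block per table (location table shown here)
        SET @start_time = GETDATE();
        PRINT '>> Truncating: silver.erp_loc_a101';
        TRUNCATE TABLE silver.erp_loc_a101;

        PRINT '>> Inserting: silver.erp_loc_a101';
        INSERT INTO silver.erp_loc_a101 (cid, cntry)
        SELECT
            REPLACE(cid, '-', '') AS cid,          -- remove the unwanted separator
            CASE                                    -- normalize country codes
                WHEN TRIM(cntry) = 'DE' THEN 'Germany'
                WHEN TRIM(cntry) IN ('US', 'USA') THEN 'United States'
                WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'Not Available'
                ELSE TRIM(cntry)
            END AS cntry
        FROM bronze.erp_loc_a101;

        SET @end_time = GETDATE();
        PRINT '>> Load duration: '
            + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';

        -- ...repeat the same block for the remaining five silver tables...
    END TRY
    BEGIN CATCH
        PRINT 'ERROR LOADING SILVER LAYER';
        PRINT 'Message: ' + ERROR_MESSAGE();
        PRINT 'Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'State  : ' + CAST(ERROR_STATE() AS NVARCHAR);
    END CATCH
END
```

Because every table is truncated before its insert, executing `EXEC silver.load_silver` twice produces the same result as executing it once — a full reload with no duplicates.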
There are no parameters at all, and this is how you can use the stored procedure. Okay, so we're going to go and commit our work, and now there is one more thing that we want to commit to our project: all those queries that you have built to check the quality of the silver layer. This time we will not put them in the scripts; we're going to go to the tests folder, and here we're going to make a new file called quality checks silver, and inside it we're going to paste all the queries that we have built — I just reorganized them here by table. So here we can see all the checks that we have done during the course, and at the header we have nice comments: we are just saying that this script is going to check the quality of the silver layer, and we are checking for nulls, duplicates, unwanted spaces, invalid date ranges and so on. Each time you come up with a new quality check, I'm going to recommend that you share it with the project and with the other teams, in order to make it part of the multiple checks that you do after running the ETLs. So that's it — I'm going to go and put those checks in our repo, and in case I come up with a new check, I'm going to go and update it. Perfect, so now we have our code in our repository. All right, so with that our code is safe and we are done with the whole epic: we have built the silver layer. Let's go and minimize it, and now we come to my favorite layer: the gold layer. We're going to go and build it. The first step, as usual, is to analyze, and this time we're going to explore the business objects — so let's go. All right, so now we come to the big question: how are we going to build the gold layer? As usual, we start with analyzing, and what we're going to do here is explore and understand the main business objects that are hidden inside our source systems. As you can see, we have two sources and six files, and here we have to identify what the business objects are. Once we have this understanding, then we can start coding, and here the main transformation that we are doing is data integration, and I usually split it into three steps. First, we're going to go and build those business objects that we have identified. Once we have a business object, we have to look at it and decide what type of table it is: is it a dimension, is it a fact, or is it maybe a flat table? And the last step, of course, is that we now have to rename all the columns into something friendly and easy to understand, so that our consumers don't struggle with technical names. Once we have all those steps, it's time to validate what we have created: the new data model should be connectable, and we have to check that the data integration is done correctly. Once everything is fine, we cannot skip the last step: we have to document and commit our work to git, and here we will be introducing a new type of documentation. We're going to have a diagram of the data model, we're going to build a data dictionary where we describe the data model, and of course we can extend the data flow diagram. So this is our process; those are the main steps that we will follow in order to build the gold layer. Okay, so what exactly is data modeling? Usually the source system is going to deliver raw data for you: unorganized, messy, not very useful in its current state. Data modeling is the process of taking this raw data and then organizing and structuring it in a meaningful way. What we are doing is putting the data into new, friendly, easy-to-understand objects, like customers, orders, products — each one of them focused on specific information — and, very importantly, we're going to describe the relationships between those objects by connecting them using lines. What we have built on the right side we call a logical data model; if you compare it to the left side, you can see the data model makes it really easy to understand our data, the relationships, and the processes behind them. Now, in data modeling we have three different stages, or let's say three different ways of drawing a data model. The first stage is the conceptual data model: here the focus is only on the entities, so we have customers, orders, products, and we don't go into details at all — we don't specify any columns or attributes inside those boxes. We just want to focus on what entities we have and the relationships between them. So the conceptual data model doesn't focus on the details at all; it just gives the big picture. The second data model that we can build is the logical data model, and here we start specifying the different columns that we can find in each entity — like the customer ID, the first name, the last name and so on — and we still draw the relationships between those entities, and we make it clear which columns are the primary keys and so on. As you can see, we have more details here, but one thing: we don't describe a lot of details for each column, and we are not worried about exactly how we are going to store those tables in the database. In the third and last stage we have the physical data model: this is where everything gets ready before creating it in the database, so here you have to add all the technical details, like the data type and length for each column, and many other database techniques and details. Again: the conceptual data model gives us the big picture, the logical data model dives into the details of what data we need, and the physical data model prepares everything for the implementation in the database. To be honest, in my projects I only draw the conceptual and the logical data models, because drawing and building the physical data model needs a lot of effort and time, and there are many tools — like Databricks — that automatically generate those models. So in this project, what we're going to do is draw the logical data model for the gold layer. All right, so now, for analytics, and especially for data warehousing and business intelligence, we need a special data model that is optimized for reporting and analytics, and it should be flexible, scalable, and easy to understand. For that we have two special data models. The first type is the star schema: it has a central fact table in the middle, surrounded by dimensions. The fact table contains transactions and events, and the dimensions contain descriptive information, and the relationship between the fact table in the middle and the dimensions around it forms a star shape — that's why we call it a star schema. We have another data model called the snowflake schema; it looks very similar to the star schema — again we have the fact in the middle, surrounded by dimensions — but the big difference is that we break the dimensions into smaller subdimensions, and the shape of this data model, as you extend the dimensions, is going to look like a snowflake. Now if you compare them side by side, you can see that the star schema looks easier, right? It is usually easy to understand, easy to query, and really perfect for analysis, but it has one issue: the dimensions might contain duplicates, and your dimensions get bigger over time. If you compare that to the snowflake, you can see the schema is more complex, so you need a lot of knowledge and effort in order to query something from the snowflake, but the main advantage here comes with the normalization: as you break those redundancies into small tables, you can optimize the storage. But to be honest, who cares about the storage? So for this project I have chosen to use the star schema, because it is very commonly used, perfect for reporting — for example with Power BI — and we don't have to worry about the storage. That's why we are going to adopt this model to build our gold layer. Okay, so now one more thing about
those data models: they contain two types of tables, facts and dimensions. So when I say this is a fact table or a dimension table — well, a dimension contains descriptive information, or categories that give some context to your data. For example, the product info: you have the product name, category, subcategories and so on. This is a table that describes the product, and this we call a dimension. On the other hand we have facts: they are events, like transactions, and they contain three important kinds of information. First, you have multiple IDs from multiple dimensions; then you have information like when the transaction or event happened; and the third type of information is measures and numbers. If you see those three types of data in one table, then it is a fact. So if you have a table that answers how much or how many, then it is a fact, but if you have a table that answers who, what, where, then it is a dimension table. That's what dimension and fact tables are. All right my friends, so far, in the bronze layer and in the silver layer, we didn't discuss anything about the business: the bronze and silver were very technical. We were focusing on data ingestion, on cleaning up the data, on the quality of the data, but the tables are still very oriented to the source systems. Now comes the fun part, in the gold layer, where we're going to go and break up the whole data model of the sources: we're going to create something completely new for our business that is easy to consume for business reporting and analysis. Here it is very, very important to have a clear understanding of the business and its processes, and if you don't know it already, at this phase you really have to invest time, by meeting maybe the process experts or the domain experts, in order to have a clear understanding of what we are talking about in the data. So now what we're going to do is try to detect the business objects that are hidden in the source systems — let's go and explore that. All right, in order to build a new data model, I first have to understand the original data model: what are the main business objects that we have, and how are things related to each other. This is a very important step in building a new model. What I usually do is start giving labels to all those tables. If you go to the shapes over here, let's go and search for label, and if you go to more icons, I'm going to take this label over here, drag and drop it, and then increase the size of the font — let's go with 20 and bold, just to make it a little bit bigger. Now, by looking at this data model, we can see that we have product information in the CRM and as well in the ERP, and then we have customer information and a transactional table. So let's focus on the product: the product information is over here — we have the current and the historical product information — and here we have the categories that belong to the products. So in our data model we have something called products; let's go and create this label, it's going to be the products, and let's give it a color in the style — let's pick, for example, the red one. Now let's go and move this label and put it beneath this table over here, so that I have a label saying this table belongs to the object called products. I'm going to do the same thing for the other table over here, so I'm going to go and tag this table with the products label as well, so that I can easily see which tables from the sources have information about the product business object. All right, moving on, we have here a table called customer information, where we have a lot of information about the customer, and as well in the ERP we have customer information where we have the birthday and the country. Those three tables have to do with the object customer, so that means we're going to go and label them like that: let's call it customer, and I'm going to go and pick a different color for that — let's go with the green. I will tag this table like this, and the same thing for the other tables: copy, tag the second table and the third table. Now it is very easy for me to see which tables belong to which business objects. And now we have the final table over here, only one table about the sales and orders — in the ERP we don't have any information about that, so this one is going to be easy: let's call it sales, let's move it over here, and maybe change its color to, for example, this color over here. Now, this step is very important when building any data model in the gold layer: it gives you a big picture of the things that you are going to model. So in the next step we're going to go and build those objects, step by step. Let's start with the first object, our customers: here we have three tables, and we're going to start with the CRM, so let's start with this table over here. All right, so with that we know what our business objects are, and this task is done. In the next step we're going to go back to SQL and start doing data integration and building a completely new data model. So let's go and do that. Now let's have a quick look at the gold layer specifications: this is the final stage, where we're going to provide data to be consumed by reporting and analytics, and this time we will not be building tables, we will be using views. That means we will not have a stored procedure or any load process for the gold layer; all you are doing is data transformation, and the focus of the data transformation is going to be data integration, aggregation, business logic and so on. And this time we're going to introduce a new data model: we will be doing a star schema. Those are the specifications for the gold layer, and this is our scope. This time we make sure that we are selecting data from the silver layer, not from the bronze, because the bronze has bad data quality, while in the silver everything is prepared and cleaned up — so to build the gold layer we are going to target the silver layer. Let's start with select star from, and we're going to go to the silver CRM customer info. Let's hit execute, and now we're going to select the columns that we need to present in the gold layer. Let's start selecting the columns that we want: we have the ID, the key, the first name — I will not go and get the metadata information; that belongs only to the silver. Perfect. The next step is that I'm going to give this table an alias, so let's call it CI, and I'm going to make sure that we are selecting from this alias, because later we're going to join this table with other tables — something like this. We're going to go with those columns. Now let's move to the second table: let's go and get the birthday information. We're going to jump to the other system, and we have to join the data by the CID together with the customer key. So now we have to join the data with another table, and here I try to avoid using the inner join, because if the other table doesn't have all the information about the customers, I might lose customers. Always start with the master table, and if you join it with any other table in order to get information, always try to avoid the inner join, because the other source might not have all the customers, and if you do an inner join, you might lose customers. So I tend to start from the master table, and then everything else is a left join. I'm going to say left join silver ERP customer AZ12 — let's give it the alias CA — and now we have to join the tables: from the first table it's going to be the customer key, equal to CA's CID. Now of course we're going to get matching data, because we checked it in the silver layer, but if we hadn't prepared the data in the silver layer, we would have to do a preparation step here in order to join the tables. We don't have to do that, because that
was a prep in the silver layer, so now you can see the system we have in this bronze-silver-gold flow. After joining the tables, we have to pick the information we need from the second table, which is the birthdate, and there is another nice piece of information in that table as well: the gender. That's all we need from the second table. The third table holds the location information, the countries, and again we connect the tables using the customer ID as the key. So we say LEFT JOIN the silver ERP location table, give it the alias la, and join on the keys: the customer key equal to la's customer ID. Again, we prepared those IDs and keys in the silver layer, so the join should work. The third table has the ID, the country, and some metadata columns; we take only the country. With that we have joined all three tables and picked all the columns we want in this object, so we have collected all the customer information from the two source systems.

Now let's query the result to make sure everything is correct. To verify your joins, keep your eye on the joined columns: if you are getting data, the joins are working; if you see a lot of NULLs, or no data at all, the joins are wrong. Here it looks fine. Another check I always do: even if the first table has no duplicates, multiple joins can start producing duplicates, because the relationship between the tables may not be a clean one-to-one; it could be one-to-many or many-to-many. So at this stage I make sure the result has no duplicates, meaning no multiple rows for the same customer. To do that, we wrap the whole join in a subquery, GROUP BY the customer ID, and say HAVING COUNT(*) > 1; this query finds any duplicated primary keys. Executing it returns nothing, so joining those tables with the customer info didn't duplicate my data. This is a very important check to make sure you are on the right track.

So duplicates are fine, but we do have an integration issue. Looking at the data, we have two sources for the gender information: one from the CRM and one from the ERP. What are we going to do with this? We have to do data integration. Let me show you how I do it: I open a new query, remove everything else, keep only those two gender columns with a DISTINCT to focus on the integration, and add an ORDER BY 1, 2. Executing it, we now see all the scenarios. Sometimes the two sources match, female in the first table and female in the other, but sometimes the two tables give conflicting information, and that is an issue.
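The duplicate check described above can be sketched like this. The table and column names are my guesses at the silver-layer schema (the transcript doesn't spell them out), so adjust them to your own:

```sql
-- Check that the three-way join did not duplicate the primary key.
-- An empty result means the joins preserved one row per customer.
SELECT cst_id, COUNT(*) AS cnt
FROM (
    SELECT ci.cst_id
    FROM silver.crm_cust_info ci
    LEFT JOIN silver.erp_cust_az12 ca ON ci.cst_key = ca.cid  -- birthdate, gender
    LEFT JOIN silver.erp_loc_a101  la ON ci.cst_key = la.cid  -- country
) t
GROUP BY cst_id
HAVING COUNT(*) > 1;
```

Any rows returned here point at a join key whose relationship is not one-to-one, which is exactly the failure mode the transcript warns about.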
In another scenario, the first table has female but the other table has "not available"; that's not a problem, we take it from the first table. We also have the exact opposite, where the first table is not available but the value exists in the second. And here you might wonder why we are still getting a NULL: we handled all missing data in the silver layer and replaced everything with "n/a", so why a NULL? This NULL doesn't come from the content of the tables; it comes from the join. There are customers in the CRM table that don't exist in the ERP table, and when there is no match, SQL returns NULL. So this NULL means "no match". That is an issue too, but the big issue is the scenarios where both systems have data and they disagree. Here we have to ask the experts: which system is the master for customer information, the CRM or the ERP? Let's say their answer is that the CRM is the master, meaning the CRM values are more accurate than the ERP values, for customers only of course. So where we have female versus male, the correct value is female from the first source system; the same goes for the other row, where male versus female means male is correct, because the CRM is the master.

Now let's build this business rule, starting as usual with CASE WHEN. The first, very important rule: if the CRM (the master) has a gender value, use it. So we check that the CRM gender is not equal to "n/a"; that means we have male or female, and we use the value from the master. Otherwise the value is not available in the CRM, so we grab it from the second table. But we have to be careful with that join NULL and convert it to "n/a" as well, so we wrap the ERP gender in COALESCE. Close with END and call the column new_gen for now. Executing it and checking the scenarios: wherever the CRM has data, the new column uses it; where the CRM doesn't, the ELSE kicks in and we take the ERP value, with COALESCE replacing the join NULL with "n/a". In the last case the value is missing in both source systems, so we get "n/a". As you can see, we have a perfect new column integrating two different source systems into one, and this is exactly what we call data integration. This piece of information is richer than either the CRM or the ERP alone, and this is exactly why we pull data from different source systems: to get richer information into the data warehouse.

So we have a nice logic, and as you can see it's much easier to build it in a separate query first and then move it into the original query. I copy the expression, go back to our main query, replace the old gender column with the new logic, and execute: we have our nice new column. With that we have a very good object: no duplicates, integrated data, three tables combined into one. The next step is to give friendly names. The rule in the gold layer is to use friendly names rather than the source-system names, and to follow our naming convention, which is snake_case. Step by step: the first column becomes customer_id; the next one, dropping the "key" wording, becomes customer_number, because those are customer numbers; then first_name and last_name without any prefixes; marital_status keeps its name minus the prefix; then gender, create_date, birthdate, and finally country. Executing it, the names are really friendly and easy to understand: customer_id, customer_number, first_name, last_name, marital_status, gender, and so on. Next I think about column order: the first and last name belong together, and the country is important information, so I move it to sit right after the last name.
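The integration rule built above can be sketched as follows; again the table and column names are illustrative, and "n/a" stands in for whatever placeholder your silver layer uses:

```sql
-- Gender integration: CRM is the master; fall back to ERP,
-- and map join-produced NULLs (no ERP match) to 'n/a'.
SELECT DISTINCT
    ci.cst_gndr,                                   -- CRM gender
    ca.gen,                                        -- ERP gender
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr -- master has a value: use it
        ELSE COALESCE(ca.gen, 'n/a')               -- else take ERP; NULL -> 'n/a'
    END AS new_gen
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_cust_az12 ca ON ci.cst_key = ca.cid
ORDER BY 1, 2;
```

Running the rule with DISTINCT like this lets you eyeball every combination of the two sources before moving the CASE expression into the main query.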
It's just nicer, so let's execute again: first name, last name, country. It's always nice to group related columns together. Then we have the marital status, the gender, and the dates; I'll swap the birthdate with the create date, since the birthdate is more important, not forgetting a comma. Execute again; it looks wonderful.

Now comes a very important decision about this object: is it a fact table or a dimension? As we learned, dimensions hold descriptive information about an object, and here all the columns describe the customer. We have no transactions or events and no measures, so this object is clearly not a fact; it is a dimension, and that's why we'll call this object the customer dimension. One more thing: when you create a new dimension, you always need a primary key. We could rely on the primary key we get from the source system, but sometimes a dimension has no primary key you can count on, so instead we generate a new primary key in the data warehouse, called a surrogate key. Surrogate keys are system-generated unique identifiers assigned to each record to make it unique. A surrogate key is not a business key; it has no meaning and no one in the business knows about it. We use it only to connect our data model, which gives us more control over how the model connects and means we don't always depend on the source system. There are different ways to generate surrogate keys, such as defining them in the DDL or using the window function ROW_NUMBER; in this data warehouse I'll go with the simple solution and use the window function.

Generating a surrogate key for this dimension is very simple: we say ROW_NUMBER() OVER with an ORDER BY. You could order by the create date, the customer ID, or the customer number, whatever you want; in this example I'll order by the customer ID. Following our naming convention, all surrogate keys get the suffix _key, so this one is customer_key. Querying the result, the customer_key appears at the start as a clean sequence with no duplicates. This surrogate key is generated inside the data warehouse, and we'll use it to connect the data model. With that the query is ready, and the last step is to create the object. As we decided, all objects in the gold layer will be virtual, so we create a view: CREATE VIEW gold.dim_customers AS, where the dim prefix follows the naming convention for dimensions. Executing it succeeds, and under Views you can now see our first object: the customer dimension in the gold layer.

As you know me, next we check the quality of this new object: SELECT * from our dim_customers view, make sure every column is in the right position, then run checks like uniqueness. I'm especially worried about the gender information, so I run a DISTINCT over all its values, and it works perfectly: only female, male, and n/a. That's it; we have our first new dimension.

Okay friends, now let's build the second object: the products. Product information is available in both source systems; as usual we start with the CRM data and then join the ERP table to get the category information. Those are the columns we want from the CRM table. Now we come to a big decision about this object: it contains both historical and current information. Whether you keep the history depends on the requirements; if you don't have to analyze historical product versions, you can keep only the current product information and skip the history. And remember, in our model we are not using this table's primary key but the product key. So what we have to do is filter out the historical data and keep only the current data, using a WHERE condition that targets the end date.
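Putting the pieces together, the customer dimension view might look like this; the silver-layer names are my assumptions based on the transcript, and the column order follows the reordering done above:

```sql
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,  -- surrogate key
    ci.cst_id             AS customer_id,
    ci.cst_key            AS customer_number,
    ci.cst_firstname      AS first_name,
    ci.cst_lastname       AS last_name,
    la.cntry              AS country,
    ci.cst_marital_status AS marital_status,
    CASE WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr          -- CRM is the master
         ELSE COALESCE(ca.gen, 'n/a')
    END                   AS gender,
    ca.bdate              AS birthdate,
    ci.cst_create_date    AS create_date
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_cust_az12 ca ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101  la ON ci.cst_key = la.cid;
```

Because it is a view, the integration logic runs on every query and the gold layer stays virtual, exactly as decided for this layer.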
If the end date is NULL, the record is current. Take this example: three records for the same product key; the first two have a value in the end date because they are historical, but the last one has NULL because it is the current, still-open record. So selecting only the current data is very simple: WHERE the product end date IS NULL. Execute it and you get only the current products, with no history. Of course we add a comment, "filter out all historical data", and we drop the end date from the selection, because it is now always NULL. With that we have only the current data.

Next we join the product categories from the ERP using the category ID. As usual the CRM is the master and everything else is secondary, which is why I use a LEFT JOIN: if there is no match, I don't want to lose or filter out any data. So LEFT JOIN the silver ERP category table, call it pc, and join the CRM's category ID equal to pc's ID. From the second table we pick the category (very important), the subcategory, and the maintenance column. Querying, those columns come from the first table and these three from the second, so we have collected all the product information from the two source systems.

The next step is to check the quality of this result, and the most important check is uniqueness. I want to make sure the product key is unique, because we'll use it later to join to the sales. So: GROUP BY the product key, HAVING COUNT(*) > 1. Perfect, no duplicates: the second table didn't cause duplicates in our join, and since there is no historical data, each product is exactly one record. I'm really happy about that. Next question: do we have anything to integrate, the same information twice? We don't. So the next step is grouping the related columns: the product ID, product key, and product name go together; then all the category information together, the category ID, the category itself, the subcategory, with maintenance right after the subcategory; and the product cost, the line, and the start date can stay at the end. Checking the result, those look good together and I'm happy with it.

Then we give friendly names: product_id; product_number (we reserve "key" for the surrogate key later); product_name; category_id; category; subcategory; maintenance stays as it is; then the cost, the product line, and finally the start date. Executing it, the output shows friendly column names that describe themselves; I don't even have to explain them. The next big decision: is this a fact or a dimension? What do you think? Again we have a lot of descriptions of the business object "product"; we don't have transactions, events, or piles of keys and measures, so this is not a fact. Each row describes exactly one product, so it is a dimension. Since it's a dimension, we create a surrogate key for it, and as we did for the customers we use the window function ROW_NUMBER; this time I'll order by the start date and the product key, and name it product_key. Executing it, we have now generated a primary key for each product, which we'll use to connect our data model. Next we build the view: CREATE VIEW gold.dim_products AS. After creating it and refreshing the views, you can see our second object, the products dimension in the gold layer, and as usual we have a look at the view to make sure everything is fine. The data looks good, so we now have two dimensions. All right friends, we have covered the customers and the products, and we are left with only one table: the sales transactions.
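Condensed into one statement, the products dimension might look like this; again, the silver-layer names are my assumptions:

```sql
CREATE VIEW gold.dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY pn.prd_start_dt, pn.prd_key) AS product_key,  -- surrogate key
    pn.prd_id       AS product_id,
    pn.prd_key      AS product_number,
    pn.prd_nm       AS product_name,
    pn.cat_id       AS category_id,
    pc.cat          AS category,
    pc.subcat       AS subcategory,
    pc.maintenance,
    pn.prd_cost     AS cost,
    pn.prd_line     AS product_line,
    pn.prd_start_dt AS start_date
FROM silver.crm_prd_info pn
LEFT JOIN silver.erp_px_cat_g1v2 pc ON pn.cat_id = pc.id  -- category lookup
WHERE pn.prd_end_dt IS NULL;  -- filter out all historical data
```

Note the two techniques from above in one place: the NULL end date keeps only current product versions, and the LEFT JOIN enriches them with categories without dropping unmatched products.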
For the sales information we only have data from the CRM, nothing from the ERP, so let's build it. We have a single table, so there is no integration to do, and we have to answer the big question: is this a dimension or a fact? Looking at the details we see transactions, we see events, lots of date columns, lots of measures and metrics, and many IDs connecting multiple dimensions; this is exactly the perfect setup for a fact, so we'll treat this object as a fact. As we learned, a fact connects multiple dimensions, so we have to present in this fact the surrogate keys that come from the dimensions. The product key and customer ID currently in this table come from the source systems, but we want to connect our data model using the surrogate keys, so we replace those two columns with the surrogate keys we generated. To do that we join the two dimensions just to fetch one column each, and of course we call this process a data lookup. We use LEFT JOINs so we don't lose any transactions. Note that the surrogate keys don't exist in the silver layer, only in the gold layer, so for the fact table we join the silver layer together with the gold layer. First the products: gold.dim_products, aliased pr, joined on the sales product key against the product_number from the dimension; the only column we need from the dimension is the surrogate product_key. Then I remove the original product key from the selection, because we don't need the source-system key; we need the surrogate key we generated in this data warehouse. The same for the customer: join gold.dim_customers, doing a lookup on the sales customer ID against customer_id, take the surrogate customer_key, and delete the source ID. Executing it, our fact table now carries the two surrogate keys from the dimensions, and this is what lets us connect the facts with the dimensions. This is an essential step when building a fact table: you have to put the dimensions' surrogate keys into the fact. That was actually the hardest part of building the fact.

The rest is giving friendly names: order_number; the surrogate keys are already friendly; then order_date, shipping_date, and due_date; and for the measures, sales_amount, quantity, and finally price. Executing it, the columns look very friendly. For the column order in a fact table we use the following schema: first all the surrogate keys from the dimensions, then all the dates, and at the end all the measures and metrics. That's it for the query; now we build it: CREATE VIEW gold.fact_sales AS, using the fact_ prefix this time. Creating it, we can see the fact, so we now have three objects in the gold layer: two dimensions and one fact. Next, as always, we check the quality of the view with a simple SELECT from fact_sales; the result is exactly like the result from the query, and everything looks nice.

One more trick I usually do after building a fact: try to connect the whole data model to find any issues. We do a simple LEFT JOIN from the fact to gold.dim_customers using the surrogate keys, and then say WHERE customer_key IS NULL, meaning there is no match. Executing it returns nothing, so everything matches perfectly. We do the same with the products: LEFT JOIN gold.dim_products on the product key and check where the dimension's product key is NULL; again we get nothing, and that's exactly right. So with that we have SQL code that is tested and that creates the gold layer. In the next step, as you know from our requirements, we have to create clear documentation so the end users can use our data model, so let's draw the data model of the star schema. In the drawing tool, search for a table shape and pick one where you can mark what is the primary key and what is the foreign key.
I'll change the design a little: rounded corners, a new color, font size 16 for the title, and the same size for the columns, plus a bit more spacing. Zooming in, the first table is gold.dim_customers; make it a bit bigger, define the primary key, which is the customer_key, and then list all the columns of the dimension. It's a little tedious, but the result will be awesome: the customer_id, the customer_number, the first_name, and so on; to add a new row, hold Ctrl and press Enter. Now pause the video and create the two dimensions, customers and products, with all the columns you built in the views.

Welcome back. With those two dimensions done, the third table is the fact. For the fact table I'll use a different color, for example blue, place it in the middle, and call it gold.fact_sales. Here we have no primary key, so delete that marker, and add all the fact's columns: order_number, product_key, customer_key, and the rest. Then we add the foreign-key information: the product_key is a foreign key to the products, so FK1, and the customer_key is the foreign key to the customers, so FK2; and of course you can increase the spacing.

After the tables, the next step in data modeling is to describe the relationships between them. This is very important for reporting and analytics, so users understand how to use the data model. There are different relationship types: one-to-one, one-to-many, and so on. In a star schema, the relationship between a dimension and the fact is one-to-many, because the customers table has exactly one record describing a specific customer, but that customer may appear in many fact records, since customers can order multiple times. That's why it is "many" on the fact side and "one" on the dimension side. To draw these relationships, open the menu on the left under entity relations, where you find different arrow types: zero-to-many, one-to-many, one-to-one, and more. We pick the "one (mandatory) to many (optional)" arrow: the customer must exist in the dimension, but on the fact side there are three scenarios, the customer ordered nothing, ordered once, or ordered many times, which is why the fact side is optional. Connect the "one" end to the customer dimension and the "many" end to the fact, and do the same for the products: the "many" end to the fact and the "one" end to the products. Each time you connect a new dimension to the fact table, it is usually a one-to-many relationship.

You can add anything else you want to this model, for example a text note explaining a complicated calculation. Here we can add "sales calculation", make the text a bit smaller, say 18, and write the formula: sales = quantity x price. We can even draw an arrow linking the note to the column, so the business rule or calculation is explained right next to it. Add any descriptions that make the model clear for anyone using it; with that you don't just have three tables in a database, you also have documentation and explanation. At a glance anyone can see how the data model is built and how to connect the tables; it is really amazing for all users of your data model.

All right, with that we have a really nice data model, and in the next step we'll quickly create a data catalog. With the data model done, we can say we have something called a data product, and we will be sharing this data product with different types of users. There is something every data product absolutely needs, and that is the data catalog: a document that describes everything about your data model, the tables, the columns, and maybe the relationships between the tables as well. It makes your data product clear for everyone, so it's much easier for them to derive insights and reports from it, and most importantly it saves time. If you don't do it, every consumer of your data product will keep asking you the same questions: what do you mean by this column, what is this table, how do I connect table A with table B, and you will keep repeating yourself and explaining things. Instead, you prepare a data catalog and a data model and deliver everything together, saving a lot of time and stress. I know creating a data catalog is annoying, but it is an investment and a best practice, so let's create one.

I've created a new file called data catalog in the documents folder, and the structure is very straightforward: we make a section for each table in the gold layer. For example, for the table dim_customers, you first describe the table: it stores details about the customers with demographic and geographic data. Then you list all the columns inside the table, maybe with their data types, and, most importantly, a short description of each column, for example "the gender of the customer". One best practice when describing a column is to give examples, because an example reveals the purpose of a column immediately: seeing "Male", "Female", "n/a" tells the consumer of your table it will be a full friendly value, not an "M" or an "F", without them having to query the content of the table. With that we have a full description of all the columns of our dimension. We do the same for the products and for the fact: a description for the table and a description for each column. With that you have a data catalog for your data product at the gold layer, and the business users and data analysts have a better and clearer understanding of the content of your gold layer.
understanding of the content of your gold layer all right my friends so that’s all for the data catalog in The Next Step we’re going to go back to Dro where we’re going to finalize the data flow diagram so let’s go okay so now we’re going to go and extend our data flow diagram but this time for the gold layer so now let’s go and copy the whole thing from the silver layer and put it over here side by side and of course we’re going to go and change the coloring to the gold and now we’re going to go and rename stuff so this is the gold layer but now of course we cannot leave those tables like this we have completely new data model so what do we have over here we have the fact sales we have dimension customers and as well we have Dimension products so now what I’m going to do I’m going to go and remove all those stuff we have only three tables and let’s go and put those three tables somewhere here in the center so now what you have to do is to go and start connecting those stuff I’m going to go with this Arrow over here direct connection and start connecting stuff so the sales details goes to the fact table maybe put the fact table over here and then we have the dimension customer this comes from the CRM customer our info and we have two tables from the Erp it comes from this table as well and the location from the Erp now the same thing goes for the products it comes from the product info and comes from the categories from the Erp now as you can see here we have cross arrows so what we going to do we can go and select everything and we can say line jumps with a gap and this makes it a little bit like Pitter individual for the arrows so now for example if someone asks you where the data come from for the dimension products you can open this diagram and tell them okay this comes from the silver layer we have like two tables the product info from the CRM and as well the categories from the Erp and those server tables comes from the pron layer and you can see the product 
info comes from the CRM and the categories come from the ERP. So it is very simple: we have just created a full data lineage for our data warehouse, from the sources into the different layers of the warehouse. Data lineage is really amazing documentation that is going to help not only your users but the developers as well. All right, so with that we have a very nice data flow diagram and a data lineage.

So we have completed the data flow. It really feels like progress, like an achievement, as we are clicking through all those tasks, and now we come to the last task in building the data warehouse, where we're going to commit our work to the Git repo. Okay, let's put our scripts in the project. We go to the scripts folder; we have bronze and silver here, but we don't have gold, so let's create a new file: gold/ddl_gold.sql. Now we paste our views, so we have our three views here, and as usual, at the start we describe the purpose: "Create Gold Views. This script creates views for the gold layer, which represents the final dimension and fact tables (the star schema). Each view performs transformations and combines data from the silver layer to produce business-ready datasets, and those views can be used for analytics and reporting." That's it, let's commit it.

Okay, so with that, as you can see, we have the bronze and the silver, so we have all our ETL scripts in the repository, and now the gold layer as well. Next we're going to add all the quality checks that we used to validate the dimensions and facts. We go to the tests folder and create a new file; it is going to be "quality_checks_gold", and the file type is SQL. Now let's paste our quality checks: we have the check for the fact table and the two dimensions, as well as an explanation of the script. We are validating the integrity and the accuracy of the gold layer; here we are checking the uniqueness of the surrogate keys and whether we are able to connect the data model. Let's put that in our Git as well and commit the changes, and in case we come up with new quality checks, we add them to this script. Those checks are really important: if you are modifying the ETLs, or you want to make sure that everything is fine after each ETL run, these scripts should run. They are like a quality gate for the gold layer. Perfect, so now we have our code in our repository.

Okay friends, now what you have to do is finalize the Git repo. For example, all the documentation that we created during the project can be uploaded in the docs folder; you can see here the data architecture, the data flow, the data integration, the data model, and so on. With that, each time you edit those pages you can commit your work, and you have versioning for them. Another thing you can do is go to the README. For example, over here I have added the project overview, some important links, the data architecture, and a little description of the architecture, of course. And don't forget to add a few words about yourself and links to your profiles on the different social media platforms.

All right my friends, with that we have completed our work and closed the last epic, building the gold layer, and we have completed all the phases of building a data warehouse. Everything is 100%, and this feels really nice. If you're still here and you have built the data warehouse with me, then I can say I'm really proud of you. You have built something really complex and amazing, because building a data warehouse is usually a very complex data project. You have not only learned SQL, but also how we do complex data projects in the real world. With that you have real knowledge as well as an amazing portfolio that you can share with others, whether you are applying for a job or showcasing that you have learned something new. You have also experienced the different roles in the project, and what data architects and data engineers do in complex data projects. It was really an amazing journey, even for me as I was creating it.

And with that, you have done the first type of data analytics project using SQL: data warehousing. In the next step we're going to do another type of project, exploratory data analysis (EDA), where we're going to understand and explore our datasets. If you like this video and you want me to create more content like this, I would really appreciate it if you support the channel by subscribing, liking, sharing, and commenting; all of that helps the channel with the YouTube algorithm and helps my content reach others. Thank you so much for watching, and I will see you in the next tutorial. Bye!
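To make the business rule from the data-model annotation earlier in this section concrete (sales = quantity multiplied by price), here is a minimal SQL sketch. The table and column names are illustrative assumptions, not the project's exact schema:

```sql
-- Hypothetical sketch: deriving the sales measure from the documented
-- business rule. Table and column names are assumptions for illustration.
SELECT
    s.order_number,
    s.quantity,
    s.price,
    s.quantity * s.price AS sales  -- business rule: sales = quantity * price
FROM gold.fact_sales AS s;
```

Keeping the derivation in one place (the view that feeds the fact table) means every consumer sees the same number, which is exactly why the rule is worth documenting in the data model.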
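The tutorial keeps the data catalog in a markdown document, which is the portable option. As a complementary sketch only, engines such as PostgreSQL can also store the same descriptions inside the database itself with COMMENT; the table and column names below are illustrative assumptions:

```sql
-- Complementary approach (PostgreSQL-style syntax): store catalog
-- descriptions in the database. Names are illustrative assumptions.
COMMENT ON TABLE gold.dim_customers IS
    'Stores customer details with demographic and geographic data.';
COMMENT ON COLUMN gold.dim_customers.gender IS
    'Gender of the customer. Examples: ''Male'', ''Female'', ''n/a''.';
```

The advantage is that many database tools surface these comments directly next to the schema, so consumers see the description without opening a separate document.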
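The ddl_gold.sql script described above creates the gold-layer views. A minimal sketch of its shape, with the header comment the tutorial recommends; the silver-layer table and column names are assumptions, not the exact project schema:

```sql
/*
==============================================================
Create Gold Views
Purpose: create views for the gold layer, the final dimension
and fact tables (star schema). Each view transforms and
combines data from the silver layer to produce business-ready
datasets for analytics and reporting.
==============================================================
*/
-- Illustrative sketch only; source names are assumptions.
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.customer_id) AS customer_key,  -- surrogate key
    ci.customer_id,
    ci.first_name,
    ci.last_name
FROM silver.crm_customer_info AS ci;
```

Generating the surrogate key inside the view keeps the gold layer independent of source-system identifiers, which is what makes the star schema stable for reporting.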
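The two gold-layer quality checks mentioned above (surrogate-key uniqueness and data-model connectivity) can be sketched as queries that are expected to return zero rows; the table and key names are illustrative assumptions:

```sql
-- Quality check 1: surrogate keys in a dimension must be unique.
-- Expectation: no rows returned.
SELECT customer_key, COUNT(*) AS duplicates
FROM gold.dim_customers
GROUP BY customer_key
HAVING COUNT(*) > 1;

-- Quality check 2: every fact row should connect to its dimensions.
-- Expectation: no rows returned (no orphaned fact records).
SELECT f.*
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_customers AS c
    ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
```

Because "no rows returned" means "check passed", these scripts work well as the quality gate described in the tutorial: run them after each ETL load and investigate any rows they produce.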

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog