Category: Data Analyst

  • Beginning Oracle Database 12c Administration

    Beginning Oracle Database 12c Administration

    This book, “Beginning Oracle Database 12c Administration, 2nd Edition,” is a comprehensive guide to Oracle database administration. It covers fundamental database concepts, SQL and PL/SQL, Oracle architecture, and essential administrative tasks such as user management, data loading, backups, and recovery. The text also emphasizes practical work practices and problem-solving methodologies, including the importance of proper planning and licensing. Finally, it highlights the broader IT context of database administration, emphasizing communication and the role of the DBA within an organization.

    Oracle Database Administration Study Guide

    SQL and PL/SQL

    Subqueries

    A subquery is a SELECT statement that is embedded within another DML statement (SELECT, INSERT, UPDATE, or DELETE) or within another subquery. Subqueries are always enclosed in parentheses and can return a single value, a single row, or multiple rows of data.

    There are three main types of subqueries (each is sketched in the example after this list):

    1. Inline view: This type of subquery appears in the FROM clause of a SELECT statement. It acts like a temporary table, allowing you to select from the results of the subquery.
    2. Scalar subquery: This type of subquery returns exactly one data item from one row. It can be used wherever a single value is expected, such as in a SELECT list, a WHERE clause, or a HAVING clause.
    3. Correlated subquery: This type of subquery depends on the outer query for its values. It is executed repeatedly, once for each row processed by the outer query.
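
    The sketch below shows one hedged example of each subquery type; it assumes the sample HR schema (EMPLOYEES and DEPARTMENTS tables) and is illustrative rather than definitive.

    ```sql
    -- Inline view: the subquery in the FROM clause acts like a temporary table
    SELECT d.department_name, e.emp_count
    FROM   departments d,
           (SELECT department_id, COUNT(*) AS emp_count
            FROM   employees
            GROUP BY department_id) e
    WHERE  d.department_id = e.department_id;

    -- Scalar subquery: returns exactly one value wherever a single value is expected
    SELECT last_name,
           salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
    FROM   employees;

    -- Correlated subquery: re-evaluated once for each row of the outer query
    SELECT last_name, salary
    FROM   employees e
    WHERE  salary > (SELECT AVG(salary)
                     FROM   employees
                     WHERE  department_id = e.department_id);
    ```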

    Types of SQL

    SQL is a powerful language for managing and manipulating relational databases. It is divided into two main categories (a combined example follows the list):

    1. Data Manipulation Language (DML): Used to retrieve, insert, update, and delete data in a database.
    • SELECT: Retrieves data from one or more tables
    • INSERT: Adds new rows into a table
    • UPDATE: Modifies existing data in a table
    • MERGE: Combines INSERT and UPDATE operations based on a condition
    • DELETE: Removes rows from a table
    2. Data Definition Language (DDL): Used to define the structure of the database, including creating, altering, and dropping database objects like tables, views, indexes, and users.
    • CREATE: Creates a new database object
    • ALTER: Modifies the structure of an existing object
    • DROP: Removes an existing object
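
    A combined sketch of how the DDL and DML statements listed above fit together; the PROJECTS table and its columns are hypothetical:

    ```sql
    -- DDL: define and alter the structure
    CREATE TABLE projects (
      project_id   NUMBER PRIMARY KEY,
      project_name VARCHAR2(100)
    );
    ALTER TABLE projects ADD (start_date DATE);

    -- DML: manipulate the rows
    INSERT INTO projects (project_id, project_name) VALUES (1, 'Data Migration');
    UPDATE projects SET start_date = SYSDATE WHERE project_id = 1;
    DELETE FROM projects WHERE project_id = 1;

    -- DDL: remove the object
    DROP TABLE projects;
    ```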

    Railroad Diagrams

    Oracle uses railroad diagrams to illustrate the syntax of SQL commands. These diagrams provide a visual representation of the different clauses and options available for each command, showing both mandatory and optional elements.

    Database Architecture

    Data Files

    Data files are the physical files that store the actual data of an Oracle database. They are organized into logical units called tablespaces.

    Key points about data files (a query sketch follows the list):

    • Each data file belongs to one tablespace.
    • Data files are typically named with a descriptive name and a .dbf or .ora extension.
    • Space within data files is divided into data blocks, also called pages.
    • Each data block contains data from only one table.
    • A contiguous range of data blocks allocated to a table is called an extent.
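
    As a hedged illustration, the data dictionary view DBA_EXTENTS can be queried to see how blocks and extents have been allocated to a schema's segments; the HR owner below is an assumption:

    ```sql
    SELECT segment_name, extent_id, blocks, bytes
    FROM   dba_extents
    WHERE  owner = 'HR'
    ORDER  BY segment_name, extent_id;
    ```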

    Server Processes

    Oracle uses server processes to manage connections and execute user requests. There are two main types of server architectures (a configuration sketch follows the list):

    1. Dedicated Server Architecture: A dedicated server process is created for each user connection. This process handles all requests from the connected user.
    2. Multithreaded Server (MTS) Architecture: A pool of shared server processes is used to handle user connections. Dispatcher processes route user requests to available shared servers. MTS is less commonly used than the dedicated server architecture.
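
    A minimal sketch, assuming ALTER SYSTEM privileges and illustrative pool sizes, of enabling a small shared-server pool and checking which model each session uses:

    ```sql
    -- Assumed, illustrative values; size the pool for your own workload
    ALTER SYSTEM SET shared_servers = 5;
    ALTER SYSTEM SET dispatchers = '(PROTOCOL=TCP)(DISPATCHERS=2)';

    -- SERVER shows DEDICATED, SHARED, or NONE for each connected session
    SELECT username, server
    FROM   v$session
    WHERE  username IS NOT NULL;
    ```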

    Software Installation

    The software installation process involves setting up the operating system environment, installing the Oracle software, and configuring the listener.

    Key considerations:

    • Setting up appropriate user accounts and permissions
    • Configuring the network listener to allow client connections
    • Setting up firewalls to secure the database server

    Database Creation

    The Database Configuration Assistant (DBCA) is a graphical tool that simplifies the process of creating and configuring an Oracle database.

    Key parameters (a query to inspect them follows the list):

    • db_block_size: Specifies the size of data blocks
    • db_name: Defines the name of the database
    • db_recovery_file_dest: Sets the location for recovery files
    • memory_target: Sets the total amount of memory allocated to the SGA and PGA
    • processes: Defines the maximum number of processes that can connect to the database
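
    One way to review these parameters on an existing database is to query V$PARAMETER (or use the SQL*Plus SHOW PARAMETER command); a small sketch:

    ```sql
    SELECT name, value
    FROM   v$parameter
    WHERE  name IN ('db_block_size', 'db_name', 'db_recovery_file_dest',
                    'memory_target', 'processes');
    ```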

    Physical Database Design

    Physical database design focuses on the efficient storage and retrieval of data within the database.

    Partitioning

    Partitioning is a technique for dividing large tables and indexes into smaller, more manageable pieces called partitions.

    Types of partitioning (a range-partitioning example follows the list):

    • List partitioning: Divides data based on a list of discrete values.
    • Range partitioning: Divides data based on ranges of values.
    • Interval partitioning: Automatically creates new partitions based on specified intervals.
    • Hash partitioning: Distributes data randomly across partitions using a hashing function.
    • Reference partitioning: Partitions a child table based on the partitioning scheme of its parent table.
    • Composite partitioning: Combines different partitioning methods to create subpartitions within a partition.
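
    A hedged sketch of range partitioning on a hypothetical SALES_HISTORY table; the table name, columns, and partition boundaries are illustrative only:

    ```sql
    CREATE TABLE sales_history (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER(10,2)
    )
    PARTITION BY RANGE (sale_date) (
      PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
      PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01'),
      PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );
    ```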

    Partition Views

    Partition views combine data from multiple partitioned tables to present a unified view of the data to the user. They provide transparency to the user, hiding the underlying partitioning scheme.
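
    A minimal sketch of the idea, assuming separate yearly tables with identical structure that the view presents as a single result set:

    ```sql
    CREATE OR REPLACE VIEW sales_all AS
      SELECT * FROM sales_2023
      UNION ALL
      SELECT * FROM sales_2024;
    ```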

    User Management and Data Loading

    User Management

    Key commands for managing user accounts (an example follows the list):

    • CREATE USER: Creates a new user account in the database.
    • ALTER USER: Modifies an existing user account, such as changing passwords, assigning quotas, or setting default and temporary tablespaces.
    • DROP USER: Removes a user account from the database.
    • GRANT: Assigns privileges to a user, allowing them to perform specific actions in the database.
    • REVOKE: Removes privileges from a user.
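
    A hedged example tying these commands together. The user name ifernand and the HR schema appear elsewhere in this material; the password, quota, and specific privileges are illustrative assumptions:

    ```sql
    CREATE USER ifernand IDENTIFIED BY "ChangeMe#1"
      DEFAULT TABLESPACE users
      TEMPORARY TABLESPACE temp
      QUOTA 100M ON users;

    GRANT CREATE SESSION, CREATE TABLE TO ifernand;
    GRANT SELECT ON hr.employees TO ifernand;

    -- Privileges can later be withdrawn
    REVOKE CREATE TABLE FROM ifernand;
    ```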

    Data Loading

    Key methods for loading data into an Oracle database (a directory-setup sketch follows the list):

    • Data Pump: A high-speed utility for exporting and importing data. The expdp and impdp commands provide a wide range of options for controlling the data loading process.
    • Export/Import: An older utility for data loading. The exp and imp commands are still available but are less efficient than Data Pump.
    • SQL*Loader: A command-line utility for loading data from external files. It uses a control file to define the format of the input data and map it to the database columns.
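
    Data Pump reads and writes dump files through a server-side directory object. A small sketch of the one-time setup; the path is an assumption:

    ```sql
    CREATE DIRECTORY dpump_dir AS '/u01/app/oracle/dumps';
    GRANT READ, WRITE ON DIRECTORY dpump_dir TO hr;
    -- expdp/impdp can then reference it with DIRECTORY=dpump_dir
    ```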

    Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What are the three main types of subqueries, and how do they differ?
    2. Explain the difference between DML and DDL and provide examples of each.
    3. How do railroad diagrams help in understanding SQL syntax?
    4. What are data blocks and extents in the context of data files?
    5. Compare and contrast the dedicated server and multithreaded server architectures.
    6. What are some key considerations during the software installation process for Oracle Database?
    7. Explain the concept of database partitioning and list at least three different partitioning methods.
    8. What is the purpose of a partition view?
    9. Describe the steps involved in creating a new user account and granting them privileges to access database objects.
    10. List and briefly explain three different methods for loading data into an Oracle database.

    Answer Key

    1. The three main types of subqueries are inline views, scalar subqueries, and correlated subqueries. Inline views act like temporary tables in the FROM clause, scalar subqueries return a single value, and correlated subqueries depend on the outer query for their values.
    2. DML (Data Manipulation Language) is used for manipulating data within a database, while DDL (Data Definition Language) is used for defining the database structure. Examples of DML include SELECT, INSERT, UPDATE, and DELETE, while examples of DDL include CREATE, ALTER, and DROP.
    3. Railroad diagrams provide a visual representation of the syntax of SQL commands, showing both mandatory and optional elements. They help to understand the order and relationships between different clauses and options.
    4. Data blocks (also called pages) are the units of storage within data files, with a fixed size. Extents are contiguous ranges of data blocks allocated to a specific table.
    5. A dedicated server architecture assigns a separate process to each user connection, while a multithreaded server (MTS) architecture uses a pool of shared server processes to handle multiple connections. MTS can be more efficient for handling many concurrent connections but is less commonly used than the dedicated server architecture.
    6. Key considerations during Oracle Database software installation include setting up appropriate user accounts and permissions, configuring the network listener, and setting up firewalls. These steps ensure security and allow clients to connect to the database server.
    7. Database partitioning involves dividing large tables and indexes into smaller pieces called partitions. This improves manageability and performance. Different partitioning methods include list partitioning (based on discrete values), range partitioning (based on value ranges), and hash partitioning (based on a hashing function).
    8. A partition view combines data from multiple partitioned tables into a single logical view. This allows users to query the data transparently without needing to know about the underlying partitioning scheme.
    9. To create a new user account, use the CREATE USER command, specifying a username and password. Use the GRANT command to assign privileges to the user, allowing them to perform actions like creating tables, selecting data, or modifying data.
    10. Three methods for loading data into Oracle Database are Data Pump (using expdp and impdp commands), Export/Import (using exp and imp commands), and SQL*Loader (using a control file to define the data format). Data Pump is the most efficient method for large datasets.

    Essay Questions

    1. Discuss the advantages and disadvantages of using different partitioning methods in Oracle Database. Provide real-world scenarios where each method would be most appropriate.
    2. Explain the concept of read consistency in Oracle Database. How is it achieved, and what are its benefits and limitations?
    3. Describe the different types of database backups available in Oracle Database. Discuss best practices for implementing a comprehensive backup and recovery strategy.
    4. Explain the importance of database monitoring and performance tuning. Describe the tools and techniques available in Oracle Database for monitoring performance and identifying bottlenecks.
    5. Discuss the role of the Oracle Data Dictionary in database administration. How can the Data Dictionary be used to obtain information about database objects, users, and privileges?

    Glossary of Key Terms

    • Data Block: The fundamental unit of storage within an Oracle data file, with a fixed size. Also called a page.
    • Extent: A contiguous range of data blocks allocated to a table or index.
    • Tablespace: A logical grouping of data files. Tablespaces help to organize and manage database storage.
    • Dedicated Server Process: A server process dedicated to handling requests from a single user connection.
    • Multithreaded Server (MTS): A server architecture that uses a pool of shared server processes to handle multiple user connections.
    • Partitioning: A technique for dividing large tables and indexes into smaller, more manageable pieces called partitions.
    • Partition View: A logical view that combines data from multiple partitioned tables, providing a unified view of the data.
    • Data Pump: A high-speed utility for exporting and importing data in Oracle Database.
    • SQL*Loader: A command-line utility for loading data into Oracle Database from external files.
    • Read Consistency: A feature of Oracle Database that ensures that all data read during a transaction is consistent with the state of the database when the transaction started.
    • Data Dictionary: A collection of metadata tables and views that store information about the structure and contents of an Oracle database.
    • System Global Area (SGA): A shared memory area used by all Oracle processes to store database data and control information.
    • Program Global Area (PGA): A private memory area allocated to each Oracle server process for its own use.
    • SQL Tuning Advisor: A tool that analyzes SQL statements and recommends changes to improve their performance.
    • Automatic Workload Repository (AWR): A repository that stores historical performance data about an Oracle database.
    • Statspack: An older tool that collects and reports performance statistics for Oracle databases.
    • Wait Interface: A set of dynamic performance views that provide information about the wait events experienced by Oracle processes.

    Briefing Document: Oracle Database 12c Administration

    This document reviews key themes and insights from excerpts of “Beginning Oracle Database 12c Administration, 2nd Edition,” focusing on database architecture, administration, maintenance, and tuning.

    I. Database Architecture

    • Data Storage: Oracle databases utilize data files organized into tablespaces. Data within these files is structured into equal-sized data blocks, typically 8KB. An extent is a contiguous range of data blocks allocated to a table when it requires more space.
    • “The space within data files is organized into data blocks (sometimes called pages) of equal size… Each block contains data from just one table… When a table needs more space, it grabs a contiguous range of data blocks called an extent” (Chapter 2).
    • Server Processes: Oracle employs a dedicated server process for each user connection. This process handles tasks like permission checks, query plan generation, and data retrieval.
    • “A dedicated server process is typically started whenever a user connects to the database—it performs all the work requested by the user” (Chapter 2).
    • Memory Structures: The System Global Area (SGA) is a shared memory region crucial for database operations. It includes the database buffer cache for storing frequently accessed data blocks, the redo log buffer for transaction logging, and the shared pool for storing parsed SQL statements and execution plans.
    • Background Processes: Essential for database functionality, background processes include:
    • DBWn (Database Writer): Writes modified data blocks from the buffer cache to data files.
    • LGWR (Log Writer): Writes redo log entries from the redo log buffer to redo log files.
    • CKPT (Checkpoint): Synchronizes data files and control files with the database’s current state.
    • SMON (System Monitor): Performs instance recovery after a system crash and coalesces free space in tablespaces.

    II. Database Administration

    • SQL Language: Oracle utilizes SQL for both data manipulation (DML) and data definition (DDL). Railroad diagrams, often recursive, are used to explain the syntax and structure of SQL statements. Subqueries, particularly inline views and scalar subqueries, play significant roles in complex queries.
    • User Management: The CREATE USER statement creates new users, defining their authentication, default and temporary tablespaces, and initial profile. ALTER USER modifies user attributes like passwords and tablespace quotas. GRANT and REVOKE commands control access privileges on database objects.
    • “The CREATE USER statement should typically specify a value for DEFAULT TABLESPACE… and TEMPORARY TABLESPACE” (Chapter 8).
    • Data Loading: Oracle provides several methods for importing data:
    • SQL*Loader: A powerful utility for loading data from external files.
    • Data Pump Export (expdp) and Import (impdp): Introduced in Oracle 10g, these utilities offer features like parallelism, compression, and encryption for efficient data transfer.

    III. Physical Database Design

    • Partitioning: A technique for dividing large tables into smaller, manageable pieces. Different partitioning strategies include range, list, hash, composite, and reference partitioning. Partitioning enhances query performance, backup and recovery, and data management.
    • Indexes: Data structures that speed up data retrieval. B*tree indexes are commonly used in OLTP environments, while bitmap indexes are suitable for data warehousing.
    • “Most indexes are of the B*tree (balanced tree) type and are best suited for online transaction-processing environments” (Chapter 17).
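
    As a hedged illustration of the two index types (the table and column names are assumptions, not examples from the book):

    ```sql
    -- B*tree index: selective lookups in OLTP systems
    CREATE INDEX emp_last_name_ix ON employees (last_name);

    -- Bitmap index: low-cardinality columns in data warehouses
    CREATE BITMAP INDEX sales_channel_bix ON sales_history (channel_id);
    ```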

    IV. Database Maintenance

    • Backups: Regular backups are vital for data protection and recovery. RMAN (Recovery Manager) is Oracle’s recommended tool for performing backups and managing backup sets. Strategies include full, incremental, and cumulative backups.
    • Recovery: Techniques for restoring a database to a consistent state after failures. Options include:
    • Data Recovery Advisor (DRA): An automated tool for diagnosing and repairing database corruption.
    • Flashback Technologies: Allow for quick recovery from logical errors or unintentional data modifications.
    • LogMiner: Enables analysis of archived redo logs to recover specific data changes.
    • Space Management: Monitoring tablespace usage and free space is crucial. Techniques like segment shrinking and coalescing free space can help optimize storage utilization.

    V. Database Tuning

    • Performance Monitoring: Tools like Statspack, AWR (Automatic Workload Repository), and dynamic performance views provide insights into database performance.
    • Statspack: Collects performance snapshots for analysis.
    • “Note that Statspack is not documented in the reference guides for Oracle Database 10g, 11g, and 12c, even though it has been upgraded for all these versions” (Chapter 16).
    • AWR: A more comprehensive and automated performance monitoring framework.
    • SQL Tuning: Identifying and optimizing inefficient SQL statements is crucial for improving overall database performance. Techniques include index creation and tuning, hint usage, and utilizing the SQL Tuning Advisor.
    • Wait Interface: Analyzing wait events helps pinpoint performance bottlenecks. Common wait events like db file sequential read and log file sync provide clues for optimization.
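
    A small sketch of reading the wait interface; the view and columns are standard, while the filtering and ranking below are one possible choice:

    ```sql
    SELECT event, total_waits, time_waited
    FROM   v$system_event
    WHERE  wait_class <> 'Idle'
    ORDER  BY time_waited DESC
    FETCH FIRST 10 ROWS ONLY;
    ```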

    VI. Key Takeaways

    • Understanding Oracle’s architectural components is fundamental for effective administration.
    • Proper planning for licensing, hardware sizing, and configuration is essential for a successful deployment.
    • Regular maintenance tasks like backups, recovery drills, and space management ensure database health and data integrity.
    • Proactive performance monitoring and SQL tuning are critical for achieving optimal database performance.
    • Utilizing Oracle’s various tools and features like RMAN, Data Pump, and the SQL Tuning Advisor simplifies administrative tasks and enhances efficiency.

    Oracle Database Administration FAQ

    What are the different types of subqueries in Oracle SQL?

    There are three main types of subqueries:

    • Inline views: These are subqueries used in the FROM clause as a table reference. They act like temporary views within a larger query.
    • Scalar subqueries: These subqueries return a single value and can be used wherever a single value is expected, such as in a SELECT list or WHERE clause.
    • Correlated subqueries: These subqueries depend on values from the outer query and are executed repeatedly for each row of the outer query.

    How is space organized within Oracle data files?

    Space in data files is structured in data blocks, also known as pages. The block size (for example, 8 KB) is fixed and is set at the tablespace level, so all data files belonging to a tablespace share it. A block holds data for a single table. To accommodate growth, tables claim a contiguous series of data blocks, forming an extent.

    What are the main types of server processes in Oracle?

    Oracle primarily uses two types of server processes:

    • Dedicated server processes: A dedicated server process handles requests for a single user connection. This is the typical model.
    • Shared server processes (Multithreaded Server – MTS): In this model, a pool of shared server processes handles requests from multiple users. This approach can be more efficient for environments with many concurrent but mostly idle connections.

    What are the different types of partitioning available in Oracle?

    Oracle offers several partitioning methods:

    • Range partitioning: Data is divided into partitions based on a range of values for a specific column, typically a date or number.
    • List partitioning: Partitions are created based on lists of discrete values for a specific column.
    • Hash partitioning: A hashing function distributes data across partitions, aiming for even data distribution.
    • Interval partitioning: This is an extension of range partitioning where new partitions are automatically created based on a defined interval.
    • Reference partitioning: This method partitions a child table based on the partitioning key of a referenced parent table.
    • Composite partitioning: This approach combines multiple partitioning methods, allowing for partitions to be further divided into subpartitions.

    How can I export and import data in Oracle?

    Oracle provides multiple utilities for data export and import:

    • Data Pump (expdp and impdp): This is the preferred method in modern Oracle versions, offering features like parallelism, compression, and encryption.
    • Original Export/Import (exp and imp): Although less commonly used now, these utilities are still available and offer various options for data export and import.
    • SQL*Loader: This utility loads data from external files into Oracle tables, using a control file to define the data format and loading rules.

    What is the purpose of the Oracle Data Dictionary?

    The Data Dictionary is a collection of metadata tables and views containing information about the structure and objects within an Oracle database. It stores details about tables, indexes, users, privileges, and other database components. It is crucial for understanding the database’s structure and troubleshooting issues.

    What are some tools for monitoring an Oracle database?

    Several tools help monitor an Oracle database:

    • Oracle Enterprise Manager: A comprehensive suite with web-based interfaces for monitoring and managing various aspects of the database.
    • Statspack: A lightweight performance monitoring tool capturing snapshots of database activity for analysis.
    • Automatic Workload Repository (AWR): Built into the database, AWR automatically collects performance data and generates reports.
    • Dynamic Performance Views: Real-time views providing detailed information about database activity.
    • Third-party tools: Tools like Toad and DBArtisan provide extensive monitoring and management features.

    What are some techniques for tuning SQL queries in Oracle?

    Effective SQL tuning involves a multi-faceted approach:

    • Understanding the Execution Plan: Analyze the query plan to identify bottlenecks and areas for optimization.
    • Using Indexes Appropriately: Create and utilize indexes effectively to speed up data retrieval.
    • Rewriting Queries for Efficiency: Optimize query structure, consider using hints, and avoid unnecessary operations.
    • Collecting Statistics: Ensure up-to-date statistics are available for the optimizer to make informed decisions.
    • Using the SQL Tuning Advisor: Employ the advisor to identify and implement potential optimizations.
    • Considering Materialized Views: Pre-calculate and store query results to improve performance for frequently used complex queries.
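
    To make the first two points concrete, here is a hedged sketch of inspecting an execution plan with EXPLAIN PLAN and DBMS_XPLAN; the query itself is illustrative and assumes the HR sample schema:

    ```sql
    EXPLAIN PLAN FOR
      SELECT e.last_name, d.department_name
      FROM   employees e
      JOIN   departments d ON e.department_id = d.department_id;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
    ```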

    Oracle 12c Database Administration

    Timeline of Events:

    This text excerpt does not present a narrative with a sequence of events. Instead, it offers technical information and instructions related to Oracle Database 12c administration. The provided content focuses on aspects like:

    • SQL fundamentals: Introduction to SQL language, different types of SQL statements (DML and DDL), and the use of railroad diagrams for understanding SQL syntax.
    • Database Structure: Explanation of data files, tablespaces, data blocks, and extents within Oracle databases.
    • Server Processes: Description of dedicated server processes and the multithreaded server model.
    • Software Installation: Instructions for software installation including setting up iptables firewall rules.
    • Database Creation: Details about setting database parameters, data files, and tablespace sizes during database creation.
    • Physical Database Design: Exploration of different partitioning techniques like list, range, interval, hash, reference, and composite partitioning for efficient data organization.
    • User Management and Data Loading: Guidance on user creation, granting and revoking privileges, managing tablespaces, and using utilities like exp/imp and expdp/impdp for data loading and export.
    • Database Support: Introduction to data dictionary views and their importance in database administration, and brief mention of third-party tools.
    • Monitoring: Overview of monitoring database activity through alert logs, checking CPU and load average, understanding listener issues, and using tools like AWR and Statspack for performance monitoring.
    • Fixing Problems: Troubleshooting scenarios related to unresponsive listeners and data corruption using tools like DRA and RMAN.
    • Database Maintenance: Tasks like archiving, auditing, backups, purging, rebuilding, statistics gathering, and user management as part of regular database maintenance.
    • SQL Tuning: Understanding the role of indexes, interpreting query execution plans, and utilizing tools like SQL Tuning Advisor for optimizing SQL statement performance.

    Therefore, it’s not feasible to create a timeline based on the provided content.

    Cast of Characters:

    This technical text excerpt doesn’t feature individual characters in a narrative sense. It primarily focuses on technical concepts and instructions related to Oracle Database 12c administration.

    However, we can identify some key entities mentioned:

    • Oracle: The company developing and providing the Oracle Database software.
    • DBA (Database Administrator): The individual responsible for managing and maintaining the Oracle database.
    • Users: Individuals accessing and utilizing the Oracle database. Specific users like “ifernand,” “hr,” and “clerical_role” are mentioned as examples in user management and data loading sections.

    Instead of character bios, we can highlight their roles:

    • Oracle: Provides the software, documentation, and support for Oracle Database.
    • DBA: Performs tasks like installation, configuration, security management, performance tuning, backup and recovery, and user management.
    • Users: Utilize the database for various purposes, depending on their assigned roles and privileges.

    This information clarifies the roles of entities involved in Oracle database administration, even though traditional character bios are not applicable in this context.

    Oracle Database Administration

    The most concrete aspect of a database is the files on the storage disks connected to the database host [1]. The location of the database software is called the Oracle home [1]. The path to that location is usually stored in the environment variable ORACLE_HOME [1]. There are two types of database software: server and client software [1]. Server software is necessary to create and manage the database and is required only on the database host [1]. **Client software is necessary to utilize the database and is required on every user’s computer. The most common example is the SQL*Plus command-line tool** [1].

    Well-known configuration files include init.ora, listener.ora, and tnsnames.ora [2]. Data files are logically grouped into tablespaces [2]. Each Oracle table or index is assigned to one tablespace and shares the space with other tables assigned to the same tablespace [2]. Data files can grow automatically if the database administrator wishes [2]. The space within data files is organized into equally sized blocks; all data files belonging to a tablespace use the same block size [2]. When a data table needs more space, it grabs a contiguous range of data blocks called an extent [2]. It is conventional to use the same extent size for all tables in a tablespace [2].

    Oracle records important events and errors in the alert log [3]. A detailed trace file is created when a severe error occurs [3]. Oracle Database administrators need to understand SQL in all its forms [4]. All database activity, including database administration activities, is transacted in SQL [4]. Oracle reference works use railroad diagrams to teach the SQL language [5]. SQL is divided into Data Manipulation Language (DML) and Data Definition Language (DDL) [5]. DML includes the SELECT, INSERT, UPDATE, MERGE, and DELETE statements [5]. DDL includes the CREATE, ALTER, and DROP statements for the different classes of objects in an Oracle database [5]. The SQL reference manual also describes commands that can be used to perform database administration activities such as stopping and starting databases [5].

    Programs written in PL/SQL can be stored in an Oracle database [6]. Using these programs has many advantages, including efficiency, control, and flexibility [6]. PL/SQL offers a full complement of structured programming mechanisms such as condition checking, loops, and subroutines [6].

    When you stop thinking in terms of command-line syntax such as create database and GUI tools such as the Database Configuration Assistant (dbca) and start thinking in terms such as:

    • security management
    • availability management
    • continuity management
    • change management
    • incident management
    • problem management
    • configuration management
    • release management
    • and capacity management,

    the business of database administration begins to make coherent sense, and you become a more effective database administrator [7]. These terms are part of the standard jargon of the IT Infrastructure Library (ITIL), a suite of best practices used by IT organizations throughout the world [7].

    Every object in a database is explicitly owned by a single owner, and the owner of an object must explicitly authorize its use by anybody else. The collection of objects owned by a user is called a schema [8, 9]. The terms user, schema, schema owner, and account are used interchangeably [8].

    A database is an information repository that must be competently administered using the principles laid out in the IT Infrastructure Library (ITIL), including:

    • security management
    • availability management
    • continuity management
    • change management
    • incident management
    • problem management
    • configuration management
    • release management
    • and capacity management [10].

    The five commands required for user management are CREATE USER, ALTER USER, DROP USER, GRANT, and REVOKE [9].

    Form-based tools also simplify the task of database administration [11]. A workman is as good as his tools [11].

    Enterprise Manager comes in two flavors: Database Express and Cloud Control. Both are web-based tools. Database Express is used to manage a single database, whereas Cloud Control is used to manage multiple databases [12]. You can accomplish most DBA tasks—from mundane tasks such as password resets and creating indexes to complex tasks such as backup and recovery—by using Enterprise Manager instead of command-line tools such as SQL*Plus [12].

    SQL Developer is primarily a tool for software developers, but database administrators will find it very useful. Common uses are examining the structure of a table and checking the execution plan for a query [13]. It can also be used to perform some typical database administration tasks such as identifying and terminating blocking sessions [13].

    Remote Diagnostic Agent (RDA) is a tool provided by Oracle Support to collect information about a database and its host system. RDA organizes the information it gathers into an HTML framework for easy viewing [13]. It is a wonderful way to document all aspects of a database system [13].

    Oracle stores database metadata—data about data—in tables, just as in the case of user data. This collection of tables is called the data dictionary. The information in the data dictionary tables is very cryptic and condensed for maximum efficiency during database operation. The data dictionary views are provided to make the information more comprehensible to the database administrator [14].
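
    A few hedged examples of querying data dictionary views; the grantee name below is an assumption:

    ```sql
    -- Tables owned by the current user
    SELECT table_name FROM user_tables;

    -- Objects in the HR schema, grouped by type
    SELECT object_type, COUNT(*) AS object_count
    FROM   dba_objects
    WHERE  owner = 'HR'
    GROUP  BY object_type;

    -- System privileges granted to a specific user
    SELECT privilege FROM dba_sys_privs WHERE grantee = 'IFERNAND';
    ```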

    The alert log contains error messages and informational messages. The location of the alert log is listed in the V$DIAG_INFO view. The name of the alert log is alert_SID.log, where SID is the name of your database instance [15]. Enterprise Manager monitors the database and sends e-mail messages when problems are detected [16]. The command AUDIT ALL enables auditing for a wide variety of actions that modify the database and objects in it, such as ALTER SYSTEM, ALTER TABLESPACE, ALTER TABLE, and ALTER INDEX [16]. The AUDIT CREATE SESSION command causes all connections and disconnections to be recorded [16]. Recovery Manager (RMAN) maintains detailed history information about backups. RMAN commands such as list backup, report need backup, and report unrecoverable can be used to review backups. Enterprise Manager can also be used to review backups [16].

    Database maintenance is required to keep the database in peak operating condition. Most aspects of database maintenance can be automated. Oracle performs some maintenance automatically: collecting statistics for the query optimizer to use [17].

    Competency in Oracle technology is only half of the challenge of being a DBA. If you had very little knowledge of Oracle technology but knew exactly “what” needed to be done, you could always find out how to do it—there is Google, and there are online manuals aplenty [18]. Too many Oracle DBAs don’t know “what” to do, and what they have when they are through is “just a mess without a clue” [18].

    Any database administration task that is done repeatedly should be codified into an SOP. Using a written SOP has many benefits, including efficiency, quality, and consistency [19].

    The free Oracle Database 12c Performance Tuning Guide offers a detailed and comprehensive treatment of performance-tuning methods [20].

    Perhaps the most complex problem in database administration is SQL tuning. The paucity of books devoted to SQL tuning is perhaps further evidence of the difficulty of the topic [21]. The only way to interact with Oracle, to retrieve data, to change data, and to administer the database is SQL [21]. Oracle itself uses SQL to perform all the work that it does behind the scenes. SQL performance is, therefore, the key to database performance; all database performance problems are really SQL performance problems, even if they express themselves as contention for resources [21].

    Relational Databases and SQL

    A relational database is a database in which the data is perceived by the user as tables, and the operators available to the user are operators that generate “new” tables from “old” ones. [1] Relational database theory was developed as an alternative to the “programmer as navigator” paradigm prevalent in pre-relational databases. [2] In these databases, records were connected using pointers. To access data, you would have to navigate to a specific record and then follow a chain of records. [2] This approach required programmers to be aware of the database’s physical structure, which made applications difficult to develop and maintain. [3]

    Relational databases address these problems by using relational algebra, a collection of operations used to combine tables. [4] These operations include:

    • Selection: Creating a new table by extracting a subset of rows from a table based on specific criteria. [5]
    • Projection: Creating a new table by extracting a subset of columns from a table. [5]
    • Union: Creating a new table by combining all rows from two tables. [5]
    • Difference: Creating a new table by extracting rows from one table that do not exist in another table. [6]
    • Join: Creating a new table by concatenating records from two tables. [6]
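
    Each of these operations has a direct SQL counterpart. A sketch assuming the HR sample schema (EMPLOYEES, DEPARTMENTS, and JOB_HISTORY tables):

    ```sql
    -- Selection: a subset of rows
    SELECT * FROM employees WHERE department_id = 50;

    -- Projection: a subset of columns
    SELECT last_name, salary FROM employees;

    -- Union and difference (MINUS in Oracle SQL)
    SELECT employee_id FROM employees
    UNION
    SELECT employee_id FROM job_history;

    SELECT employee_id FROM employees
    MINUS
    SELECT employee_id FROM job_history;

    -- Join: concatenating matching rows from two tables
    SELECT e.last_name, d.department_name
    FROM   employees e
    JOIN   departments d ON e.department_id = d.department_id;
    ```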

    One of the significant advantages of relational databases is that they allow users to interact with the data without needing to know the database’s physical structure. [3] The database management system is responsible for determining the most efficient way to execute queries. [7] This separation between the logical and physical aspects of the database is known as physical data independence. [8]

    SQL (Structured Query Language) is the standard language used to interact with relational databases. [9] SQL allows users to perform various operations, including:

    • Retrieving data.
    • Inserting, updating, and deleting data.
    • Managing database objects such as tables and indexes.

    Despite its widespread adoption, SQL has been criticized for some of its features, including the allowance of duplicate rows and the use of nullable data items. [10, 11] However, SQL remains the most widely used language for interacting with relational databases, and it is an essential skill for database administrators. [11]

    SQL and PL/SQL in Oracle Databases

    SQL (Structured Query Language) is the primary language used to interact with Oracle databases, encompassing all database activities, including administration. [1] Database administrators need to be well-versed in SQL due to its extensive capabilities and functionalities. [1] The significance of SQL is evident in the sheer volume of the Oracle Database 12c SQL Language Reference, which spans nearly 2,000 pages. [1]

    SQL offers a powerful set of features, including:

    • Data Manipulation Language (DML): This subset of SQL focuses on modifying data within the database. DML statements include SELECT, INSERT, UPDATE, MERGE, and DELETE. [2, 3]
    • Data Definition Language (DDL): DDL statements handle the creation, modification, and removal of database objects, such as tables and indexes. Common DDL statements include CREATE, ALTER, and DROP. [2, 4]

    Oracle’s reference manuals utilize railroad diagrams to illustrate the syntax and numerous optional clauses of SQL statements. [5] These diagrams provide a visual representation of the structure and flow of SQL commands. [5] A notable aspect of railroad diagrams is their ability to incorporate subdiagrams and even reference themselves recursively, adding to the complexity and power of SQL. [6]

    PL/SQL (Procedural Language/SQL) extends the capabilities of SQL by providing procedural programming constructs within the Oracle database. [7] PL/SQL empowers developers to create sophisticated programs that interact with the database, leveraging features such as:

    • Condition checking: Implementing decision-making logic within PL/SQL programs. [7]
    • Loops: Enabling repetitive execution of code blocks for efficient processing. [7]
    • Subroutines: Encapsulating reusable code segments for modularity and code organization. [7]

    One of the prominent applications of PL/SQL is the creation of triggers, which automatically execute predefined actions in response to specific database events. [7] For instance, the HR schema employs a trigger to log historical job changes whenever the job_id in the employees table is modified. [8] Triggers enhance data integrity, security, and auditing capabilities within the database. [9]
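
    A simplified sketch in the same spirit (not the trigger actually shipped with the HR schema); the JOB_CHANGE_LOG table is a hypothetical target:

    ```sql
    CREATE OR REPLACE TRIGGER log_job_change
      AFTER UPDATE OF job_id ON employees
      FOR EACH ROW
    BEGIN
      INSERT INTO job_change_log (employee_id, old_job_id, new_job_id, changed_on)
      VALUES (:OLD.employee_id, :OLD.job_id, :NEW.job_id, SYSDATE);
    END;
    /
    ```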

    Storing PL/SQL programs within the database offers several advantages, such as:

    • Enhanced efficiency: Reduced communication overhead between client and server, resulting in improved performance. [9]
    • Improved control: Streamlined enforcement of business rules through triggers. [9]
    • Increased flexibility: Empowering SQL statements with the added power and versatility of PL/SQL functions. [9]

    The combined capabilities of SQL and PL/SQL make them essential tools for Oracle database administrators, enabling them to manage data, enforce rules, and optimize database operations effectively.

    Database Backup and Recovery Strategies

    Database backups are crucial for protecting against data loss due to user error, operator error, or hardware failure. Backups are essentially snapshots of a database or a portion of a database taken at a specific point in time. If a database is damaged, these backups can be used to restore it to a functional state. Additionally, archived logs, which contain records of all transactions performed on the database, can be used in conjunction with backups to replay modifications made after the backup was created, ensuring a complete recovery. [1]

    Determining the appropriate backup strategy requires careful consideration of various factors, including the business needs, cost-effectiveness, and available resources. Several key decisions need to be made: [2]

    • Storage Medium: Backups can be stored on tape or disk. Tapes offer advantages in terms of cost and reliability, while disks provide faster access and ease of management. A common approach is to create backups on disks initially and then copy them to tapes for long-term storage. [2-4]
    • Backup Scope: Full backups capture the entire database, while partial backups focus on specific portions, such as changed data blocks or read-only tablespaces. [5]
    • Backup Level: Level 0 backups are full backups, while level 1 backups, also known as incremental backups, only include data blocks that have changed since the last level 0 backup. This approach balances backup frequency with resource consumption. [6]
    • Backup Type: Physical backups create exact copies of data blocks and files, while logical backups represent a structured copy of table data. Logical backups are generally smaller but cannot be used to restore the entire database. [7]
    • Backup Consistency: Consistent backups guarantee a point-in-time representation of the database, while inconsistent backups may contain inconsistencies due to ongoing modifications during the backup process. The use of redo logs can address inconsistencies in physical backups. [8]
    • Backup Mode: Hot backups, or online backups, allow database access and modifications during the backup operation, while cold backups, or offline backups, require the database to be unavailable. [9]
    • Backup Management: Oracle-managed backups utilize Recovery Manager (RMAN), which offers numerous advantages such as ease of use, history data storage, and advanced features like incremental backups and corruption detection. User-managed backups employ alternative methods, such as snapshot technology, which can be integrated with RMAN for enhanced capabilities. [10-12]
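
    As a hedged illustration of the hot-backup mode mentioned above, a user-managed online backup brackets the file copies with BEGIN/END BACKUP; this assumes the database runs in ARCHIVELOG mode, and the copy step itself uses operating-system or storage snapshot tools:

    ```sql
    ALTER DATABASE BEGIN BACKUP;
    -- ... copy the data files with OS or storage snapshot tools ...
    ALTER DATABASE END BACKUP;

    -- Archive the current redo so the copied files can be made consistent on restore
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    ```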

    Recovery, the process of repairing a damaged database, often follows a restore operation, which involves replacing damaged or missing files from backup copies. Different types of recovery cater to specific situations: [13, 14]

    • Full Recovery: Restoring the entire database to a functional state. [14]
    • Partial Recovery: Repairing only the affected parts of the database without impacting the availability of other parts. [14]
    • Complete Recovery: Recovering all transactions up to the latest point in time. [15]
    • Incomplete Recovery: Intentionally stopping the recovery process at a specific point in time, often used to reverse user errors. [15]
    • Traditional Recovery: Using archived redo logs to replay transactions. [16]
    • Flashback Recovery: Utilizing flashback logs to quickly unwind transactions, offering faster recovery times than traditional methods. [16]

    Data Recovery Advisor (DRA) simplifies the database repair process by automating tasks and providing recommendations. By analyzing failures and generating RMAN scripts, DRA streamlines the recovery process for DBAs. [17]

    Testing recovery procedures is crucial for ensuring their effectiveness and validating backup usability. RMAN offers the DUPLICATE DATABASE command, allowing DBAs to create a copy of the database for testing purposes without affecting the live environment. [18]

    Documenting recovery procedures in standard operating procedures (SOPs) is vital for consistent and efficient execution, especially in stressful situations. SOPs should outline the steps involved in backups, recovery, and other critical database management tasks. [18, 19]

    Database Performance Tuning: A Five-Step Approach

    Database performance tuning is a critical aspect of database administration, aimed at optimizing the database’s efficiency and responsiveness in handling workloads. Tuning involves a systematic approach to identify performance bottlenecks, analyze their root causes, and implement solutions to improve overall performance.

    One of the primary focuses of database tuning is on DB time, which represents the total time the database spends actively working on user requests. Analyzing DB time allows administrators to pinpoint areas where the database is spending excessive time and identify potential bottlenecks. The Statspack and AWR reports provide comprehensive insights into DB time distribution across various database operations, helping to isolate performance issues. [1, 2]

    A widely recognized method for database tuning is the five-step approach, encompassing: [1, 3]

    1. Define the problem: This crucial initial step involves gathering detailed information about the perceived performance issue, including specific symptoms, affected users, and any recent changes in the environment that might have contributed to the problem. Accurately defining the problem sets the foundation for effective investigation and analysis.
    2. Investigate the problem: Once the problem is clearly defined, a thorough investigation is conducted to gather relevant evidence, such as Statspack reports, workload graphs, and session traces. This step aims to delve deeper into the problem’s nature and collect data for analysis.
    3. Analyze the collected data: The evidence collected during the investigation is scrutinized to identify patterns, trends, and potential root causes of the performance issue. For example, examining the “Top 5 Timed Events” section of a Statspack report can reveal specific database operations consuming significant DB time. [4]
    4. Solve the problem: Based on the analysis, solutions are formulated to address the identified performance bottlenecks. This step may involve adjusting database configuration parameters, implementing indexing strategies, optimizing SQL queries, or considering hardware upgrades.
    5. Implement and validate the solution: The proposed solutions are implemented in the database environment, and their impact on performance is carefully monitored and validated. This step ensures the effectiveness of the implemented changes and verifies the desired performance improvements.

    Tools like Statspack and AWR play a crucial role in database performance tuning, providing rich data for analysis and insights into database behavior. These tools offer comprehensive reports, customizable queries, and historical data collection, enabling DBAs to track performance trends over time and identify areas for improvement. [1] SQL Developer, another essential tool, enables DBAs to examine table structures, check the execution plan for queries, and even pinpoint blocking sessions that may be hindering performance. [5, 6]

    Database tuning often involves addressing various factors contributing to performance issues. Some common areas of focus include:

    • I/O Performance: Optimizing disk I/O operations can significantly impact database performance. Techniques may involve using faster disks, configuring RAID arrays for optimal performance, or tuning the database buffer cache to minimize disk reads. [7]
    • Memory Management: Efficient memory allocation and utilization are essential for database performance. Tuning may involve adjusting the sizes of the shared pool, buffer cache, and other memory structures to optimize resource allocation. [8, 9]
    • SQL Tuning: SQL queries that consume excessive resources can significantly degrade performance. SQL tuning involves analyzing query execution plans, identifying inefficient operations, and optimizing queries through techniques like indexing, rewriting queries, or using hints to influence the optimizer’s choices. [10-12]
    • Contention: When multiple processes compete for the same resources, such as locks or latches, performance can suffer. Identifying and resolving contention issues may involve optimizing application logic, tuning concurrency settings, or implementing appropriate locking strategies.
    • Workload Management: Analyzing and managing the database workload can help distribute resources effectively and prioritize critical operations. Techniques may include scheduling resource-intensive tasks during off-peak hours, implementing resource limits, or using features like Oracle Resource Manager to control resource allocation.

    Monitoring database performance is an ongoing process, crucial for detecting and addressing performance degradation proactively. Tools like Enterprise Manager provide dashboards and alerts, enabling DBAs to stay informed about database health and performance metrics. By regularly reviewing performance data and identifying trends, DBAs can anticipate potential issues and take corrective actions before they impact users. [13-15]

    Effective database performance tuning requires a deep understanding of database concepts, available tools, and a systematic approach to problem-solving. By leveraging these elements, DBAs can ensure that databases operate optimally, meeting the performance demands of their users and supporting business objectives.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • SQL Full Course for Beginners (30 Hours) – From Zero to Hero

    SQL Full Course for Beginners (30 Hours) – From Zero to Hero

    YouTube Video

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Data Science and Machine Learning Foundations

    Data Science and Machine Learning Foundations

    This PDF excerpt details a machine learning foundations course. It covers core concepts like supervised and unsupervised learning, regression and classification models, and essential algorithms. The curriculum also explores practical skills, including Python programming with relevant libraries, natural language processing (NLP), and model evaluation metrics. Several case studies illustrate applying these techniques to various problems, such as house price prediction and customer segmentation. Finally, career advice is offered on navigating the data science job market and building a strong professional portfolio.

    Data Science & Machine Learning Study Guide

    Quiz

    1. How can machine learning improve crop yields for farmers? Machine learning can analyze data to optimize crop yields by monitoring soil health and making decisions about planting, fertilizing, and other practices. This can lead to increased revenue for farmers by improving the efficiency of their operations and reducing costs.
    2. Explain the purpose of the Central Limit Theorem in statistical analysis. The Central Limit Theorem states that the distribution of sample means will approximate a normal distribution as the sample size increases, regardless of the original population distribution. This allows for statistical inference about a population based on sample data.
    3. What is the primary difference between supervised and unsupervised learning? In supervised learning, a model is trained using labeled data to predict outcomes. In unsupervised learning, a model is trained on unlabeled data to find patterns or clusters within the data without a specific target variable.
    4. Name three popular supervised learning algorithms. Three popular supervised learning algorithms are K-Nearest Neighbors (KNN), Decision Trees, and Random Forest. These algorithms are used for both classification and regression tasks.
    5. Explain the concept of “bagging” in machine learning. Bagging, short for bootstrap aggregating, involves training multiple models on different subsets of the training data, and then combining their predictions. This technique reduces variance in predictions and creates a more stable prediction model.
    6. What are two metrics used to evaluate the performance of a regression model? Two metrics used to evaluate regression models include Residual Sum of Squares (RSS) and R-squared. The RSS measures the sum of the squared differences between predicted and actual values, while R-squared quantifies the proportion of variance explained by the model.
    7. Define entropy as it relates to decision trees. In the context of decision trees, entropy measures the impurity or randomness of a data set. A higher entropy value indicates a more mixed class distribution, and decision trees attempt to reduce entropy by splitting data into more pure subsets.
    8. What are dummy variables and why are they used in linear regression? Dummy variables are binary variables (0 or 1) used to represent categorical variables in a regression model. They are used to include categorical data in linear regression without misinterpreting the nature of the categorical variables.
    9. Why is it necessary to split data into training and testing sets? Splitting data into training and testing sets allows for training the model on one subset of data and then evaluating its performance on a different, unseen subset. This prevents overfitting and helps determine how well the model generalizes to new, real-world data.
    10. What is the role of the learning rate in gradient descent? The learning rate (or step size) determines how much the model’s parameters are adjusted during each iteration of gradient descent. A smaller learning rate means smaller steps toward the minimum, while a rate that is too large can cause overshooting or oscillation; the learning rate is also distinct from momentum.
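
    For reference, the standard gradient descent update assumed in question 10 can be written as follows, where $\alpha$ is the learning rate and $J(\theta)$ is the error function being minimized:

    $$\theta_{t+1} = \theta_t - \alpha \,\nabla_{\theta} J(\theta_t)$$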

    Answer Key

    1. Machine learning algorithms can analyze data related to crop health and soil conditions to make data-driven recommendations, which allows farmers to optimize their yield and revenue by using resources more effectively.
    2. The Central Limit Theorem is important because it allows data scientists to make inferences about a population by analyzing a sample, and it allows them to understand the distribution of sample means which is a building block to statistical analysis.
    3. Supervised learning uses labeled data with defined inputs and outputs for model training, while unsupervised learning works with unlabeled data to discover structures and patterns without predefined results.
    4. K-Nearest Neighbors, Decision Trees, and Random Forests are some of the most popular supervised learning algorithms. Each can be used for classification or regression problems.
    5. Bagging involves creating multiple training sets using resampling techniques, which allows multiple models to train before their outputs are averaged or voted on. This increases the stability and robustness of the final output.
    6. Residual Sum of Squares (RSS) measures error while R-squared measures goodness of fit.
    7. Entropy in decision trees measures the impurity or disorder of a dataset. The lower the entropy, the more pure the classification for a given subset of data and vice-versa.
    8. Dummy variables are numerical values (0 or 1) that can represent string or categorical variables in an algorithm. This transformation is often required for regression models that are designed to read numerical inputs.
    9. Data should be split into training and test sets to prevent overfitting, train and evaluate the model, and ensure that it can generalize well to real-world data that it has not seen.
    10. The learning rate is the size of the step taken in each iteration of gradient descent, which determines how quickly the algorithm converges towards the local or global minimum of the error function.

    Essay Questions

    1. Discuss the importance of data preprocessing in machine learning projects. What are some common data preprocessing techniques, and why are they necessary?
    2. Compare and contrast the strengths and weaknesses of different types of machine learning algorithms (e.g., supervised vs. unsupervised, linear vs. non-linear, etc.). Provide specific examples to illustrate your points.
    3. Explain the concept of bias and variance in machine learning. How can these issues be addressed when building predictive models?
    4. Describe the process of building a recommendation system, including the key challenges and techniques involved. Consider different data sources and evaluation methods.
    5. Discuss the ethical considerations that data scientists should take into account when working on machine learning projects. How can fairness and transparency be ensured in the development of AI systems?

    Glossary

    • Adam: An optimization algorithm that combines the benefits of AdaGrad and RMSprop, often used for training neural networks.
    • Bagging: A machine learning ensemble method that creates multiple models using random subsets of the training data to reduce variance.
    • Boosting: A machine learning ensemble method that combines weak learners into a strong learner by iteratively focusing on misclassified samples.
    • Central Limit Theorem: A theorem stating that the distribution of sample means approaches a normal distribution as the sample size increases.
    • Classification: A machine learning task that involves predicting the category or class of a given data point.
    • Clustering: An unsupervised learning technique that groups similar data points into clusters.
    • Confidence Interval: A range of values that is likely to contain the true population parameter with a certain level of confidence.
    • Cosine Similarity: A measure of similarity between two non-zero vectors, often used in recommendation systems.
    • DBSCAN: A density-based clustering algorithm that identifies clusters based on data point density.
    • Decision Trees: A supervised learning algorithm that uses a tree-like structure to make decisions based on input features.
    • Dummy Variable: A binary variable (0 or 1) used to represent categorical variables in a regression model.
    • Entropy: A measure of disorder or randomness in a dataset, particularly used in decision trees.
    • Feature Engineering: The process of transforming raw data into features that can be used in machine learning models.
    • Gradient Descent: An optimization algorithm used to minimize the error function of a model by iteratively updating parameters.
    • Heteroskedasticity: A condition in which the variance of the error terms in a regression model is not constant across observations.
    • Homoskedasticity: A condition in which the variance of the error terms in a regression model is constant across observations.
    • Hypothesis Testing: A statistical method used to determine whether there is enough evidence to reject a null hypothesis.
    • Inferential Statistics: A branch of statistics that deals with drawing conclusions about a population based on a sample of data.
    • K-Means: A clustering algorithm that partitions data points into a specified number of clusters based on their distance from cluster centers.
    • K-Nearest Neighbors (KNN): A supervised learning algorithm that classifies or predicts data based on the majority class among its nearest neighbors.
    • Law of Large Numbers: A theorem stating that as the sample size increases, the sample mean will converge to the population mean.
    • Linear Discriminant Analysis (LDA): A dimensionality reduction and classification technique that finds linear combinations of features to separate classes.
    • Logarithm: The inverse operation of exponentiation, used to find the exponent required to reach a certain value.
    • Mini-batch Gradient Descent: An optimization method that updates parameters based on a subset of the training data in each iteration.
    • Momentum (in Gradient Descent): A technique used with gradient descent that adds a fraction of the previous parameter update to the current update, which reduces oscillations during the search for local or global minima.
    • Multicollinearity: A condition in which independent variables in a regression model are highly correlated with each other.
    • Ordinary Least Squares (OLS): A method for estimating the parameters of a linear regression model by minimizing the sum of squared residuals.
    • Overfitting: When a model learns the training data too well and cannot generalize to unseen data.
    • P-value: The probability of obtaining a result as extreme as the observed result, assuming the null hypothesis is true.
    • Random Forest: An ensemble learning method that combines multiple decision trees to make predictions.
    • Regression: A machine learning task that involves predicting a continuous numerical output.
    • Residual: The difference between the actual value of the dependent variable and the value predicted by a regression model.
    • Residual Sum of Squares (RSS): A metric that calculates the sum of the squared differences between the actual and predicted values.
    • RMSprop: An optimization algorithm that adapts the learning rate for each parameter based on the root mean square of past gradients.
    • R-squared (R²): A statistical measure that indicates the proportion of variance in the dependent variable that is explained by the independent variables in a regression model.
    • Standard Deviation: A measure of the amount of variation or dispersion in a set of values.
    • Statistical Significance: A concept that determines if a given finding is likely not due to chance; statistical significance is determined through the calculation of a p-value.
    • Stochastic Gradient Descent (SGD): An optimization algorithm that updates parameters based on a single random sample of the training data in each iteration.
    • Stop Words: Common words in a language that are often removed from text during preprocessing (e.g., “the,” “is,” “a”).
    • Supervised Learning: A type of machine learning where a model is trained using labeled data to make predictions.
    • Unsupervised Learning: A type of machine learning where a model is trained using unlabeled data to discover patterns or clusters.

    AI, Machine Learning, and Data Science Foundations

    Briefing Document: AI, Machine Learning, and Data Science Foundations

    Overview

    This document summarizes key concepts and techniques discussed in the provided material. The sources primarily cover a range of topics, including: foundational mathematical and statistical concepts, various machine learning algorithms, deep learning and generative AI, model evaluation techniques, practical application examples in customer segmentation and sales analysis, and finally optimization methods and concepts related to building a recommendation system. The materials appear to be derived from a course or a set of educational resources aimed at individuals seeking to develop skills in AI, machine learning and data science.

    Key Themes and Ideas

    1. Foundational Mathematics and Statistics
    • Essential Math Concepts: A strong foundation in mathematics is crucial. The materials emphasize understanding exponents, logarithms, the mathematical constant “e,” and pi, and in particular how these quantities behave under differentiation, since many machine learning algorithms rely on derivatives. For instance, the material stresses knowing what a logarithm is at base 2, base e, and base 10, and how logarithms and exponents transform when you take their derivatives.
    • Statistical Foundations: The course emphasizes descriptive and inferential statistics. Descriptive measures include “distance measures” and “variational measures.” Inferential statistics requires an understanding of theorems such as the Central Limit Theorem and the law of large numbers, along with the concepts of population versus sample, unbiased samples, hypothesis testing, confidence intervals, and statistical significance, and how to test different theories using these ideas.
    2. Machine Learning Algorithms:
    • Supervised Learning: The course covers various supervised learning algorithms, including:
    • “Linear discriminant analysis” (LDA): Used for classification by combining multiple features to predict outcomes, as shown in the example of predicting movie preferences by combining movie length and genre.
    • “K-Nearest Neighbors” (KNN)
    • “Decision Trees”: Used for both classification and regression tasks.
    • “Random Forests”: An ensemble method that combines multiple decision trees.
    • Boosting Algorithms (e.g., LightGBM, GBM, XGBoost): Another approach to improve model performance by sequentially training models, where each new model incorporates the previous model’s (or “stump’s”) errors.
    • Unsupervised Learning:
    • “K-Means”: A clustering algorithm for grouping data points. An example is customer segmentation by transaction history: “you can, for instance, use K-means, DBSCAN, or hierarchical clustering, then evaluate your clustering algorithms and select the one that performs best.”
    • “DBSCAN”: A density-based clustering algorithm, noted for its increasing popularity.
    • “Hierarchical Clustering”: Another approach to clustering.
    • Bagging: An ensemble method used to reduce variance and create more stable predictions, exemplified through a weight loss prediction based on “daily calorie intake and workout duration.”
    • AdaBoost: An algorithm where “each stump is made by using the previous stump’s errors”, also used for building prediction models, exemplified with a housing price prediction project.
    3. Deep Learning and Generative AI
    • Optimization Algorithms: The material introduces the need for optimization techniques such as AdamW and RMSprop.
    • Generative Models: The course touches upon more advanced topics including variational autoencoders and large language models.
    • Natural Language Processing (NLP): It emphasizes the importance of understanding concepts like n-grams, attention mechanisms (both self-attention and multi-head self-attention), the encoder-decoder architecture of Transformers, and related models such as GPTs or BERT. The sources emphasize that if you want to move towards the NLP side of generative AI and understand how ChatGPT was invented, how GPTs work, or how the BERT model works, then you will definitely need to get into the topic of language models.
    4. Model Evaluation
    • Regression Metrics: The document introduces the “residual sum of squares” (RSS) as a common metric for evaluating linear regression models. The formula is RSS = Σᵢ (yᵢ − ŷᵢ)² for i = 1, …, n, the sum of the squared differences between the actual values yᵢ and the predictions ŷᵢ.
    • Clustering Metrics: The course mentions entropy and the silhouette score, which is “a measure of the similarity of the data point to its own cluster compared to the other clusters”.
    • Regularization: The use of L2 regularization is mentioned, where lambda (λ ≥ 0) is the tuning parameter or penalty, and “the Lambda serves to control the relative impact of the penalty on the regression coefficient estimates.” (A small numeric sketch of the regression metrics follows below.)
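
    The RSS and R-squared definitions above can be checked in a few lines of NumPy. The following is a minimal illustrative sketch with made-up values, not code from the course itself:

```python
import numpy as np

# Hypothetical actual and predicted values from a small regression example
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.4])

# Residual Sum of Squares: RSS = sum over i of (y_i - y_hat_i)^2
rss = np.sum((y - y_hat) ** 2)

# R-squared = 1 - RSS / TSS, where TSS is the total sum of squares around the mean
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

print(f"RSS = {rss:.3f}, R^2 = {r_squared:.3f}")
```
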
    5. Practical Applications and Case Studies:
    • Customer Segmentation: Clustering algorithms (K-means, DBScan) can be used to segment customers based on transaction history.
    • Sales Analysis: The material includes analysis of customer types, “consumer, corporate, and home office”, top spending customers, and sales trends over time. There is a suggestion that “a seasonal Trend” might be apparent if a longer time period is considered.
    • Geographic Sales Mapping: The material includes using maps to visualize sales per state, which is deemed helpful for companies looking to expand into new geographic areas.
    • Housing Price Prediction: A linear regression model is applied to predict house prices using features like median income, average rooms, and proximity to the ocean. An important note is made about the definition of “residual” in this context: do not confuse the error with the residual; the error can never be observed or calculated, but you can predict it, and that prediction is the residual.
    6. Linear Regression and OLS
    • Regression Model: The document explains that the linear regression model aims to estimate the relationship between independent and dependent variables. It emphasizes that beta zero (β₀) is not a variable: it is called the intercept or constant, it does not appear in the data, and it is one of the parameters of linear regression, an unknown number that the model must estimate.
    • Ordinary Least Squares (OLS): OLS is a core method to minimize the “sum of squared residuals”. The material states that “the OLS tries to find the line that will minimize its value”.
    • Assumptions: The materials mention the assumption of constant error variance (homoscedasticity), and note that you can check this assumption by plotting the residuals and looking for a funnel-like pattern. The importance of using the correct statistical test is also highlighted when considering p-values.
    • Dummy Variables: Categorical features must be transformed into dummy variables before they can be used in a linear regression model, with the warning that “you always need to drop at least one of the categories” to avoid the multicollinearity problem. In Python, the get_dummies function from pandas is used to go from this one categorical variable to five different variables, one per category (see the regression sketch after this list).
    • Variable Interpretation: Coefficients in a linear regression model represent the impact of an independent variable on the dependent variable. For example, the material notes that when total_rooms increases by one additional unit (one more room), the house value changes by −2.67.
    • Model Summary Output: The materials discuss interpreting model output metrics such as R-squared, described as the metric that showcases the goodness of fit of the model, and how to interpret p-values.
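
    To make the dummy-variable and OLS discussion above concrete, here is a minimal sketch using pandas get_dummies and statsmodels OLS. The DataFrame, column names, and values are hypothetical stand-ins for the course’s housing data, not the actual dataset:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical housing data; column names and values are illustrative only
df = pd.DataFrame({
    "median_income":   [8.3, 7.2, 5.6, 3.8, 4.0, 2.1, 6.4, 3.1],
    "total_rooms":     [880, 7099, 1467, 1274, 1627, 919, 2535, 3104],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND",
                        "NEAR OCEAN", "NEAR OCEAN", "INLAND", "NEAR BAY"],
    "house_value":     [452.6, 358.5, 352.1, 341.3, 342.2, 269.7, 310.0, 241.4],
})

# Turn the categorical column into dummy variables; drop the first category
# to avoid perfect multicollinearity (the "dummy variable trap").
dummies = pd.get_dummies(df["ocean_proximity"], drop_first=True, dtype=float)
X = pd.concat([df[["median_income", "total_rooms"]], dummies], axis=1)
X = sm.add_constant(X)          # adds the intercept (beta zero)
y = df["house_value"]

# Ordinary Least Squares minimizes the sum of squared residuals
model = sm.OLS(y, X).fit()
print(model.summary())          # coefficients, p-values, R-squared
```
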
    7. Recommendation Systems
    • Feature Engineering: A critical step is identifying and engineering the appropriate features, with the recommendation system based on “data points you use to make decisions about what to recommend”.
    • Text Preprocessing: Text data must be cleaned and preprocessed, including removing “stop words” and vectorizing using TF-IDF or similar methods. A worked example counts which vocabulary terms appear in a movie description and which do not, yielding a binary vector such as 0 0 1 1 1 1 0 0 0.
    • Cosine Similarity: A technique to find the similarity between text vectors. Cosine similarity is defined as the dot product of two vectors divided by the product of their magnitudes.
    • Recommending: The system then recommends the items with the highest cosine similarity scores; in the example, five movies are recommended, though you can recommend as many as you like.
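
    A minimal sketch of the TF-IDF and cosine-similarity steps described above, using scikit-learn; the movie titles and descriptions are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie descriptions; titles and text are illustrative only
movies = {
    "Space Rescue":    "astronaut stranded on mars fights to survive",
    "Galaxy Quest II": "crew of a spaceship explores a distant galaxy",
    "Kitchen Stories": "a chef opens a small restaurant in paris",
}

titles = list(movies.keys())
vectorizer = TfidfVectorizer(stop_words="english")     # drops common stop words
tfidf = vectorizer.fit_transform(list(movies.values()))

# Pairwise cosine similarity between all movie vectors
sim = cosine_similarity(tfidf)

# Recommend the titles most similar to the first movie, excluding itself
query = 0
ranked = sim[query].argsort()[::-1]
recommendations = [titles[i] for i in ranked if i != query]
print(recommendations)
```
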
    8. Career Advice and Perspective
    • The Importance of a Plan: The material emphasizes the value of creating a career plan and focusing on actionable steps. The advice is that this kind of plan makes you focus, because without that focus you can drift anywhere and lose your way.
    • Learning by Doing: The speaker advocates doing smaller projects to prove your abilities, especially as a junior data scientist. The best way is simply to do the work: even if the smaller tasks are boring or do not seem to lead anywhere, that kind of work shows what you can do.
    • Business Acumen: Data scientists should focus on how their work provides value to the business; a data scientist is someone who brings value to the business and informs its decision-making.
    • Personal Branding: Building a personal brand is also seen as important, with the recommendation that “having a newsletter and having a LinkedIn following” can help. Technical portfolio sites like “GitHub” are recommended.
    • Data Scientist Skills: The ability to show your thought process and motivation is important in data science interviews. As the speaker notes, interviewers want to know how your thought process works, what motivated you to do this kind of project, to write this kind of code, and to present this kind of result.
    • Future of Data Science: Data science is predicted to become “invaluable to the business”, especially given the current rapid development of AI.
    • Business Fundamentals: The importance of thinking about the needs-based aspect of a business: it must be something people need, as in “if my roof was leaking and it’s raining outside and I’m in my house … and water is pouring on my head, I have to fix that whether I’m broke or not.”
    • Entrepreneurship: The importance of planning, inspired by the speaker’s experience as a pilot: “pilots don’t take off unless we know where we’re going”.
    • Growth: The experience at GE, which was “growing so fast it was doubling in size every three years”, really informed the speaker’s thinking about growth.
    • Mergers and Acquisitions (M&A): The business principle of using debt to buy underpriced assets that can later be sold at a higher multiple for profit.
    9. Optimization
    • Gradient Descent (GD): The updated weight equals the current weight minus the learning rate times the gradient (w ← w − η·∂L/∂w), and “the same we also do for our second parameter, which is the bias factor”.
    • Stochastic Gradient Descent (SGD): SGD differs from GD in that it “uses the gradient from a single data point which is just one observation in order to update our parameters”. This makes it “much faster and computationally much less expensive compared to the GD”.
    • SGD With Momentum: SGD with momentum addresses the disadvantages of the basic SGD algorithm.
    • Mini-Batch Gradient Descent: A trade-off between the two, and “it tries to strike a balance by selecting smaller batches and calculating the gradient over them”.
    • RMSprop: RMSprop is introduced as an algorithm for controlling learning rates: for parameters with small gradients, the effective learning rate is increased to ensure those updates do not vanish. (A minimal sketch of these update rules follows below.)
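
    As a rough illustration of the update rules above, here is a small NumPy sketch of mini-batch gradient descent with momentum on a toy one-feature linear model; the data and constants are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a one-feature linear model: y = 3x + 1 plus a little noise
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0            # parameters to learn
vw, vb = 0.0, 0.0          # velocity terms for momentum
lr, beta = 0.05, 0.9       # learning rate and momentum factor
batch_size = 16            # mini-batch size

for step in range(500):
    # Mini-batch gradient descent: compute the gradient on a small random subset
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    err = (w * xb + b) - yb
    grad_w = 2 * np.mean(err * xb)   # d(MSE)/dw
    grad_b = 2 * np.mean(err)        # d(MSE)/db

    # Momentum: keep a fraction of the previous update to damp oscillations
    vw = beta * vw + lr * grad_w
    vb = beta * vb + lr * grad_b
    w -= vw
    b -= vb

print(f"estimated w = {w:.2f}, b = {b:.2f}")   # should be close to 3 and 1
```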

    Conclusion

    These materials provide a broad introduction to data science, machine learning, and AI. They cover mathematical and statistical foundations, various algorithms (both supervised and unsupervised), deep learning concepts, model evaluation, and provide case studies to illustrate the practical application of such techniques. The inclusion of career advice and reflections makes it a very holistic learning experience. The information is designed to build a foundational understanding and introduce more complex concepts.

    Essential Concepts in Machine Learning

    Frequently Asked Questions

    • What are some real-world applications of machine learning, as discussed in the context of this course? Machine learning has diverse applications, including optimizing crop yields by monitoring soil health, and predicting customer preferences, such as in the entertainment industry as seen with Netflix’s recommendations. It’s also useful in customer segmentation (identifying “good”, “better”, and “best” customers based on transaction history) and creating personalized recommendations (like prioritizing movies based on a user’s preferred genre). Further, machine learning can help companies decide which geographic areas are most promising for their products based on sales data and can help investors identify which features of a house are correlated with its value.
    • What are the core mathematical concepts that are essential for understanding machine learning and data science? A foundational understanding of several mathematical concepts is critical. This includes: the idea of using variables with different exponents (e.g., X, X², X³), understanding logarithms at different bases (base 2, base e, base 10), comprehending the meaning of ‘e’ and ‘Pi’, mastering exponents and logarithms and how they transform when taking derivatives. A fundamental understanding of descriptive (distance measures, variational measures) and inferential statistics (central limit theorem, law of large numbers, population vs. sample, hypothesis testing) is also essential.
    • What specific machine learning algorithms should I be familiar with, and what are their uses? The course highlights the importance of both supervised and unsupervised learning techniques. For supervised learning, you should know linear discriminant analysis (LDA), K-Nearest Neighbors (KNN), decision trees (for both classification and regression), random forests, and boosting algorithms like LightGBM, GBM, and XGBoost. For unsupervised learning, understanding K-Means clustering, DBSCAN, and hierarchical clustering is crucial. These algorithms are used in various applications like classification, clustering, and regression.
    • How can I assess the performance of my machine learning models? Several metrics are used to evaluate model performance, depending on the task at hand. For regression models, the residual sum of squares (RSS) is crucial; it measures the difference between predicted and actual values. Metrics like entropy, the Gini index, and the silhouette score (which measures the similarity of a data point to its own cluster vs. other clusters) are used for evaluating classification and clustering models. Additionally, concepts like the penalty term, used to control the impact of model complexity, and the L2 norm used in regression are highlighted as important for proper evaluation.
    • What is the significance of linear regression and what key concepts should I know? Linear regression is used to model the relationship between a dependent variable (Y) and one or more independent variables (X). A crucial aspect is estimating coefficients (betas) and intercepts which quantify these relationships. It is key to understand concepts like the residuals (differences between predicted and actual values), and how ordinary least squares (OLS) is used to minimize the sum of squared residuals. In understanding linear regression, it is also important not to confuse errors (which are never observed and can’t be calculated) with residuals (which are predictions of errors). It’s also crucial to be aware of assumptions about your errors and their variance.
    • What are dummy variables, and why are they used in modeling? Dummy variables are binary (0 or 1) variables used to represent categorical data in regression models. When transforming categorical variables like ocean proximity (with categories such as near bay, inland, etc.), each category becomes a separate dummy variable. The “1” indicates that a condition is met, and a “0” indicates that it is not. It is essential to drop one of these dummy variables to avoid perfect multicollinearity (where one variable is predictable from other variables) which could cause an OLS violation.
    • What are some of the main ideas behind recommendation systems as discussed in the course? Recommendation systems rely on data points to identify similarities between items to generate personalized results. Text data preprocessing is often done using techniques like tokenization, removing stop words, and stemming to convert data into vectors. Cosine similarity is used to measure the angle between two vector representations. This allows one to calculate how similar different data points (such as movies) are, based on common features (like genre, plot keywords). For example, a movie can be represented as a vector in a high-dimensional space that captures different properties about the movie. This approach enables recommendations based on calculated similarity scores.
    • What key steps and strategies are recommended for aspiring data scientists? The course emphasizes several critical steps. It’s important to start with projects to demonstrate the ability to apply data science skills. This includes going beyond basic technical knowledge and considering the “why” behind projects. A focus on building a personal brand, which can be done through online platforms like LinkedIn, GitHub, and Medium is recommended. Understanding the business value of data science is key, which includes communicating project findings effectively. Also emphasized is creating a career plan and acting responsibly for your career choices. Finally, focusing on a niche or specific sector is recommended to ensure that one’s technical skills match the business needs.

    Fundamentals of Machine Learning

    Machine learning (ML) is a branch of artificial intelligence (AI) that builds models based on data, learns from that data, and makes decisions [1]. ML is used across many industries, including healthcare, finance, entertainment, marketing, and transportation [2-9].

    Key Concepts in Machine Learning:

    • Supervised Learning: Algorithms are trained using labeled data [10]. Examples include regression and classification models [11].
    • Regression: Predicts continuous values, such as house prices [12, 13].
    • Classification: Predicts categorical values, such as whether an email is spam [12, 14].
    • Unsupervised Learning: Algorithms are trained using unlabeled data, and the model must find patterns without guidance [11]. Examples include clustering and outlier detection techniques [12].
    • Semi-Supervised Learning: A combination of supervised and unsupervised learning [15].

    Machine Learning Algorithms:

    • Linear Regression: A statistical or machine learning method used to model the impact of a change in a variable [16, 17]. It can be used for causal analysis and predictive analytics [17].
    • Logistic Regression: Used for classification, especially with binary outcomes [14, 15, 18].
    • K-Nearest Neighbors (KNN): A classification algorithm [19, 20].
    • Decision Trees: Can be used for both classification and regression [19, 21]. They are transparent and handle diverse data, making them useful in various industries [22-25].
    • Random Forest: An ensemble learning method that combines multiple decision trees, suitable for classification and regression [19, 26, 27].
    • Boosting Algorithms: Such as AdaBoost, LightGBM, GBM, and XGBoost, build trees using information from previous trees to improve performance [19, 28, 29].
    • K-Means: A clustering algorithm [19, 30].
    • DBSCAN: A clustering algorithm that is becoming increasingly popular [19].
    • Hierarchical Clustering: Another clustering technique [19, 30].

    Important Steps in Machine Learning:

    • Data Preparation: This involves splitting data into training and test sets and handling missing values [31-33].
    • Feature Engineering: Identifying and selecting the most relevant data points (features) to be used by the model to generate the most accurate results [34, 35].
    • Model Training: Selecting an appropriate algorithm and training it on the training data [36].
    • Model Evaluation: Assessing model performance using appropriate metrics [37].

    Model Evaluation Metrics:

    • Regression Models:
    • Residual Sum of Squares (RSS) [38].
    • Mean Squared Error (MSE) [38, 39].
    • Root Mean Squared Error (RMSE) [38, 39].
    • Mean Absolute Error (MAE) [38, 39].
    • Classification Models:
    • Accuracy: Proportion of correctly classified instances [40].
    • Precision: Measures the accuracy of positive predictions [40].
    • Recall: Measures the model’s ability to identify all positive instances [40].
    • F1 Score: Combines precision and recall into a single metric [39, 40].
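
    The regression and classification metrics listed above are available in scikit-learn (RSS can be computed directly with NumPy). A minimal sketch with hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression metrics on hypothetical predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
rss  = np.sum((y_true - y_pred) ** 2)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_true, y_pred)
print(f"RSS={rss:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")

# Classification metrics on hypothetical labels (1 = positive class)
c_true = [1, 0, 1, 1, 0, 1]
c_pred = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(c_true, c_pred))
print("precision:", precision_score(c_true, c_pred))
print("recall   :", recall_score(c_true, c_pred))
print("F1 score :", f1_score(c_true, c_pred))
```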

    Bias-Variance Tradeoff:

    • Bias: The inability of a model to capture the true relationship in the data [41]. Complex models tend to have low bias but high variance [41-43].
    • Variance: The sensitivity of a model to changes in the training data [41-43]. Simpler models have low variance but high bias [41-43].
    • Overfitting: Occurs when a model learns the training data too well, including noise [44, 45]. This results in poor performance on unseen data [44].
    • Underfitting: Occurs when a model is too simple to capture the underlying patterns in the data [45].

    Techniques to address overfitting:

    • Reducing model complexity: Using simpler models to reduce the chances of overfitting [46].
    • Cross-validation: Using different subsets of data for training and testing to get a more realistic measure of model performance [46].
    • Early stopping: Monitoring the model performance and stopping the training process when it begins to decrease [47].
    • Regularization techniques: Such as L1 and L2 regularization, which help prevent overfitting by adding penalty terms that reduce the complexity of the model [48-50]. (A short sketch follows this list.)
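
    As a small illustration of regularization combined with cross-validation, here is a sketch using Ridge (L2) regression on synthetic data; the alpha values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real training set
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# L2 regularization (Ridge): alpha is the penalty strength (lambda in the notes)
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    # 5-fold cross-validation gives a more realistic estimate of generalization
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean R^2 = {scores.mean():.3f}")
```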

    Python and Machine Learning:

    • Python is a popular programming language for machine learning because it has a lot of libraries, including:
    • Pandas: For data manipulation and analysis [51].
    • NumPy: For numerical operations [51, 52].
    • Scikit-learn (sklearn): For machine learning algorithms and tools [13, 51-59].
    • SciPy: For scientific computing [51].
    • NLTK: For natural language processing [51].
    • TensorFlow and PyTorch: For deep learning [51, 60, 61].
    • Matplotlib: For data visualization [52, 62, 63].
    • Seaborn: For data visualization [62].

    Natural Language Processing (NLP):

    • NLP is used to process and analyze text data [64, 65].
    • Key steps include: text cleaning (lowercasing, punctuation removal, tokenization, stemming, and lemmatization), and converting text to numerical data with techniques such as TF-IDF, word embeddings, subword embeddings and character embeddings [66-68].
    • NLP is used in applications such as chatbots, virtual assistants, and recommender systems [7, 8, 66].
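
    A minimal sketch of the text-cleaning steps above (lowercasing, punctuation removal, tokenization, stop-word removal, and stemming) using NLTK; the sentence is invented, and the stop-word list must be downloaded once:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time download of the NLTK stop-word list
nltk.download("stopwords", quiet=True)

text = "The movie was surprisingly good, and the actors were brilliant!"

# Lowercase, strip punctuation, and tokenize on word characters
tokens = re.findall(r"[a-z]+", text.lower())

# Remove stop words ("the", "was", "and", ...)
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]

# Stemming reduces each word to its root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(stems)   # e.g. ['movi', 'surprisingli', 'good', 'actor', 'brilliant']
```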

    Deep Learning:

    • Deep learning is an advanced form of machine learning that uses neural networks with multiple layers [7, 60, 68].
    • Examples include:
    • Recurrent Neural Networks (RNNs) [69, 70].
    • Artificial Neural Networks (ANNs) [69].
    • Convolutional Neural Networks (CNNs) [69, 70].
    • Generative Adversarial Networks (GANs) [69].
    • Transformers [8, 61, 71-74].

    Practical Applications of Machine Learning:

    • Recommender Systems: Suggesting products, movies, or jobs to users [6, 9, 64, 75-77].
    • Predictive Analytics: Using data to forecast future outcomes, such as house prices [13, 17, 78].
    • Fraud Detection: Identifying fraudulent transactions in finance [4, 27, 79].
    • Customer Segmentation: Grouping customers based on their behavior [30, 80].
    • Image Recognition: Classifying images [14, 81, 82].
    • Autonomous Vehicles: Enabling self-driving cars [7].
    • Chatbots and virtual assistants: Providing automated customer support using NLP [8, 18, 83].

    Career Paths in Machine Learning:

    • Machine Learning Researcher: Focuses on developing and testing new machine learning algorithms [84, 85].
    • Machine Learning Engineer: Focuses on implementing and deploying machine learning models [85-87].
    • AI Researcher: Similar to machine learning researcher but focuses on more advanced models like deep learning and generative AI [70, 74, 88].
    • AI Engineer: Similar to machine learning engineer but works with more advanced AI models [70, 74, 88].
    • Data Scientist: A broad role that uses data analysis, statistics, and machine learning to solve business problems [54, 89-93].

    Additional Considerations:

    • It’s important to develop not only technical skills, but also communication skills, business acumen, and the ability to translate business needs into data science problems [91, 94-96].
    • A strong data science portfolio is key for getting into the field [97].
    • Continuous learning is essential to keep up with the latest technology [98, 99].
    • Personal branding can open up many opportunities [100].

    This overview should provide a strong foundation in the fundamentals of machine learning.

    A Comprehensive Guide to Data Science

    Data science is a field that uses data analysis, statistics, and machine learning to solve business problems [1, 2]. It is a broad field with many applications, and it is becoming increasingly important in today’s world [3]. Data science is not just about crunching numbers; it also involves communication, business acumen, and translation skills [4].

    Key Aspects of Data Science:

    • Data Analysis: Examining data to understand patterns and insights [5, 6].
    • Statistics: Applying statistical methods to analyze data, test hypotheses and make inferences [7, 8].
    • Descriptive statistics, which includes measures like mean, median, and standard deviation, helps in summarizing data [8].
    • Inferential statistics, which involves concepts like the central limit theorem and hypothesis testing, help in drawing conclusions about a population based on a sample [9].
    • Probability distributions are also important in understanding machine learning concepts [10].
    • Machine Learning (ML): Using algorithms to build models based on data, learn from it, and make decisions [2, 11-13].
    • Supervised learning involves training algorithms on labeled data for tasks like regression and classification [13-16]. Regression is used to predict continuous values, while classification is used to predict categorical values [13, 17].
    • Unsupervised learning involves training algorithms on unlabeled data to identify patterns, as in clustering and outlier detection [13, 18, 19].
    • Programming: Using programming languages such as Python to implement data science techniques [20]. Python is popular due to its versatility and many libraries [20, 21].
    • Libraries such as Pandas and NumPy are used for data manipulation [22, 23].
    • Scikit-learn is used for implementing machine learning models [22, 24, 25].
    • TensorFlow and PyTorch are used for deep learning [22, 26].
    • Libraries such as Matplotlib and Seaborn are used for data visualization [17, 25, 27, 28].
    • Data Visualization: Representing data through charts, graphs, and other visual formats to communicate insights [25, 27].
    • Business Acumen: Understanding business needs and translating them into data science problems and solutions [4, 29].
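
    To illustrate the descriptive and inferential statistics mentioned above, here is a small sketch using NumPy and SciPy on a hypothetical sample; the numbers and the null hypothesis are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of daily sales figures
sample = np.array([102, 98, 110, 95, 105, 99, 107, 101, 96, 104])

# Descriptive statistics: summarize the sample
print("mean:", sample.mean(), "median:", np.median(sample), "std:", sample.std(ddof=1))

# Inferential statistics: one-sample t-test of the null hypothesis
# that the true population mean equals 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")

# 95% confidence interval for the mean, based on the t distribution
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI:", ci)
```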

    The Data Science Process:

    1. Data Collection: Gathering relevant data from various sources [30].
    2. Data Preparation: Cleaning and preprocessing data, which involves:
    • Handling missing values by removing or imputing them [31, 32].
    • Identifying and removing outliers [32-35].
    • Data wrangling: transforming and cleaning data for analysis [6].
    • Data exploration: using descriptive statistics and data visualization to understand the data [36-39].
    • Data Splitting: Dividing data into training, validation, and test sets [14].
    3. Feature Engineering: Identifying, selecting, and transforming variables [40, 41].
    4. Model Training: Selecting an appropriate algorithm, training it on the training data, and optimizing it with validation data [14].
    5. Model Evaluation: Assessing model performance using relevant metrics on the test data [14, 42].
    6. Deployment and Communication: Communicating results and translating them into actionable insights for stakeholders [43].
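
    A minimal sketch of the data-preparation and splitting steps above, using pandas and scikit-learn; the DataFrame and column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data with a missing value
df = pd.DataFrame({
    "income": [52.0, 61.5, np.nan, 48.0, 75.2, 58.3, 44.1, 66.0],
    "rooms":  [4, 5, 3, 4, 6, 5, 3, 5],
    "price":  [230, 310, 180, 210, 400, 290, 170, 330],
})

# Data preparation: impute the missing income with the column median
df["income"] = df["income"].fillna(df["income"].median())

X = df[["income", "rooms"]]
y = df["price"]

# First split off a test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), "train /", len(X_val), "validation /", len(X_test), "test rows")
```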

    Applications of Data Science:

    • Business and Finance: Customer segmentation, fraud detection, credit risk assessment [44-46].
    • Healthcare: Disease diagnosis, risk prediction, treatment planning [46, 47].
    • Operations Management: Optimizing decision-making using data [44].
    • Engineering: Fault diagnosis [46-48].
    • Biology: Classification of species [47-49].
    • Customer service: Developing troubleshooting guides and chatbots [47-49].
    • Recommender systems are used in entertainment, marketing, and other industries to suggest products or movies to users [30, 50, 51].
    • Predictive Analytics are used to forecast future outcomes [24, 41, 52].

    Key Skills for Data Scientists:

    • Technical Skills: Proficiency in programming languages such as Python and knowledge of relevant libraries. Also expertise in statistics, mathematics, and machine learning [20].
    • Communication Skills: Ability to communicate results to technical and non-technical audiences [4, 43].
    • Business Skills: Understanding business requirements and translating them into data-driven solutions [4, 29].
    • Problem-solving skills: Ability to define, analyze, and solve complex problems [4, 29].

    Career Paths in Data Science:

    • Data Scientist
    • Machine Learning Engineer
    • AI Engineer
    • Data Science Manager
    • NLP Engineer
    • Data Analyst

    Additional Considerations:

    • A strong portfolio demonstrating data science projects is essential to showcase practical skills [53-56].
    • Continuous learning is necessary to keep up with the latest technology in the field [57].
    • Personal branding can enhance opportunities in data science [58-61].
    • Data scientists must be able to adapt to the evolving landscape of AI and machine learning [62, 63].

    This information should give a comprehensive overview of the field of data science.

    Artificial Intelligence: Applications Across Industries

    Artificial intelligence (AI) has a wide range of applications across various industries [1, 2]. Machine learning, a branch of AI, is used to build models based on data and learn from this data to make decisions [1].

    Here are some key applications of AI:

    • Healthcare: AI is used in the diagnosis of diseases, including cancer, and for identifying severe effects of illnesses [3]. It also helps with drug discovery, personalized medicine, treatment plans, and improving hospital operations [3, 4]. Additionally, AI helps in predicting the number of patients that a hospital can expect in the emergency room [4].
    • Finance: AI is used for fraud detection in credit card and banking operations [5]. It is also used in trading, combined with quantitative finance, to help traders make decisions about stocks, bonds, and other assets [5].
    • Retail: AI helps in understanding and estimating demand for products, determining the most appropriate warehouses for shipping, and building recommender systems and search engines [5, 6].
    • Marketing: AI is used to understand consumer behavior and target specific groups, which helps reduce marketing costs and increase conversion rates [7, 8].
    • Transportation: AI is used in autonomous vehicles and self-driving cars [8].
    • Natural Language Processing (NLP): AI is behind applications such as chatbots, virtual assistants, and large language models [8, 9]. These tools use text data to answer questions and provide information [9].
    • Smart Home Devices: AI powers smart home devices like Alexa [9].
    • Agriculture: AI is used to estimate weather conditions, predict crop production, monitor soil health, and optimize crop yields [9, 10].
    • Entertainment: AI is used to build recommender systems that suggest movies and other content based on user data. Netflix is a good example of a company that uses AI in this way [10, 11].
    • Customer service: AI powers chatbots that can categorize customer inquiries and provide appropriate responses, reducing wait times and improving support efficiency [12-15].
    • Game playing: AI is used to design AI opponents in games [13, 14, 16].
    • E-commerce: AI is used to provide personalized product recommendations [14, 16].
    • Human Resources: AI helps to identify factors influencing employee retention [16, 17].
    • Fault Diagnosis: AI helps isolate the cause of malfunctions in complex systems by analyzing sensor data [12, 18].
    • Biology: AI is used to categorize species based on characteristics or DNA sequences [12, 15].
    • Remote Sensing: AI is used to analyze satellite imagery and classify land cover types [12, 15].

    In addition to these, AI is also used in many areas of data science, such as customer segmentation [19-21], fraud detection [19-22], credit risk assessment [19-21], and operations management [19, 21, 23, 24].

    Overall, AI is a powerful technology with a wide range of applications that improve efficiency, decision-making, and customer experience in many areas [11].

    Essential Python Libraries for Data Science

    Python libraries are essential tools in data science, machine learning, and AI, providing pre-written functions and modules that streamline complex tasks [1]. Here’s an overview of the key Python libraries mentioned in the sources:

    • Pandas: This library is fundamental for data manipulation and analysis [2, 3]. It provides data structures like DataFrames, which are useful for data wrangling, cleaning, and preprocessing [3, 4]. Pandas is used for tasks such as reading data, handling missing values, identifying outliers, and performing data filtering [3, 5].
    • NumPy: NumPy is a library for numerical computing in Python [2, 3, 6]. It is used for working with arrays and matrices and performing mathematical operations [3, 7]. NumPy is essential for data visualization and other tasks in machine learning [3].
    • Matplotlib: This library is used for creating visualizations like plots, charts, and histograms [6-8]. Specifically, pyplot is a module within Matplotlib used for plotting [9, 10].
    • Seaborn: Seaborn is another data visualization library that is known for creating more appealing visualizations [8, 11].
    • Scikit-learn (sklearn): This library provides a wide range of machine learning algorithms and tools for tasks like regression, classification, clustering, and model evaluation [2, 6, 10, 12]. It includes modules for model selection, ensemble learning, and metrics [13]. Scikit-learn also includes tools for data preprocessing, such as splitting the data into training and testing sets [14, 15].
    • Statsmodels: This library is used for statistical modeling and econometrics and has capabilities for linear regression [12, 16]. It is particularly useful for causal analysis because it provides detailed statistical summaries of model results [17, 18].
    • NLTK (Natural Language Toolkit): This library is used for natural language processing tasks [2]. It is helpful for text data cleaning, such as tokenization, stemming, lemmatization, and stop word removal [19, 20]. NLTK also assists in text analysis and processing [21].
    • TensorFlow and PyTorch: These are deep learning frameworks used for building and training neural networks and implementing deep learning models [2, 22, 23]. They are essential for advanced machine learning tasks, such as building large language models [2].
    • Pickle: This library is used for serializing and deserializing Python objects, which is useful for saving and loading models and data [24, 25].
    • Requests: This library is used for making HTTP requests, which is useful for fetching data from web APIs, like movie posters [25].
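
    As a small illustration of saving and loading a trained model with pickle, here is a sketch using a scikit-learn classifier on the built-in iris dataset; the file name is arbitrary:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model so there is something worth saving
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk with pickle ...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and load it back later, e.g. inside an application that serves predictions
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict(X[:3]))
```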

    These libraries facilitate various stages of the data science workflow [26]:

    • Data loading and preparation: Libraries like Pandas and NumPy are used to load, clean, and transform data [2, 26].
    • Data visualization: Libraries like Matplotlib and Seaborn are used to create plots and charts that help to understand data and communicate insights [6-8].
    • Model training and evaluation: Libraries like Scikit-learn and Statsmodels are used to implement machine learning algorithms, train models, and evaluate their performance [2, 12, 26].
    • Deep learning: Frameworks such as TensorFlow and PyTorch are used for building complex neural networks and deep learning models [2, 22].
    • Natural language processing: Libraries such as NLTK are used for processing and analyzing text data [2, 27].

    Mastering these Python libraries is crucial for anyone looking to work in data science, machine learning, or AI [1, 26]. They provide the necessary tools for implementing a wide array of tasks, from basic data analysis to advanced model building [1, 2, 22, 26].

    Machine Learning Model Evaluation

    Model evaluation is a crucial step in the machine learning process that assesses the performance and effectiveness of a trained model [1, 2]. It involves using various metrics to quantify how well the model is performing, which helps to identify whether the model is suitable for its intended purpose and how it can be improved [2-4]. The choice of evaluation metrics depends on the specific type of machine learning problem, such as regression or classification [5].

    Key Concepts in Model Evaluation:

    • Performance Metrics: These are measures used to evaluate how well a model is performing. Different metrics are appropriate for different types of tasks [5, 6].
    • For regression models, common metrics include:
    • Residual Sum of Squares (RSS): Measures the sum of the squares of the differences between the predicted and true values [6-8].
    • Mean Squared Error (MSE): Calculates the average of the squared differences between predicted and true values [6, 7].
    • Root Mean Squared Error (RMSE): The square root of the MSE, which provides a measure of the error in the same units as the target variable [6, 7].
    • Mean Absolute Error (MAE): Calculates the average of the absolute differences between predicted and true values. MAE is less sensitive to outliers compared to MSE [6, 7, 9].
    • For classification models, common metrics include:
    • Accuracy: Measures the proportion of correct predictions made by the model [9, 10].
    • Precision: Measures the proportion of true positive predictions among all positive predictions made by the model [7, 9, 10].
    • Recall: Measures the proportion of true positive predictions among all actual positive instances [7, 9, 11].
    • F1 Score: The harmonic mean of precision and recall, providing a balanced measure of a model’s performance [7, 9].
    • Area Under the Curve (AUC): A metric used when plotting the Receiver Operating Characteristic (ROC) curve to assess the performance of binary classification models [12].
    • Cross-entropy: A loss function used to measure the difference between the predicted and true probability distributions, often used in classification problems [7, 13, 14].
    • Bias and Variance: These concepts are essential for understanding model performance [3, 15].
    • Bias refers to the error introduced by approximating a real-world problem with a simplified model, which can cause the model to underfit the data [3, 4].
    • Variance measures how much the model’s predictions vary for different training data sets; high variance can cause the model to overfit the data [3, 16].
    • Overfitting and Underfitting: These issues can affect model accuracy [17, 18].
    • Overfitting occurs when a model learns the training data too well, including noise, and performs poorly on new, unseen data [17-19].
    • Underfitting occurs when a model is too simple and cannot capture the underlying patterns in the training data [17, 18].
    • Training, Validation, and Test Sets: Data is typically split into three sets [2, 20]:
    • Training Set: Used to train the model.
    • Validation Set: Used to tune model hyperparameters and prevent overfitting.
    • Test Set: Used to evaluate the final model’s performance on unseen data [20-22].
    • Hyperparameter Tuning: Adjusting model parameters to minimize errors and optimize performance, often using the validation set [21, 23, 24].
    • Cross-Validation: A resampling technique that allows the model to be trained and tested on different subsets of the data to assess its generalization ability [7, 25].
    • K-fold cross-validation divides the data into k subsets or folds and iteratively trains and evaluates the model by using each fold as the test set once [7].
    • Leave-one-out cross-validation uses each data point as a test set, training the model on all the remaining data points [7].
    • Early Stopping: A technique where the model’s performance on a validation set is monitored during the training process, and training is stopped when the performance starts to decrease [25, 26].
    • Ensemble Methods: Techniques that combine multiple models to improve performance and reduce overfitting. Ensemble techniques built on decision trees include random forests and boosting methods such as AdaBoost, Gradient Boosting Machines (GBM), and XGBoost [26]. Bagging is an ensemble technique that reduces variance by training multiple models and averaging the results [27-29].

    Step-by-Step Process for Model Evaluation:

    1. Data Splitting: Divide the data into training, validation, and test sets [2, 20].
    2. Algorithm Selection: Choose an appropriate algorithm based on the problem and data characteristics [24].
    3. Model Training: Train the selected model using the training data [24].
    4. Hyperparameter Tuning: Adjust model parameters using the validation data to minimize errors [21].
    5. Model Evaluation: Evaluate the model’s performance on the test data using chosen metrics [21, 22].
    6. Analysis and Refinement: Analyze the results, make adjustments, and retrain the model if necessary [3, 17, 30].
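
    A minimal sketch of steps 1-5 above using scikit-learn, with cross-validated grid search standing in for the separate validation step; the dataset, model, and parameter grid are chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: data splitting (cross-validation below plays the role of the validation set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-4: choose an algorithm, train it, and tune hyperparameters with cross-validation
param_grid = {"n_estimators": [50, 200], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate the best model on the held-out test set
print("best params  :", search.best_params_)
print("test F1 score:", search.score(X_test, y_test))
```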

    Importance of Model Evaluation:

    • Ensures Model Generalization: It helps to ensure that the model performs well on new, unseen data, rather than just memorizing the training data [22].
    • Identifies Model Issues: It helps in detecting issues like overfitting, underfitting, and bias [17-19].
    • Guides Model Improvement: It provides insights into how the model can be improved through hyperparameter tuning, data collection, or algorithm selection [21, 24, 25].
    • Validates Model Reliability: It validates the model’s ability to provide accurate and reliable results [2, 15].

    Additional Notes:

    • Statistical significance is an important concept in model evaluation to ensure that the results are unlikely to have occurred by random chance [31, 32].
    • When evaluating models, it is important to understand the trade-off between model complexity and generalizability [33, 34].
    • It is important to check the assumptions of the model, for example, when using linear regression, it is essential to check assumptions such as linearity, exogeneity, and homoscedasticity [35-39].
    • Different types of machine learning models should be evaluated using appropriate metrics. For example, classification models use metrics like accuracy, precision, recall, and F1 score, while regression models use metrics like MSE, RMSE, and MAE [6, 9].

    By carefully evaluating machine learning models, one can build reliable systems that address real-world problems effectively [2, 3, 40, 41].

    AI Foundations Course – Python, Machine Learning, Deep Learning, Data Science

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • SQL Fundamentals: Querying, Filtering, and Aggregating Data

    SQL Fundamentals: Querying, Filtering, and Aggregating Data

    The text is a tutorial on SQL, a language for managing and querying data. It highlights the fundamental differences between SQL and spreadsheets, emphasizing the organized structure of data in tables with defined schemas and relationships. The tutorial introduces core SQL concepts like statements, clauses (SELECT, FROM, WHERE), and the logical order of operations. It explains how to retrieve and filter data, perform calculations, aggregate results (SUM, COUNT, AVERAGE), and use window functions for more complex data manipulation without altering the data’s structure. The material also covers advanced techniques such as subqueries, Common Table Expressions (CTEs), and joins to combine data from multiple tables. The tutorial emphasizes the importance of Boolean algebra and provides practical exercises to reinforce learning.

    SQL Study Guide

    Review of Core Concepts

    This study guide focuses on the following key areas:

    • BigQuery Data Organization: How data is structured within BigQuery (Projects, Datasets, Tables).
    • SQL Fundamentals: Basic SQL syntax, clauses (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT).
    • Data Types and Schemas: Understanding data types and how they influence operations.
    • Logical Order of Operations: The sequence in which SQL operations are executed.
    • Boolean Algebra: Using logical operators (AND, OR, NOT) and truth tables.
    • Set Operations: Combining data using UNION, INTERSECT, EXCEPT.
    • CASE Statements: Conditional logic for data transformation.
    • Subqueries: Nested queries and their correlation.
    • JOIN Operations: Combining tables (INNER, LEFT, RIGHT, FULL OUTER).
    • GROUP BY and Aggregations: Summarizing data using aggregate functions (SUM, AVG, COUNT, MIN, MAX).
    • HAVING Clause: Filtering aggregated data.
    • Window Functions: Performing calculations across rows without changing the table’s structure (OVER, PARTITION BY, ORDER BY, ROWS BETWEEN).
    • Numbering Functions: Ranking and numbering rows (ROW_NUMBER, RANK, DENSE_RANK, NTILE).
    • Date and Time Functions: Extracting and manipulating date and time components.
    • Common Table Expressions (CTEs): Defining temporary result sets for complex queries.

    Quiz

    Answer each question in 2-3 sentences.

    1. Explain the relationship between projects, datasets, and tables in BigQuery.
    2. What is a SQL clause and can you provide three examples?
    3. Why is it important to understand data types when working with SQL?
    4. Describe the logical order of operations in SQL.
    5. Explain the purpose of Boolean algebra in SQL.
    6. Describe the difference between UNION, INTERSECT, and EXCEPT set operators.
    7. What is a CASE statement, and how is it used in SQL?
    8. Explain the difference between correlated and uncorrelated subqueries.
    9. Compare and contrast INNER JOIN, LEFT JOIN, and FULL OUTER JOIN.
    10. Explain the fundamental difference between GROUP BY aggregations and WINDOW functions.

    Quiz Answer Key

    1. BigQuery organizes data hierarchically, with projects acting as top-level containers, datasets serving as folders for tables within a project, and tables storing the actual data in rows and columns. Datasets organize tables, while projects organize datasets, offering a structured way to manage and access data.
    2. A SQL clause is a building block that makes up a complete SQL statement, defining specific actions or conditions. Examples include the SELECT clause to choose columns, the FROM clause to specify the table, and the WHERE clause to filter rows.
    3. Understanding data types is crucial because it dictates the types of operations that can be performed on a column and determines how data is stored and manipulated; it also helps avoid errors and ensures accurate results.
    4. The logical order of operations determines the sequence in which SQL clauses are executed, starting with FROM, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT, impacting the query’s outcome.
    5. Boolean algebra allows for complex filtering and conditional logic within WHERE clauses using AND, OR, and NOT operators to specify precise conditions for row selection based on truth values.
    6. UNION combines the results of two or more queries into a single result set, INTERSECT returns only the rows that are common to all input queries, and EXCEPT returns the rows from the first query that are not present in the second query.
    7. A CASE statement allows for conditional logic within a SQL query, enabling you to define different outputs based on specified conditions, similar to an “if-then-else” structure.
    8. A correlated subquery depends on the outer query, executing once for each row processed, while an uncorrelated subquery is independent and executes only once, providing a constant value to the outer query.
    9. INNER JOIN returns only matching rows from both tables, LEFT JOIN returns all rows from the left table and matching rows from the right, filling in NULL for non-matches, while FULL OUTER JOIN returns all rows from both tables, filling in NULL where there are no matches.
    10. GROUP BY aggregations collapse multiple rows into a single row based on grouped values, while window functions perform calculations across a set of table rows that are related to the current row without collapsing or grouping rows.

    Essay Questions

    1. Discuss the importance of understanding the logical order of operations in SQL when writing complex queries. Provide examples of how misunderstanding this order can lead to unexpected results.
    2. Explain the different types of JOIN operations available in SQL, providing scenarios in which each type would be most appropriate. Illustrate with specific examples related to the course material.
    3. Describe the use of window functions in SQL. Include the purpose of PARTITION BY and ORDER BY. Explain some practical applications of these functions, emphasizing their ability to perform complex calculations without altering the structure of the table.
    4. Discuss the use of Common Table Expressions (CTEs) in SQL. How do they improve the readability and maintainability of complex queries? Provide an example of a query that benefits from the use of CTEs.
    5. Develop a SQL query that uses different levels of aggregation. Explain the query and its purpose.

    Glossary of Key Terms

    • Project (BigQuery): A top-level container for datasets and resources in BigQuery.
    • Dataset (BigQuery): A collection of tables within a BigQuery project, similar to a folder.
    • Table (SQL): A structured collection of data organized in rows and columns.
    • Schema (SQL): The structure of a table, including column names and data types.
    • Clause (SQL): A component of a SQL statement that performs a specific action (e.g., SELECT, FROM, WHERE).
    • Data Type (SQL): The type of data that a column can hold (e.g., INTEGER, VARCHAR, DATE).
    • Logical Order of Operations (SQL): The sequence in which SQL clauses are executed (FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT).
    • Boolean Algebra: A system of logic dealing with true and false values, used in SQL for conditional filtering.
    • Set Operations (SQL): Operations that combine or compare result sets from multiple queries (UNION, INTERSECT, EXCEPT).
    • CASE Statement (SQL): A conditional expression that allows for different outputs based on specified conditions.
    • Subquery (SQL): A query nested inside another query.
    • Correlated Subquery (SQL): A subquery that depends on the outer query for its values.
    • Uncorrelated Subquery (SQL): A subquery that does not depend on the outer query.
    • JOIN (SQL): An operation that combines rows from two or more tables based on a related column.
    • INNER JOIN (SQL): Returns only matching rows from both tables.
    • LEFT JOIN (SQL): Returns all rows from the left table and matching rows from the right table.
    • RIGHT JOIN (SQL): Returns all rows from the right table and matching rows from the left table.
    • FULL OUTER JOIN (SQL): Returns all rows from both tables, matching or not.
    • GROUP BY (SQL): A clause that groups rows with the same values in specified columns.
    • Aggregation (SQL): A function that summarizes data (e.g., SUM, AVG, COUNT, MIN, MAX).
    • HAVING (SQL): A clause that filters aggregated data.
    • Window Function (SQL): A function that performs a calculation across a set of table rows that are related to the current row.
    • OVER (SQL): A clause that specifies the window for a window function.
    • PARTITION BY (SQL): A clause that divides the rows into partitions for window functions.
    • ORDER BY (SQL): A clause that sorts the result set of a query; inside an OVER() clause, it specifies the order of rows within a window.
    • ROWS BETWEEN (SQL): A clause that defines the boundaries of a window.
    • Numbering Functions (SQL): Window functions that assign numbers to rows based on specified criteria (ROW_NUMBER, RANK, DENSE_RANK, NTILE).
    • ROW_NUMBER() (SQL): Assigns a unique sequential integer to each row within a partition.
    • RANK() (SQL): Assigns a rank to each row within a partition based on the order of the rows. Rows with equal values receive the same rank, and the next rank is skipped.
    • DENSE_RANK() (SQL): Similar to RANK(), but assigns consecutive ranks without skipping.
    • NTILE(n) (SQL): Divides the rows within a partition into ‘n’ approximately equal groups, assigning a bucket number to each row.
    • Common Table Expression (CTE): A named temporary result set defined within a SELECT, INSERT, UPDATE, or DELETE statement.

    SQL and BigQuery: A Comprehensive Guide

    Briefing Document: SQL and BigQuery Fundamentals

    Overview:

    This document summarizes key concepts and functionalities of SQL, specifically within the context of BigQuery. The material covers data organization, query structure, data manipulation, and advanced techniques like window functions and common table expressions. The focus is on understanding the logical order of operations within SQL queries and using this understanding to write efficient and effective code.

    1. Data Organization in BigQuery:

    • Tables: Data is stored in tables, which consist of rows and columns, similar to spreadsheets.
    • “Data in BigQuery and in SQL in general exists in the form of tables and a table looks just like this… it is a collection of rows and columns and it is quite similar to a spreadsheet…”
    • Datasets: Tables are organized into datasets, analogous to folders in a file system.
    • “In order to organize our tables we use data sets… a data set is just that it’s a collection of tables and it’s similar to how a folder works in a file system.”
    • Projects: Datasets belong to projects. BigQuery allows querying data from other projects, including public datasets.
    • “In BigQuery each data set belongs to a project… in BigQuery I’m not limited to working with data that lives in my project; I could also, from within my project, query data that lives in another project. For example, the bigquery-public-data project is not mine…”

    2. Basic SQL Query Structure:

    • Statements: A complete SQL instruction, defining data retrieval and processing.
    • “This is a SQL statement it is like a complete sentence in the SQL language. The statement defines where we want to get our data from and how we want to receive these data including any processing that we want to apply to it…”
    • Clauses: Building blocks of SQL statements (e.g., SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT).
    • “The statement is made up of building blocks which we call clauses, and in this statement we have a clause for every line… the clauses that we see here are SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT…”
    • Importance of Data Types: Columns have defined data types, which dictate the operations that can be performed on them. Unlike spreadsheets, SQL tables can also be explicitly connected with each other.
    • “You create a table and when creating that table you define the schema the schema is the list of columns and their names and their data types you then insert data into this table and finally you have a way to define how the tables are connected with each other…”

    3. Key SQL Concepts:

    • Cost Consideration: BigQuery charges based on the amount of data scanned by a query. Monitoring query size is crucial.
    • “This query will process 1 kilobyte when run… this is very important because here BigQuery is telling you how much data will be scanned in order to give you the results of this query… the amount of data that is scanned by the query is the primary determinant of BigQuery costs.”
    • Arithmetic Operations: SQL supports combining columns and constants using arithmetic operators and functions.
    • “We are able to combine columns and constants with any sort of arithmetic operations. Another very powerful thing that SQL can do is to apply functions and a function is a prepackaged piece of logic that you can apply to our data…”
    • Aliases: Using aliases (AS) to rename columns or tables for clarity and brevity.
    • Boolean Algebra in WHERE Clause: The WHERE clause uses Boolean logic (AND, OR, NOT) to filter rows based on conditions. Truth tables help understand operator behavior.
    • “The way that these logical statements work is through something called Boolean algebra which is an essential theory for working with SQL… though the name may sound a bit scary it is really easy to understand the fundamentals of Boolean algebra now…”
    • Set Operators (UNION, INTERSECT, EXCEPT): Combining the results of multiple queries using set operations. UNION combines rows, INTERSECT returns common rows, and EXCEPT returns rows present in the first table but not the second. UNION DISTINCT removes duplicate rows, while UNION ALL keeps them.
    • “The reason this command is called UNION and not ‘stack’ or something else is that this is set terminology… it comes from the mathematical theory of sets, and unioning means combining the values of two sets…”

    4. Advanced SQL Techniques:

    • CASE WHEN Statements: Creating conditional logic to assign values based on specified conditions.
    • “When this condition is true we want to return the value ‘low’, which is a string, a piece of text that says low… all of this that you see here is the CASE clause, or the CASE statement, and all of it is basically defining a new column in my table…”
    • Subqueries: Embedding queries within other queries to perform complex filtering or calculations. Correlated subqueries are slower as they need to be recomputed for each row.
    • “SQL solves this query first, gets the result, and then plugs that result back into the original query to get the data we need… on the right we have something called a correlated subquery, and on the left we define an uncorrelated subquery…”
    • Common Table Expressions (CTEs): Defining temporary named result sets (tables) within a query for modularity and readability.
    • JOIN Operations: Combining data from multiple tables based on related columns. Types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
    • “A full outer join is like an inner join plus a left join plus a right join…”.
    • GROUP BY and Aggregation: Summarizing data by grouping rows based on one or more columns and applying aggregate functions (e.g., SUM, AVG, COUNT, MIN, MAX). The HAVING clause filters aggregated results.
    • “With HAVING you are free to write filters on aggregated values regardless of the columns that you are selecting…”
    • Window Functions: Performing calculations across a set of rows that are related to the current row without altering the table structure. They use the OVER() clause to define the window.
    • “Window functions allow us to do computations and aggregations on multiple rows; in that sense they are similar to what we have seen with aggregations and GROUP BY. The fundamental difference between grouping and window functions is that grouping fundamentally alters the structure of the table…”
    • Numbering Functions (ROW_NUMBER, DENSE_RANK, RANK): Assigning sequential numbers or ranks to rows based on specified criteria.
    • “Numbering functions are functions that we use in order to number the rows in our data according to our needs, and there are several numbering functions, but the three most important ones are without any doubt ROW_NUMBER, DENSE_RANK, and RANK…”

    5. Logical Order of SQL Operations:

    The excerpts emphasize the importance of understanding the order in which SQL operations are performed. This order dictates which operations can “see” the results of previous operations. The general order is:

    1. FROM (Source data)
    2. WHERE (Filter rows)
    3. GROUP BY (Aggregate into groups)
    4. Aggregate Functions (Calculate aggregations within groups)
    5. HAVING (Filter aggregated groups)
    6. Window Functions (Calculate windowed aggregates)
    7. SELECT (Choose columns and apply aliases)
    8. DISTINCT (Remove duplicate rows)
    9. UNION/INTERSECT/EXCEPT (Combine result sets)
    10. ORDER BY (Sort results)
    11. LIMIT (Restrict number of rows)

    6. Postgres SQL Quirk

    Integer Division: When dividing two integers, Postgres assumes that you are doing integer division and returns an integer as well. To avoid this, at least one of the numbers needs to be a floating-point number.
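
    A minimal sketch of the quirk, runnable as-is in Postgres (the numbers are arbitrary):

    ```sql
    -- Both operands are integers, so Postgres truncates the result.
    SELECT 7 / 2;            -- returns 3, not 3.5

    -- Making at least one operand a floating-point (or numeric) value forces real division.
    SELECT 7 / 2.0;          -- returns 3.5
    SELECT 7::numeric / 2;   -- casting one operand works as well
    ```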

    Conclusion:

    The provided text excerpts offer a comprehensive overview of SQL fundamentals and advanced techniques within BigQuery. A strong understanding of data organization, query structure, the logical order of operations, and the various functions and clauses available is crucial for writing efficient and effective SQL code. Mastering these concepts will enable users to extract valuable insights from their data and solve complex analytical problems.

    BigQuery and SQL: Data Management, Queries, and Functions

    FAQ on SQL and Data Management with BigQuery

    1. How is data organized in BigQuery and SQL in general?

    Data in BigQuery is organized in a hierarchical structure. At the lowest level, data resides in tables. Tables are collections of rows and columns, similar to spreadsheets. To organize tables, datasets are used, which are collections of tables, analogous to folders in a file system. Finally, datasets belong to projects, providing a top-level organizational unit. BigQuery also allows querying data from public projects, expanding access beyond a single project.
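
    As an illustration, the two ways of addressing a table might look like the sketch below; `my-project` is a placeholder for your own project ID, and `fantasy.characters` follows the course’s example data:

    ```sql
    -- Fully qualified table ID: project.dataset.table (backticks are needed when
    -- the ID contains dashes or dots). `my-project` is a placeholder project ID.
    SELECT *
    FROM `my-project.fantasy.characters`;

    -- The dataset.table shorthand works when the table lives in the project you are querying from.
    SELECT *
    FROM fantasy.characters;
    ```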

    2. How does BigQuery handle costs and data limits?

    BigQuery’s costs are primarily determined by the amount of data scanned by a query. Within the sandbox program, users can scan up to one terabyte of data each month for free. It’s important to check the amount of data that a query will process before running it, especially with large tables, to avoid unexpected charges. The query interface displays this information before execution.

    3. What are the fundamental differences between SQL tables and spreadsheets?

    While both spreadsheets and SQL tables store data in rows and columns, key differences exist. Spreadsheets are typically disconnected, whereas SQL provides mechanisms to define connections between tables. This allows relating data across multiple tables through defined schemas, specifying column names and data types. SQL also enforces a logical order of operations, which dictates the order in which the various parts of a query are executed.

    4. How are calculations and functions used in SQL queries?

    SQL allows performing calculations using columns and constants. Common arithmetic operations are supported, and functions, pre-packaged logic, can be applied to data. The order of operations in SQL follows standard arithmetic rules: brackets first, then functions, multiplication and division, and finally addition and subtraction.
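
    A small sketch of calculations and functions in a SELECT list, assuming the course’s characters table with name and level columns:

    ```sql
    SELECT
      name,
      level,
      (level + 1) * 10       AS next_level_score, -- brackets, then * and /, then + and -
      UPPER(name)            AS name_uppercase,   -- string function
      ROUND(level / 3.0, 1)  AS level_thirds      -- numeric function
    FROM fantasy.characters;
    ```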

    5. What are Clauses in SQL, and how are they used?

    SQL statements are constructed from building blocks known as clauses. Key clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT. Clauses define where the data comes from, how it should be processed, and how the results should be presented; they are assembled to form a complete SQL statement. The order in which you write the clauses (the lexical order) is fixed, but it differs from the logical order in which they are executed, which is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and LIMIT.
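
    For illustration, here is a statement using every major clause in its required lexical order; the column names are assumptions based on the course’s characters table:

    ```sql
    SELECT class, COUNT(*) AS n_characters   -- written first, executed after HAVING
    FROM fantasy.characters                  -- executed first
    WHERE level >= 5                         -- filters rows
    GROUP BY class                           -- aggregates rows into groups
    HAVING COUNT(*) > 1                      -- filters the aggregated groups
    ORDER BY n_characters DESC               -- sorts the final rows
    LIMIT 10;                                -- restricts the output
    ```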

    6. How do the WHERE clause and Boolean algebra work together to filter data in SQL?

    The WHERE clause is used to filter rows based on logical conditions. These conditions rely on Boolean algebra, which uses operators like NOT, AND, and OR to create complex expressions. Understanding the order of operations within Boolean algebra is crucial for writing effective WHERE clauses. NOT is evaluated first, then AND, and finally OR.
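
    A minimal sketch of Boolean operators in a WHERE clause, assuming the course’s characters table; the parentheses matter because AND would otherwise be evaluated before OR:

    ```sql
    SELECT name, class, level
    FROM fantasy.characters
    WHERE (class = 'Warrior' OR class = 'Mage')  -- brackets are evaluated first
      AND NOT level < 5;                         -- NOT binds tighter than AND and OR
    ```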

    7. What are set operations in SQL, and how are they used?

    SQL provides set operations like UNION, INTERSECT, and EXCEPT to combine or compare the results of multiple queries. UNION combines rows from two or more tables, with UNION DISTINCT removing duplicate rows and UNION ALL keeping all rows, including duplicates. INTERSECT DISTINCT returns only the rows that are common to both tables. EXCEPT DISTINCT returns rows from the first table that are not present in the second table.
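
    A sketch of the three set operators in BigQuery syntax, where ALL or DISTINCT must be stated explicitly; both inputs must have the same number of columns with compatible types, and the filters shown are just examples:

    ```sql
    -- All warrior names plus all high-level names, duplicates removed:
    SELECT name FROM fantasy.characters WHERE class = 'Warrior'
    UNION DISTINCT
    SELECT name FROM fantasy.characters WHERE level >= 5;

    -- Names returned by both queries:
    SELECT name FROM fantasy.characters WHERE class = 'Warrior'
    INTERSECT DISTINCT
    SELECT name FROM fantasy.characters WHERE level >= 5;

    -- Warrior names that are not also high-level names:
    SELECT name FROM fantasy.characters WHERE class = 'Warrior'
    EXCEPT DISTINCT
    SELECT name FROM fantasy.characters WHERE level >= 5;
    ```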

    8. How can window functions be used to perform calculations across rows without altering the structure of the table?

    Window functions perform calculations across a set of table rows related to the current row, without grouping the rows like GROUP BY. They are defined using the OVER() clause, which specifies the window of rows used for the calculation. Window functions can perform aggregations, ordering, and numbering within the defined window, adding insights without collapsing the table’s structure. Numbering functions include ROW_NUMBER, RANK, and DENSE_RANK; they are often used with PARTITION BY and ORDER BY, which divide the data into logical partitions within which the rows are numbered. A ranking function used with PARTITION BY and ORDER BY can, for instance, rank the results of each race from fastest to slowest, and the ranked rows can then be filtered with the help of a CTE (Common Table Expression).
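
    A sketch of that pattern, assuming the course’s characters table: rank the characters within each class by level, then keep only the top one per class. The filter has to live one step after the window calculation, which is what the CTE provides:

    ```sql
    WITH ranked AS (
      SELECT
        name,
        class,
        level,
        ROW_NUMBER() OVER (PARTITION BY class ORDER BY level DESC) AS rank_in_class
      FROM fantasy.characters
    )
    SELECT name, class, level
    FROM ranked
    WHERE rank_in_class = 1;   -- the highest-level character of each class
    ```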

    SQL Data Types and Schemas

    In SQL, a data model is defined by the names of the columns and the data type that each column holds.

    • Definition: The schema of a table includes the name of each column in the table and the data type of each column. The data type of a column defines the type of operations that can be done to the column.
    • Examples of data types:
    • Integer: A whole number.
    • Float: A floating point number.
    • String: A piece of text.
    • Boolean: A value that is either true or false.
    • Timestamp: A value that represents a specific point in time.
    • Interval: A data type that specifies a certain span of time.
    • Data types and operations: Knowing the data types of columns is important because it allows you to know which operations can be applied. For example, you can perform mathematical operations such as multiplication or division on integers or floats. For strings, you can change the string to uppercase or lowercase. For timestamps, you can subtract a certain amount of time from that moment.
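
    As a sketch of type-appropriate operations, assuming the course’s characters table with an integer level column and a string name column (the created_at timestamp column is hypothetical):

    ```sql
    SELECT
      level * 2                                  AS double_level,    -- arithmetic on an integer
      LOWER(name)                                AS name_lowercase,  -- string manipulation
      TIMESTAMP_SUB(created_at, INTERVAL 7 DAY)  AS one_week_before  -- subtracting a span of time
    FROM fantasy.characters;
    ```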

    SQL Tables: Structure, Schema, and Operations

    In SQL, data exists in the form of tables. Here’s what you need to know about SQL tables:

    • Structure: A table is a collection of rows and columns, similar to a spreadsheet.
    • Each row represents an entry, and each column represents an attribute of that entry. For example, in a table of fantasy characters, each row may represent a character, and each column may represent information about them such as their ID, name, class, or level.
    • Schema: Each SQL table has a schema that defines the columns of the table and the data type of each column.
    • The schema is assumed as a given when working in SQL and is assumed not to change over time.
    • Organization: In SQL, tables are organized into data sets.
    • A data set is a collection of tables and is similar to a folder in a file system.
    • In BigQuery, each data set belongs to a project.
    • Table ID: The table ID represents the full address of the table.
    • The address is made up of three components: the ID of the project, the data set that contains the table, and the name of the table.
    • Connections between tables: SQL allows you to define connections between tables.
    • Tables can be connected with each other (often drawn as arrows in a schema diagram). These connections indicate that one table contains a column with the same data as a column in another table, and that the tables can be joined on those columns to combine data (see the join sketch after this list).
    • Table operations and clauses:
    • FROM: indicates the table from which to retrieve data.
    • SELECT: specifies the columns to retrieve from the table.
    • WHERE: filters rows based on specified conditions.
    • DISTINCT: removes duplicate rows from the result set.
    • UNION: stacks the results from multiple tables.
    • ORDER BY: sorts the result set based on specified columns.
    • LIMIT: limits the number of rows returned by the query.
    • JOIN: combines rows from two or more tables based on a related column.
    • GROUP BY: groups rows with the same values in specified columns into summary rows.
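
    A sketch of joining two of the course tables through a shared key; the inventory table and its character_id and quantity columns are assumptions about the course data, so substitute the names shown in your own schema tab:

    ```sql
    SELECT
      c.name,
      i.item_id,
      i.quantity
    FROM fantasy.characters AS c
    INNER JOIN fantasy.inventory AS i
      ON c.id = i.character_id;   -- the shared column that connects the two tables
    ```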

    SQL Statements: Structure, Clauses, and Operations

    Here’s what the sources say about SQL statements:

    General Information

    • In SQL, a statement is like a complete sentence that defines where to get data and how to receive it, including any processing to apply.
    • A statement is made up of building blocks called clauses.
    • Query statements allow for retrieving, analyzing, and transforming data.
    • In this course, the focus is exclusively on query statements.

    Components and Structure

    • Clauses are assembled to build statements.
    • There is a specific order to writing clauses; writing them in the wrong order will result in an error.
    • Common clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.

    Order of Execution

    • The order in which clauses are written (lexical order) is not the same as the order in which they are executed (logical order).
    • The logical order of execution is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT.
    • The actual order of execution (effective order) may differ from the logical order due to optimizations made by the SQL engine. The course focuses on mastering the lexical order and the logical order.

    Clauses and their Function

    • FROM: Specifies the table from which to retrieve the data. It is always the first component in the logical order of operations because you need to source the data before you can work with it.
    • SELECT: Specifies which columns of the table to retrieve. It allows you to get any columns from the table in any order. You can also use it to rename columns, define constant columns, combine columns in calculations, and apply functions.
    • WHERE: Filters rows based on specified conditions. It follows right after the FROM clause in the logical order. The WHERE clause can reference columns of the tables, operations on columns, and combinations between columns.
    • DISTINCT: removes duplicate rows from the result set.

    Combining statements

    • UNION allows you to stack the results from two or more tables. In BigQuery, you must specify UNION ALL to include duplicate rows or UNION DISTINCT to only include unique rows.
    • INTERSECT returns only the rows that are shared between two tables.
    • EXCEPT returns all of the elements in one table except those that are shared with another table.
    • For UNION, INTERSECT, and EXCEPT, the tables must have the same number of columns, and the columns must have the same data types.

    Subqueries

    • Subqueries are nested queries used to perform complex tasks that cannot be done with a single query.
    • A subquery is a piece of SQL logic that returns a table.
    • Subqueries can be used in the FROM clause instead of a table name.
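
    A minimal sketch of a subquery in the FROM clause, assuming the course’s characters table: the inner query behaves like a temporary table that the outer query reads from:

    ```sql
    SELECT class, avg_level
    FROM (
      SELECT class, AVG(level) AS avg_level
      FROM fantasy.characters
      GROUP BY class
    ) AS class_stats
    WHERE avg_level > 5;   -- filters on the value computed by the subquery
    ```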

    Common Table Expressions (CTEs)

    • CTEs are virtual tables defined within a query that can be used to simplify complex queries and improve readability.
    • CTEs are defined using the WITH keyword, followed by the name of the table and the query that defines it.
    • CTEs can be used to build data pipelines within SQL code.
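
    A sketch of a small pipeline built from chained CTEs, assuming the course’s characters table: each CTE reads from the previous one, and the final SELECT reads from the last:

    ```sql
    WITH class_levels AS (               -- step 1: average level per class
      SELECT class, AVG(level) AS avg_level
      FROM fantasy.characters
      GROUP BY class
    ),
    strong_classes AS (                  -- step 2: keep only the strong classes
      SELECT class
      FROM class_levels
      WHERE avg_level > 5
    )
    SELECT c.name, c.class               -- step 3: list characters in those classes
    FROM fantasy.characters AS c
    INNER JOIN strong_classes AS s
      ON c.class = s.class;
    ```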

    SQL Logical Order of Operations

    Here’s what the sources say about the logical order of operations in SQL:

    Basics

    • The order in which clauses are written (lexical order) is not the order in which they are executed (logical order).
    • Understanding the logical order is crucial for accelerating learning SQL.
    • The logical order helps in building a powerful mental model of SQL that allows tackling complex and tricky problems.

    The Logical Order

    • The logical order of execution is: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT.
    • The JOIN clause is not really separate from the FROM clause; they are the same component in the logical order of operations.

    Rules for Understanding the Schema

    • Operations are executed sequentially from left to right.
    • Each operation can only use data that was produced by operations that came before it.
    • Each operation cannot know anything about data that is produced by operations that follow it.

    Implications of the Logical Order

    • FROM is the very first component in the logical order of operations because the data must be sourced before it can be processed. The FROM clause specifies the table from which to retrieve the data. The JOIN clause is part of this step, as it defines how tables are combined to form the data source.
    • WHERE Clause follows right after the FROM Clause. After sourcing the data, the next logical step is to filter the rows that are not needed. The WHERE clause drops all the rows that are not needed, so the table becomes smaller and easier to deal with.
    • GROUP BY fundamentally alters the structure of the table. The GROUP BY operation compresses down the values; in the grouping field, a single row will appear for each distinct value, and in the aggregate field, the values will be compressed or squished down to a single value as well.
    • SELECT determines which columns to retrieve from the table. The SELECT clause is where new columns are defined.
    • ORDER BY sorts the result of the query. Because the ordering occurs so late in the process, SQL knows the final list of rows that will be included in the results, which is the right moment to order those rows.
    • LIMIT is the very last operation. After all the logic of the query is executed and all data is computed, the LIMIT clause restricts the number of rows that are output.

    Window Functions and the Logical Order

    • Window functions operate on the result of the GROUP BY clause, if present; otherwise, they operate on the data after the WHERE filter is applied.
    • After applying the window function, the SELECT clause is used to choose which columns to show and to label them.

    Common Errors

    • A common error is to try to use LIMIT to make a query cheaper. The LIMIT clause does not reduce the amount of data that is scanned; it only limits the number of rows that are returned.
    • Another common error is to violate the logical order of operations. For example, you cannot use a column alias defined in the SELECT clause in the WHERE clause because the WHERE clause is executed before the SELECT clause.
    • In Postgres, you cannot use the labels that you assign to aggregations in the HAVING clause.
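
    A sketch of the alias error described above, assuming the course’s characters table: WHERE runs before SELECT, so the alias does not exist yet when the filter is evaluated:

    ```sql
    -- This fails, because `score` is defined by SELECT, which runs after WHERE:
    --   SELECT level * 10 AS score
    --   FROM fantasy.characters
    --   WHERE score > 50;

    -- Repeating the expression (or moving the query into a CTE) works:
    SELECT level * 10 AS score
    FROM fantasy.characters
    WHERE level * 10 > 50;
    ```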

    Boolean Algebra: Concepts, Operators, and SQL Application

    Here’s what the sources say about Boolean algebra:

    Basics

    • Boolean algebra is essential for working with SQL and other programming languages.
    • It is fundamental to how computers work.
    • Despite its intimidating-sounding name, its fundamentals are easy to understand.

    Elements

    • In Boolean algebra, there are only two elements: true and false.
    • A Boolean field in SQL is a column that can only have these two values.

    Operators

    • Boolean algebra has operators that transform elements.
    • The three most important operators are NOT, AND, and OR.

    Operations and Truth Tables

    • In Boolean algebra, operations combine operators and elements and return elements.
    • To understand how a Boolean operator works, you have to look at its truth table.

    NOT Operator

    • The NOT operator works on a single element, such as NOT TRUE or NOT FALSE.
    • The negation of an element is its opposite value.
    • NOT TRUE is FALSE
    • NOT FALSE is TRUE

    AND Operator

    • The AND operator connects two elements, such as TRUE AND FALSE.
    • If both elements are true, then the AND operator will return true; otherwise, it returns false.

    OR Operator

    • The OR operator combines two elements.
    • If at least one of the two elements is true, then the OR operator returns true; only if both elements are false does it return false.

    Order of Operations

    • There is an agreed-upon order of operations that helps solve complex expressions.
    • The order of operations is:
    1. Brackets (solve the innermost brackets first)
    2. NOT
    3. AND
    4. OR

    Application in SQL

    • A complex logical statement that is plugged into the WHERE filter isolates only certain rows.
    • SQL converts statements in the WHERE filter to true or false, using values from a row.
    • SQL uses Boolean algebra rules to compute a final result, which is either true or false.
    • If the result computes as true for the row, then the row is kept; otherwise, the row is discarded.

    Example

    To solve a complex expression, such as NOT (TRUE OR FALSE) AND (FALSE OR TRUE), proceed step by step:

    1. Solve the innermost brackets:
    • TRUE OR FALSE is TRUE
    • FALSE OR TRUE is TRUE
    2. The expression becomes: NOT (TRUE) AND (TRUE)
    3. Solve the NOT:
    • NOT (TRUE) is FALSE
    4. The expression becomes: FALSE AND TRUE
    5. Solve the AND:
    • FALSE AND TRUE is FALSE
    6. The final result is FALSE

    Intuitive SQL For Data Analytics – Tutorial
    Data Analytics FULL Course for Beginners to Pro in 29 HOURS – 2025 Edition

    The Original Text

    learn SQL for analytics Vlad is a data engineer and in this course he covers both the theory and the practice so you can confidently solve hard SQL challenges on your own no previous experience required and you’ll do everything in your browser using big query hi everyone my name is Vlad and I’m a date engineer welcome to intuitive SQL for analytics this here is the main web page for the course you will find it in the video description and this will get updated over time with links and resources so be sure to bookmark it now the goal of this course is to quickly enable you to use SQL to analyze and manipulate data this is arguably the most important use case for SQL and the Practical objective is that by the end of this course you should be able to confidently solve hard SQL problems of the kind that are suggested during data interviews the course assumes no previous knowledge of SQL or programming although it will be helpful if you’ve work with spreadsheets such as Microsoft Excel or Google Sheets because there’s a lot of analogies between manipulating data in spreadsheets and doing it in SQL and I also like to use spreadsheets to explain SQL Concepts now there are two parts to this course theory and practice the theory part is a series of short and sweet explainers about the fundamental concepts in SQL and for this part we will use Google bigquery bigquery which you can see here is a Google service that allows you to upload your own data and run SQL on top of it so in the course I will teach you how to do that and how to do it for free you won’t have to to spend anything and then we will load our data and we will run SQL code and besides this there will be drawings and we will also be working with spreadsheets and anything it takes to make the SQL Concepts as simple and understandable as possible the practice part involves doing SQL exercises and for this purpose I recommend this website postest SQL exercises this is a free and open-source website where you will find plenty of exercises and you will be able to run SQL code to solve these exercises check your answer and then see a suggested way to do it so I will encourage you to go here and attempt to solve these exercises on your own however I have also solved 42 of these exercises the most important ones and I have filmed explainers where I solve the exercise break it apart and then connect it to the concepts of the course so after you’ve attempted the exercise you will be able to see me solving it and connect it to the rest of the course so how should you take this course there are actually many ways to do it and you’re free to choose the one that works best if you are a total beginner I recommend doing the following you should watch the theory lectures and try to understand everything and then once you are ready you should attempt to do the exercises on your own on the exercise uh website that I’ve shown you here and if you get stuck or after you’re done you can Watch How I solved the exercise but like I said this is just a suggestion and uh you can combine theory and practice as you wish and for example a more aggressive way of doing this course would be to jump straight into the exercises and try to do them and every time that you are stuck you can actually go to my video and see how I solved the exercise and then if you struggle to understand the solution that means that maybe there’s a theoretical Gap and then you can go to the theory and see how the fundamental concepts work so feel free to experiment and find the way that 
works best for you now let us take a quick look at the syllabus for the course so one uh getting started this is a super short explainer on what SQL actually is and then I teach you how to set up bigquery the Google service where we will load our data and run SQL for the theory part the second uh chapter writing your first query so here I explained to you how big query works and how you can use it um and how you are able to take your own data and load it in big query so you can run SQL on top of it and at the end of it we finally run our first SQL query chapter 3 is about exploring some ESS IAL SQL Concepts so this is a short explainer of how data is organized in SQL how the SQL statement Works meaning how we write code in SQL and here is actually the most important concept of the whole course the order of SQL operations this is something that is not usually taught properly and a lot of beginners Miss and this causes a lot of trouble when you’re you’re trying to work with SQL so once you learn this from the start you will be empowered to progress much faster in your SQL knowledge and then finally we get into the meat of the course this is where we learn all the different components in SQL how they work and how to combine them together so this happens in a few phases in the first phase we look at the basic components of SQL so these are uh there’s a few of them uh there’s select and from uh there’s learning how to transform columns the wear filter the distinct Union order by limit and then finally we see how to do simple aggregations at the end of this part you will be empowered to do the first batch of exercises um don’t worry about the fact that there’s no links yet I will I will add them but this is basically involves going to this post SQL exercises website and going here and doing this uh first batch of exercises and like I said before after you’ve done the exercises you can watch the video of me also solving them and breaking them down next we take a look at complex queries and this involves learning about subqueries and Common Table expressions and then we look at joining tables so here is where we understand how SQL tables are connected uh with each other and how we can use different types of joints to bring them together and then you are ready for the second batch of exercises which are those that involve joints and subqueries and here there are eight exercises the next step is learning about aggregations in SQL so this involves the group bu the having and window functions and then finally you are ready for the final batch of exercises which actually bring together all the concepts that we’ve learned in this course and these are 22 exercises and like before for each exercise you have a video for me solving it and breaking it apart and then finally we have the conclusion in the conclusion we see how we can put all of this knowledge together and then we take a look at how to use this knowledge to actually go out there and solve SQL challenges such as the ones that are done in data interviews and then here you’ll find uh all the resources that are connected to the course so you have the files with our data you have the link to the spreadsheet that we will use the exercises and all the drawings that we will do this will definitely evolve over over time as the course evolves so bookmark this page and keep an eye on it that was that was all you needed to know to get started so I will see you in the course if you are working with SQL or you are planning to work with SQL you’re certainly a 
great company in the 2023 developer survey by stack Overflow there is a ranking of the most popular Technologies out there if we look at professional developers where we have almost 70,000 responses we can see that SQL is ranked as the third most popular technology SQL is certainly one of the most in demand skills out there not just for developers but for anyone who works with data in any capacity and in this course I’m going to help you learn SQL the way I wish I would have learned it when I started out on my journey since this is a practical course we won’t go too deep into the theory all you need to know for our purposes is that SQL is a language for working with data like most languages SQL has several dialects you may have heard of post SQL or my sqil for example you don’t need to worry about these dialects because they’re all very similar so if you learn SQL in any one of the dialects you’ll do well on all the others in this course we will be working with B query and thus we will write SQL in the Google SQL dialect here is the documentation for Google big query the service that we will use to write SQL code in this course you can see that big query uses Google SQL a dialect of SQL which is an compliant an compliant means that Google SQL respects the generally recognized standard for creating SQL dialects and so it is highly compatible with with all other common SQL dialects as you can read here Google SQL supports many types of statements and statements are the building blocks that we use in order to get work done with SQL and there are several types of statements listed here for example query statements allow us to retrieve and analyze and transform data data definition language statements allow us to create and modify database objects such as tables and Views whereas data manipulation language statements allows us to update and insert and delete data from our tables now in this course we focus exclusively on query statements statements that allow us to retrieve and process data and the reason for this is that if you’re going to start working with big query you will most likely start working with this family of statements furthermore query statements are in a sense the foundation for all other families of statements so if you understand uh query statements you’ll have no trouble learning the others on your own why did I pick big query for this course I believe that the best way to learn is to load your own data and follow questions that interest you and play around with your own projects and P query is a great tool to do just that first of all it is free at least for the purposes of learning and for the purposes of this course it has a great interface that will give you U really good insights into your data and most importantly it is really easy to get started you don’t have to install anything on your computer you don’t have to deal with complex software you just sign up for Google cloud and you’re ready to go and finally as you will see next big query gives you many ways to load your own data easily and quickly and get started writing SQL right away I will now show you how you can sign up for Google cloud and get started with bigquery so it all starts with this link which I will share in the resources and this is the homepage of Google cloud and if you don’t have an account with Google Cloud you can go here and select sign in and here you need to sign in with your Google account which you probably have but if you don’t you can go here and select create account so I have now signed 
in with my Google account which you can see here in the upper right corner and now I get a button that says start free so I’m going to click that and now I get taken to this page and on the right you see that the first time you sign up for Google Cloud you get $300 of free credits so that you can try the services and that’s pretty neat and here I have to enter some extra information about myself so I will keep it as is and agree to the terms of service and continue finally I need to do the payment information verification so unfortunately this is something I need to do even though I’m not going to be charged for the services and this is for Google to be able to verify my my identity so I will pick individual as account type and insert my address and finally I need to add a payment method and again uh I need to do this even though I’m not going to pay I will actually not do it here because I don’t intend to sign up but after you are done you can click Start my free trial and then you should be good to go now your interface may look a bit different but essentially after you’ve signed up for Google Cloud you will need to create a project and the project is a tool that organizes all your work in Google cloud and essentially every work that you do in Google cloud has to happen inside a specific project now as you can see here there is a limited quota of projects but that’s not an issue because we will only need one project to work in this course and of course creating a new project is totally free so I will go ahead and give it a name and I don’t need any organization and I will simply click on create once that’s done I can go back back to the homepage for Google cloud and here as you can see I can select a project and here I find the project that I have created before and once I select it the rest of the page won’t change but you will see the name of the project in the upper bar here now although I’ve created this project as an example for you for the rest of the course you will see me working within this other project which was the one that I had originally now I will show you how you can avoid paying for Google cloud services if you don’t want to so from the homepage you have the search bar over here and you can go here and write billing and click payment overview to go to the billing service now here on the left you will see your billing account account which could be called like this or have another name and clicking here I can go to manage billing accounts now here I can go to my projects Tab and I see a list of all of my projects in Google cloud and a project might or might not be connected to a billing account if a project is not connected to a billing account then then Google won’t be able to charge you for this project although keep in mind that if you link your project with a billing account and then you incur some expenses if you then remove the billing account you will still owe Google Cloud for those uh expenses so what I can do here is go to my projects and on actions I can select disabled building in case I have a billing account connected now while this is probably the shest way to avoid incurring any charges you will see that you will be severely limited in what you can do in your project if that project is not linked to any billing account however you should still be able to do most of what you need to do in B query at least for this course and we can get more insight into how that works by by going to the big query pricing table so this page gives us an overview of how 
pricing works for big query I will not analyze this in depth but what you need to know is that when you work with bigquery you can fundamentally be charged for two things one is compute pricing and this basically means all the data that bigquery scans in order to return the results that you need when you write your query and then you have storage pricing which is the what you pay in order to store your data inside bigquery now if I click on compute pricing I will go to the pricing table and here you can select the region that uh most reflects where you are located and I have selected Europe here and as you can see you are charged $625 at the time of this video for scanning a terabyte of data however the first terabyte per month is free so every month you can write queries that scan one terabyte of data and not pay for them and as you will see more in detail this is more than enough for what we will be doing in this course and also for for what you’ll be doing on your own in order to experiment with SQL and if I go back to the top of the page and then click on storage pricing you can see here that again you can select your region and see um several pricing uh units but here you can see that the first 10 gab of storage per month is free so you can put up to 10 gigabytes of data in B query and you won’t need a billing account you won’t pay for storage and this is more than enough for our needs in order to learn SQL in short bigquery gives us a pretty generous free allowance for us to load data and play with it and we should be fine however I do urge you to come back to this page and read it again because things may have changed since I recorded this video video to summarize go to the billing service check out your billing account and you have the option to decouple your project from the billing account to avoid incurring any charges and you should still be able to use B query but as a disclaimer I cannot guarantee that things will work just the same uh at the time that you are watching this video so be sure to check the documentation or maybe discuss with Google Cloud support to um avoid incurring any unexpected expenses please do your research and be careful in your usage of these services for this course I have created an imaginary data set with the help of chat GPT the data set is about a group of fantasy characters as well as their items and inventories I then proceed proed to load this data into bigquery which is our SQL system I also loaded it into Google Sheets which is a spreadsheet system similar to Microsoft Excel this will allow me to manipulate the data visually and help you develop a strong intuition about SQL operations I’m going to link a separate video which explains how you can also use chat PT to generate imaginary data according to your needs and then load this data in Google Sheets or bigquery I will also link the files for this data in the description which you can use to reproduce this data on your side next I will show you how we can load the data for this course into bigquery so I’m on the homepage of Google cloud and I have a search bar up here and I can write big query and select it from here and this will take me to the big query page now there is a panel on the left side that appears here if I hover or it could be fixed and this is showing you several tools that you can use within bigquery and you can see that we are in the SQL workspace and this is actually the only tool that we will need for this course so if you if you’re seeing this panel on the left I 
recommend going to this arrow in the upper left corner and clicking it so you can disable it and make more room for yourself now I want to draw your attention to the Explorer tab which shows us where our data is and how it is organized so I’m going to expand it here now data in bigquery and in SQL in general exists in the form of tables and a table looks just like this as you can see here the customer’s table it is a collection of rows and columns and it is quite similar to a spreadsheet so this will be familiar to you if you’ve ever worked with Microsoft Excel or Google Sheets or any spreadsheet program so your data is actually living in a table and you could have as many tables as you need in B query there could be quite a lot of them so in order to organize our tables we use data sets for example in this case my data is a data set which contains the table customers and employee data and a data set is is just that it’s a collection of tables and it’s similar to how a folder Works in a file setem system it is like a for folder for tables finally in bigquery each data set belongs to a project so you can see here that we have two data sets SQL course and my data and they both belong to this project idelic physics and so on and this is actually the ID of my project this is the ID of the project that I’m working in right now the reason the Explorer tab shows the project as well is that in big query I’m not limited to working with data that leaves in my project I could also from within my project query data that leaves in another project for example the bigquery public data is a project that is not mine but it’s actually a public project by bigquery and if I expand this you will see that it contains a collection of of several data sets which are in themselves um collections of tables and I would be able to query these uh tables as well but you don’t need to worry about that now because in this course we will only focus on our own data that lives in our own project so this in short is how data is organized in big query now for the purpose of this course I recommend creating a new data set so so that our tables can be neatly organized and to do that I can click the three dots next to the project uh ID over here and select create data set and here I need to pick a name for the data set so I will call this fantasy and I suggest you use the same name because if you do then the code that I share with you will work immediately then as for the location you can select the multi region and choose the region that is closest to you and finally click on create data set so now the data set fantasy has been created and if I try to expand it here I will see that it is empty because I haven’t loaded any data yet the next step is to load our tables so I assume that you have downloaded the zip file with the tables and extracted it on your local computer and then we can select the action point here next to the fantasy data set and select create table now as a source I will select upload and here I will click on browse and access the files that I have downloaded and I will select the first table here here which is the characters table the file format is CSV so Google has already understood that and scrolling down here I need to choose a name for my table so I will call it just like the file uh which is characters and very important under schema I need to select autodetect and we will see what this means in a bit but basically this is all we need so now I will select create table and now you will see that the 
characters table has appeared under the fantasy data set and if I click on the table and then go on preview I will should be able to see my data I will now do the same for the other two tables so again create table source is upload file is inventory repeat the name and select autod detect and I have done the same with the third table so at the end of this exercise the fantasy data set should have three tables and you can select them and go on preview to make sure that the data looks as expected now our data is fully loaded and we are ready to start querying it within big query now let’s take a look at how the bigquery interface works so on the left here you can see the Explorer which shows all the data that I have access to and so to get a table in big query first of all you open the name of the project and then you look at the data sets that are available within this project you open a data set and finally you see a table such as characters and if I click now on characters I will open the table view now in the table view I will find a lot of important information about my table in these tabs over here so let’s look at the first tab schema the schema tab shows me the structure of my table which as we shall see is very important and the schema is defined essentially by two things the name of each column in my table and the data type of each column so here we see that the characters table contains a few columns such as ID name Guild class and so on and these columns have different data types for example ID is an integer which means that it contains natural numbers whereas name is string which means that it contains text and as we shall see the schema is very important because it defines what you can do with the table and next we have the details tab which contains a few things first of all is the table ID and this ID represents the full address of the table and this address is made up of three components first of all you have the ID of the project which is as you can see the project in which I’m working and it’s the same that you see here on the left in the Explorer tab the next component is the data set that contains the table and again you see it in the Explorer Tab and finally you have the name of the table this address is important because it’s what we use to reference the table and it’s what we use to get data from this table and then we see a few more things about the table such as when it was created when it was last modified and here we can see the storage information so we can see here that this table has 15 rows and on the dis it occupies approximately one kilobyte if you work extensively with P query this information will be important for two reasons number one it defines how much you are paying every month to store this table and number two it defines how much you would pay for a query that scans all the data in this table and as we have seen in the lecture on bigquery pricing these are the two determinants of bigquery costs however for the purpose of this course you don’t need to worry about this because the tables we are working with are so small that they won’t put a dent in your free month monthly allowance for using big query next we have the preview tab which is really cool to get a sense of the data and this basically shows you a graphical representation of your table and as you will notice it looks very similar to a spreadsheet so you can see our columns the same ones that we saw in the schema tab ID name Guild and so on and as you remember we saw that ID is an integer 
column so you can only contain numbers name is a text column and then you see that this table has 15 rows and because it’s such a small table all of it fits into this graphical representation but in the real world you may have tables with millions of rows and in this case the preview will show you only a small portion of that table table but still enough to get a good sense of the data now there are a few more tabs in the table view we have lineage data profile data quality but I’m not going to look at them now because they are like Advanced features in bigquery and you won’t need them in this course instead I will run a very basic query on this table and this is not for the purpose of understanding query that will come soon it is for the purpose of showing you what the interface looks like after you run a query so I have a very basic query here that will run on my table and you can see that the interface is telling me how much data this query will process and this is important because this is the main determinant of cost in bigquery every query scans a certain amount of data and you have to pay for that but as we saw in the lecture of bigquery pricing this table is so small that you could run a million or more of these queries and not exhaust your monthly allowance so if you see 1 kilobyte you don’t have to worry about that so now I will click run and my query will execute and here I get the query results view this is the view that that appears after you have successfully run a query so we have a few tabs here and the first step that you see is results and this shows you graphically the table that was returned by your query so as we shall see every query in SQL runs on a table and returns a table and just like the preview tab showed you a graphical view of your table the results tab shows you a graphical view of the table that your query has returned and this is really the only tab in the query results view that you will need on this course the other ones show different features or more advanced features that we won’t look at but feel free to explore them on your own if you are curious but what’s also important in this view is this button over here save results which you can use to EXP report the result of your query towards several different destinations such as Google drive or local files on your computer in different formats or another big query table a spreadsheet in Google Sheets or even copying them to your clipboard so that you can paste them somewhere else but we shall discuss this more in detail in the lecture on getting data in and out of big query finally if you click on this little keyboard icon up here you can see a list of shortcuts that you can use in the big query interface and if you end up running a lot of queries and you want to be fast this is a nice way to improve your experience with big query so be sure to check these out we are finally ready to write our first query and in the process we will keep exploring the Fantastic bigquery interface so one way to get started would be to click this plus symbol over here so that we can open a new tab now to write the query the first thing I will do is to tell big query where the data that I want leaves and to do that I will use the from Clause so I will simply write from and my data lives in the fantasy data set and in the characters table next I will tell SQL what data I actually want from this table and the simplest thing to ask for is to get all the data and I can do this by writing select star now my query is ready and I 
can either click run up here or I can press command enter on my Mac keyboard and the query will run and here I get a new tab which shows me the results now the results here are displayed as a table just as uh we saw in the preview tab of the table and I can get an idea of uh my results and this is actually the whole table because this is what I asked for in the query there are also other ways to see the results which are provided by bigquery such as Json which shows the same data but in a different format but we’re not going to be looking into that for this course one cool option that the interface provides is if I click on this Arrow right here in my tab I can select split tab to right and now I have a bit of less room in my interface but I am seeing the table on the left and the query on the right so that I can look at the structure of the table while writing my query for example if I click on schema here I could see which columns I’m able to um reference in my query and that can be pretty handy I could also click this toggle to close the Explorer tab temporarily if I don’t need to look look at those tables so I can make a bit more room or I can reactivate it when needed I will now close this tab over here go back to the characters table and show you another way that I can write a query which is to use this query command over here so if I click here I can select whether I want my query in a new tab or in a split tab let let me say in new tab and now bigquery has helpfully uh written a temp template for a query that I can easily modify in order to get my data and to break down this template as you can see we have the select Clause that we used before we have the from clause and then we have a new one called limit now the from Clause is doing the same job as before it is telling query where we want to get our data but you will notice that the address looks a bit different from the one that I had used specifically I used the address fantasy. characters so what’s happening here is that fantasy. 
characters is a useful shorthand for the actual address of the table and what we see here that big query provided is the actual full address of the table or in other words it is the table ID and as you remember the table ID indicates the project ID the data set name and the table name and importantly this ID is usually enclosed by back ticks which are a quite specific character long story short if you want to be 100% sure you can use the full address of the table and bigquery will provide it for you but if you are working within the same project where the data lives so you don’t need to reference the project you can also use this shorthand here to make writing the address easier and in this course I will use these two ways to reference a table interchangeably I will now keep the address that bigquery provided now the limit statement as we will see is simply limiting the number of rows that will be returned by this query no more than 1,000 rows will be returned and next to the select we have to say what data we want to get from this table and like before I can write star and now my query will be complete before we run our query I want to draw your attention to this message over here this query will process 1 kilobyte when run so this is very important because here big query is telling you how much data will be scanned in order to give you the results of this query in this case we are returning um all the data in the table therefore all of the table will be scanned and actually limit does not have any influence on that it doesn’t reduce how much data is scanned so this query will scan 1 kilobyte of data and the amount of data that scanned by the query is the primary determinant of bigquery costs now as you remember we are able to scan up to one terabyte of data each month within the sandbox program and if we wanted to scan more data then we would have to pay so the question is how many of these queries could we run before running out of our free allowance well to answer that we could check how many kilobytes are in a terabyte and if you Google this the conversion says it’s one to um multipli by 10 to the power of 9 which ends up being 1 billion therefore we can run 1 billion of these queries each month before running out of our allowance now you understand why I’ve told you that as long as you work with small tables you won’t really run out of your allowance and you don’t really have to worry about costs however here’s an example of a query that will scan a large amount of data and what I’ve done here is I’ve taken one of the public tables provided by big query which I’ve seen to be quite large and I have told big query to get me all the data for this table and as you can see here big query says that 120 gabt of data will be processed once this query runs now you would need about eight of these queries to get over your free allowance and if you had connected to B query you could also be charged money for any extra work that you do so be very careful about this and if you work with large tables always check this message over here before running the query and remember you won’t actually be charged until you actually hit run on the query and there you have it we learned how the big query interface works and wrote our first SQL query it is important that we understand how data is organized in SQL so we’ve already seen a a preview of the characters table and we’ve said that this is quite similar to how you would see data in a spreadsheet namely you have a table which is a collection of rows and 
columns and then in this case on every row you have a character and for every character you have a number of information points such as their ID their name their class level and so on the first fundamental difference with the spreadsheet is that if I want to have some data in a spreadsheet I can just open a new one and uh insert some data in here right so ID level name and so on then I could say that I have a character id one who is level 10 and his name is Gandalf and this looks like the data I have in SQL and I can add some more data as well well a new character id 2 level five and the name is frao now I will save this spreadsheet and then some days later someone else comes in let’s say a colleague and they want to add some new data and they say oh ID uh is unknown level is um 20.3 and the name here and then I also want to uh show their class so I will just add another column here and call this Mage now spreadsheets are of course extremely flexible because you can always um add another column and write in more cells and you can basically write wherever you want but this flexibility comes at a price because the more additions we make to this uh to the data model that is represented here the more complex it will get with time and the more likely it will be that we make confusions or mistakes which is what actually happens in real life when people work with spreadsheets SQL takes a different approach in SQL before we insert any actual data we have to agree on the data model that we are going to use and the data model is essentially defined by two elements the name of our columns and the data type that each column will contain for example we can agree that we will use three columns in our table ID level and name and then we can agree that ID will be an integer meaning that it will contain contain whole numbers level will be a integer as well and name will be a string meaning that it contains text now that we’ve agreed on this structure we can start inserting data on the table and we have a guarantee that the structure will not change with time and so any queries that we write on top of this table any sort of analysis that we create for this table will also be durable in time because it will have the guarantee that the data model of the table will not change and then if someone else comes in and wants to insert this row they will actually not be allowed to first of all because they are trying to insert text into an integer column and so they’re violating the data type of the column and they are not allowed to do that in level they are also violating the data type of the column because this column only accepts whole numbers and they’re trying to put a floating Point number in there and then finally there are also violating the column definition because they’re they’re trying to add a column class that was not actually included in our data model and that we didn’t agree on so the most important difference between spreadsheets and SQL is that for each SQL table you have a schema and as we’ve seen before the schema defines exactly which columns our table has and what is the data type of each column so in this case for the characters table we have several columns uh and here we can see their names and then each column has a specific data types and all the most important data types are actually represented here specifically by integer we mean a whole number and by float we mean a floating Point number string is a piece of text Boolean is a value that is either true or false and time stamp is a 
value that represents a specific point in time all of this information so the number of columns the name of each column and the type of each column they constitute the schema of the table and like we’ve said the schema is as assumed as a given when working in SQL and it is assumed that will not change over time now in special circumstances there are ways to alter the schema of a table but it is generally assumed as a given when writing queries and we shall do the same in this course and why is it important to keep track of the data type why is it important to distinguish between integer string Boolean the simple answer is that the data type defines the type of operations that you you can do to a column for example if you have an integer or a float you can multiply the value by two or divide it and so on if you have a string you can turn that string to uppercase or lowercase if you have a time stamp you can subtract 30 days from that specific moment in time and so on so by looking at the data type you can find out what type of work you can do with a column the second fundamental difference from spreadsheets is that spreadsheets are usually disconnected but SQL has a way to define connections between tables so what we see here is a representation of our three tables and for each table you can see the schema meaning the list of columns and their types but the extra information that we have here is the connection between the tables so you can see that the inventory table is connected to the items table and also to the character table moreover the characters table is connected with itself now we’re not going to explore this in depth now because I don’t want to add too much Theory we will see this in detail in the chapter on joints but it is a fundamental difference from spreadsheets that SQL tables can be clearly connected with each other and that’s basically all you need to understand how data is organized in SQL for now you create a table and when creating that table you define the schema the schema is the list of columns and their names and their data types you then insert data into this table and finally you have a way to define how the tables are connected with each other I will now show you how SQL code is structured and give you the most important concept that you need to understand in order to succeed at SQL now this is a SQL statement it is like a complete sentence in the SQL language the statement defines where we want to get our data from and how we want to receive these data including any processing that we want to apply to it and once we have a statement we can select run and it will give us our data now the statement is made up of building block blocks which we call Clauses and in this statement we have a clause for every line so the Clauses that we see here are select from where Group by having order and limit and clauses are really the building blocks that we assemble in order to build statements what this course is about is understanding what each Clause is and how it works and then understanding how we can put together these Clauses in order to write effective statements now the first thing that you need to understand is that there is an order to write in these Clauses you have to write them in the correct order and there is no flexibility there if you write them in the wrong order you will simply get an error for example if I I were to take the work clause and put it below the group Clause you can see that I’m already getting an error here which is a syntax error but you 
don’t have to worry about memorizing this now because you will pick up this order as we learn each clause in turn now the essential thing that you need to understand and that slows down so many SQL Learners is that while we are forced by SQL to write Clauses in this specific order this is not actually the order in which the Clauses are executed if you’ve interacted with another programming language such as python or or JavaScript you’re used to the fact that each line of your program is executed in turn from top to bottom generally speaking and that is pretty transparent to understand but this is not what is happening here in SQL to give you a sense of the order in which these Clauses are run on a logical level what SQL does is that first it reads the from then it does the wear then the group by then the having then it does the select part after the select part is do it does the order by and finally the limit all of this just to show that the order in which operations are executed is not the same as the order in which they’re written in fact we can distinguish three orders that pertain to SQL Clauses and this distinction is so important to help you master SQL the first level is what we call the lexical order and this is simply what I’ve just shown you it’s the order in which you have to write these Clauses so that SQL can actually execute the statement and not throw you an error then there’s the logical order and this is the order in which the clause are actually run logically in the background and understanding this logical order is crucial for accelerating your learning of SQL and finally for the sake of completeness I had to include the effective order here because what happens in practice is that your statement is executed by a SQL engine and that engine will usually try to take shortcuts and optimize things and save on processing power and memory and so the actual order might be a bit different because the Clauses might be moved around um in the process of optimization but like I said I’ve only included it for the sake of completeness and we’re not going to worry about that level in this course with we are going to focus on mastering the lexical order and The Logical order of SQL Clauses and to help you master The Logical order of SQL Clauses or SQL operations I have created this schema and this is the fundamental tool that you will use in this course this schema as you learn it progressively will allow you to build a powerful mental model of SQL that will allow you to tackle even the most complex and tricky SQL problems now what this schema shows you is all of the Clauses that you will work with when writing SQL statements so these are the building blocks that you will use in order to assemble your logic and then the sequence in which they’re shown is corresponding to The Logical order in which they are actually executed and there are three simple rules for you to understand this schema the first rule is that operations are EX executed sequentially from left to right the second rule is that each operation can only use data that was produced by operations that came before it and the third rule is that each operation cannot know anything about data that is produced by operations that follow it what this means in practice is that if you take any of these components for example the having component you already know that having will have access to data that was produced by the operations that are to to its left so aggregations Group by where and from however having will have absolutely no 
idea of information that is produced by the operations that follow for example window or select or Union and so on of course you don’t have to worry about understanding this and memorizing it now because we will tackle this gradually throughout the course and throughout the course we will go back to the schema again and again in order to make sense of the work we’re doing and understand the typical errors and Pitfall that happen when working with SQL now you may be wondering why there are these two cases where you actually see two components stacked on top of each other that being from and join and then select an alas these are actually components that are tightly coupled together and they occur at the same place in The Logical ordering which is why I have stacked them like this in this section we tackle the basic components that you need to master in order to write simple but powerful SQL queries and we are back here with our schema of The Logical order of SQL operations which is also our map for everything that we learn in this course but as you can see there is now some empty space in the schema because to help us manage the complexity I have removed all of the components that we will not be tackling in this section let us now learn about from and select which are really the two essential components that you need in order to write the simplest SQL queries going back now to our data let’s say that we wanted to retrieve all of the data from the characters table in the fantasy data set now when you have to write a SQL query the first question you need to ask yourself is where is the data that I need because the first thing that you have to do is to retrieve the data which you can then process and display as needed so in this case it’s pretty simple we know that the data we want leaves in the characters table once you figured out where your data leaves you can write the from Clause so I always suggest starting queries with the from clause and to get the table that we need we can write the name of the data set followed by a DOT followed by the name of the table and you can see that bigquery has recognized the table here so I have written the from clause and I have specified the address of the table which is where the data leaves and now I can write the select clause and in the select Clause I can specify which Columns of the table I want to see so if I click on the characters table here it will open in a new tab in my panel and as you remember the it shows me here the schema of the table and the schema includes the list of all the columns now I can simply decide that I want to see the name and the guilt and so in the select statement here I will write name and guilt and when I run this I get the table with the two columns that I need and one neat thing about this I could write the columns in any order it doesn’t have to be the original order of the schema and the result will show that order and if I I wanted to get all of The Columns of the table I could write them here one by one or I could write star with which is a shorthand for saying please give me all of the columns so this is the corresponding data to our table in Google Sheets and if you want to visualize select in your mind you can imagine it as vertically selecting the parts of the table that you need for example if I were to write select Guild and level this would be equivalent to taking these two columns over here and selecting them let us now think of The Logical order of these operations so first comes the from and then comes 
the SELECT, and this makes logical sense: the first thing you need to do is source the data, and only afterwards can you select the parts of it that you need. In fact, if we look at our schema, FROM is the very first component in the logical order of operations, because the first thing we need to do is get our data. We have seen that the SELECT clause lets us get any columns from our table in any order, but SELECT has many other powers, so let's see what else we can do with it. One useful thing to know about SQL is that you can add comments to your code. Comments are pieces of text that are not executed as code; they are just there for you to keep track of things or explain what you are doing, and you write them with a double dash. Now I'm going to show you aliasing. Aliasing is simply renaming a column: I can take the level column and write AS character_level to give it a new name, and after running the query you can see that the column name has changed. One thing that is important to understand, now that we are starting to transform data with our queries, is that any change we apply, such as renaming this column, only affects our results; it does not affect the original table we are querying. No matter what we do from here on, the actual table fantasy.characters will not change; all that changes is the result we get back from the query. There are ways to go back to fantasy.characters and permanently change it, but that is outside our scope. Going back to our schema, you will see that alias has its own component, and it happens at the same time as the SELECT component. This matters because, as we will see shortly, it is a common temptation to use these aliases, these labels that we give to columns, in the phases that precede this stage, which typically fails because, as our rules say, a component does not have access to data that is computed after it. We will come back to this. Another power of SELECT is constants, the ability to create new columns that hold a constant value. For example, say I wanted to implement a versioning system for my characters: right now every character is version one, and in the future, every time I change a character, I will increase that version so I can keep track of changes. I can do that by simply writing 1 in the column definition, and when I run this, SQL creates a new column and puts 1 in every row, which is why we call it a constant column. If I scroll down, every value is 1. The column gets an odd default name because we have not provided one, but we already know how to fix that: we can use aliasing to call it version. So, in short, when you write a column name in the SELECT clause, SQL looks for that column in the table and returns it; when you write a value instead, SQL creates a new column and puts that value in every row. The next thing SQL lets me do is calculations. Let me add the experience column as well and get my data. One thing I could do is take experience and divide it by 100, and what we get is a new column holding the result of this calculation.
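To make this concrete, here is a minimal sketch of the kind of SELECT the lecture is building up, assuming the course's fantasy.characters table and the column names shown in its schema (name, level, experience):

```sql
-- Aliasing, a constant column and a simple calculation in one SELECT.
SELECT
  name,                                   -- a real column, returned as-is
  level AS character_level,               -- aliasing: same data, new column name
  1 AS version,                           -- a constant column: 1 on every row
  experience / 100 AS experience_scaled   -- a calculated column
FROM fantasy.characters;
```

None of this changes the stored table; it only shapes the result of this one query.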
Now, 100 is a constant value, so you can imagine that in the background SQL has created a new column, put 100 in every row, and then performed the calculation between experience and that new column to give us this result. In short, we can do any kind of calculation we want, combining columns and constants. For example, although this particular calculation doesn't mean much, I could take experience, add 100 to it, divide by character_level and then multiply by two... and we get an error. Can you understand why? Pause the video and think for a second. I am referring to my column as character_level, but what is character_level really? It is a label that I assigned right here. If we go back to our schema, we can see that SELECT and alias happen at the same time, so this is the phase in which we assign our label, and it is also the phase in which we try to use it. According to our rules this is not supposed to work, because an operation can only use data produced by operations before it, and alias does not happen before SELECT; it happens at the same time. In other words, the part where we write character_level is attempting to use information that is produced in the same phase where we assign the label, and because these parts happen at the same time, it is not aware of the label. All this to say that the logical order of operations matters, and that what we actually want to write here is level, because that is the name of the column in the table. Now when I run this I get a resulting number. So, back to the original point: we can combine columns and constants with any sort of arithmetic operation. Another very powerful thing SQL can do is apply functions. A function is a prepackaged piece of logic that you can apply to your data, and it works like this: there is a function called SQRT, which stands for square root, that takes a number and computes its square root. You call the function by name, open round brackets, and inside the brackets you provide the argument. The argument can be a constant such as 16, or a column such as level. When I run this, the square root of 16 is computed as 4, which creates a constant column, and for each value of level the square root has also been computed. There are many functions in SQL, and they vary according to the data type you provide. As you remember, we said that knowing the data types of columns, such as distinguishing between numbers and text, is important because it tells us which operations we can apply, and so there are functions that only work on certain data types. SQRT only works on numbers, but there are also text functions, or string functions, which only work on text. One of them is UPPER, so if I take UPPER and provide guild as the argument, what do you expect will happen? We get a new column where the guild is shown in all uppercase.
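As a sketch of the two situations just described, the failing alias reference and the use of functions, again assuming the column names from the characters schema:

```sql
-- Fails: character_level is assigned in this same SELECT, so another
-- expression in the SELECT list cannot reference it.
SELECT
  level AS character_level,
  (experience + 100) / character_level * 2 AS ratio   -- error: unrecognized name
FROM fantasy.characters;

-- Works: repeat the underlying column instead of the alias.
SELECT
  level AS character_level,
  (experience + 100) / level * 2 AS ratio,
  SQRT(16)     AS constant_root,   -- function applied to a constant
  SQRT(level)  AS level_root,      -- function applied to a column
  UPPER(guild) AS guild_upper      -- a string function
FROM fantasy.characters;
```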
So how can I remember which functions exist and how to use them? The short answer is: I don't. There are many, many functions in SQL, and here in the documentation you can see a very long list of all the functions you can use in BigQuery. As we said, functions vary according to the data they work on, so on the left we have array functions, date functions, mathematical functions, numbering functions, time functions, and so on. It is impossible to remember all of them, so all you need to know is how to look them up when you need them. For example, if I know I need to work with numbers, I can scroll down to the mathematical functions, where I have a long list of all of them, and I should be able to find the square root function I showed you; the description tells me what the function does and also provides some examples. To summarize, these are some of the most powerful things you can do with a SELECT clause: you can retrieve any columns you need in any order, rename columns according to your needs, define constant columns with a value you choose, combine columns and constant columns in all sorts of calculations, and apply functions to do more complex work. I definitely invite you to put your own data into BigQuery, as I've shown you, and start playing around with SELECT to see how you can transform your data with it. One thing worth knowing is that I can also write queries that consist only of a SELECT, without the FROM part, that is, queries that do not reference a table. After I write SELECT I clearly cannot reference any columns, because there is no table, but I can still reference constants: for example, I could write 'hello', 1 and false, and if I run this I get a result. Remember, in SQL we always query tables and we always get back tables; in this case we didn't reference any existing table, we just created constants, so what we get is three columns with constant values and a single row in the resulting table. This is useful mainly for testing things: say I wanted to make sure that the square root function does what I expect, I could call it right here and look at the result. Let's use this capability to look into the order of arithmetic operations in SQL. If I write an expression like this one, would you be able to compute the final result? To do that, you need to know the order in which the operations are carried out, and you might remember this from arithmetic in school, because SQL applies the same order: functions applied to a number are evaluated first, then multiplication and division, then addition and subtraction, and anything inside brackets is evaluated before the rest. Pause the video, apply these rules, and see if you can work out what this expression gives.
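As a sketch, a bare SELECT is a handy way to test constants, functions, and expressions like the exercise above. The exact expression is only shown on screen in the video, so the last query below is a plausible reconstruction of it rather than a verbatim copy:

```sql
-- A query with no FROM clause: it returns a single row of constants.
SELECT 'hello' AS greeting, 1 AS one, FALSE AS flag;

-- Testing a function without touching any table.
SELECT SQRT(16) AS test_root;

-- Reconstruction of the arithmetic exercise; POW(2, 2) is 2 to the power of 2.
SELECT (3 + 2 * 2 + 1) - (POW(2, 2) - 2) * 2 / 2 AS result;
```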
Now let's work through the operation in stages, like we did in school. First of all we deal with the brackets: in the innermost bracket we have a multiplication and an addition, and multiplication goes first, so I execute it and get 4, and then 3 + 4 + 1 gives me 8. Next I copy the rest of the operation, and here I reach another bracket; to evaluate what is inside it I first need to execute the function, which is the power function: it takes 2 and raises it to the power of 2, which gives 4, and then 4 minus 2 gives me 2, and that is what we are left with. Now we can solve this line, executing multiplication and division in the order in which they occur, and I will copy it down for clarity: 8 - 2 * 2 / 2. The first operation that occurs is 2 * 2, which is 4, so that becomes 8 - 4 / 2; the next operation is 4 / 2, which is 2, so I am left with 8 - 2, and all of this gives me 6. Now, all of these working lines are comments, and there is only one line of actual code here, so to see whether I was right I just need to execute that code, and indeed I get 6. So that is how you can use a SELECT-only query to test your assumptions and your operations, along with a short refresher on the order of arithmetic operations, which will be important for solving certain SQL problems. Let us now see how the WHERE statement works. Looking at the characters table, I see that there is a field called is_alive, and this field is of type boolean, which means that its value can be either true or false; if I go to the preview and scroll to the right, I can see that for some characters it is true and for others it is false. Now let's say I only wanted to get the characters that are actually alive. To write my query, I would first write the address of the table, which is fantasy.characters, next I could use the WHERE clause to keep the rows where is_alive is true, and finally I could do a simple SELECT * to get all the columns, and in the results I only get the rows where is_alive is equal to true. So WHERE is effectively a tool for filtering table rows: it keeps only the rows where a certain condition is true and discards all the other rows. If you want to visualize how the WHERE filter works, you can see it as a horizontal selection of certain slices of the table, as in this picture where I have coloured all the rows in which is_alive is true. Now, the WHERE statement is not limited to boolean fields, to columns that can only be true or false; we can apply the WHERE filter to any column by making a logical statement about it. For example, I could ask to keep all the rows where health is bigger than 50. This is a logical statement, health > 50, because it is either true or false for every row, and of course the WHERE filter will only keep the rows where this statement evaluates to true; if I run this, I can see that in all of my results health is bigger than 50. I can also combine smaller logical statements with each other to make more complex ones, for example asking for all the rows where health is bigger than 50 and is_alive is equal to true. Now all of this becomes one big logical statement, which again is true or false for every row, and we only keep the rows where it is true; if I run this, you will see that in the resulting table health is always above 50 and is_alive is always true.
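Here is a minimal sketch of the WHERE filters just described, assuming the is_alive and health column names shown in the table preview:

```sql
-- Keep only the rows where the boolean column is true.
SELECT *
FROM fantasy.characters
WHERE is_alive = TRUE;

-- Any logical statement works, and smaller statements can be combined.
SELECT *
FROM fantasy.characters
WHERE health > 50
  AND is_alive = TRUE;
```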
In the next lecture we will see in detail how these logical statements work and how we can combine them effectively, but for now let us focus on the order of operations and on where the WHERE statement fits. When it comes to the lexical order, the order in which we write things, it is pretty clear from this example: first comes SELECT, then FROM, and after FROM comes the WHERE statement, and you have to respect this order. When it comes to the logical order, the WHERE clause follows right after the FROM clause, so it is actually second in the logical order, and if you think about it this makes a lot of sense: the first thing I need to do is get the data from where it lives, and the very next thing I want to do is drop all the rows I don't need, so that my table becomes smaller and easier to deal with. There is no reason to carry around rows I don't need, data I don't actually want, and waste memory and processing power on it, so I want to drop those unneeded rows as soon as possible. And now that you know that WHERE happens at this stage of the logical order, you can avoid many of the pitfalls that catch people who are just learning SQL. Let's see an example. Take a look at this query: I go to the fantasy.characters table, I get the name and the level, and then I define a new column, which is simply level divided by 10, and I call it level_scaled. Now let's say I wanted to keep only the rows that have at least 3 as level_scaled, so I go here and write a WHERE filter, WHERE level_scaled > 3, and if I run this I get an error: unrecognized name. Can you figure out why we get this error? level_scaled is an alias that we assign in the SELECT stage, but the WHERE clause occurs before the SELECT stage, so the WHERE clause has no way of knowing about this alias. In other words, the WHERE clause sits at this earlier point, and our rules say that an operation can only use data produced by the operations before it, so WHERE cannot know about a label that is assigned at this later stage. So how can we solve this problem? The solution is not to use the alias and instead to repeat the logic of the transformation, and this works because, when you write logical statements in the WHERE filter, you can reference not only the columns of the table but also operations on columns, and this way of writing operations on columns and combinations of columns works just as we showed in the SELECT part. So that was all you need to know to get started with the WHERE clause, a powerful clause that allows us to filter out the rows we don't need and keep the rows we do need based on logical conditions.
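A minimal sketch of that failing query and its fix:

```sql
-- Fails: level_scaled is assigned in the SELECT, but WHERE runs before SELECT.
SELECT name, level, level / 10 AS level_scaled
FROM fantasy.characters
WHERE level_scaled > 3;   -- error: unrecognized name

-- Works: repeat the expression instead of referencing the alias.
SELECT name, level, level / 10 AS level_scaled
FROM fantasy.characters
WHERE level / 10 > 3;
```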
Now let's delve a bit deeper into how exactly these logical statements work in SQL, and here is a motivating example for you: a selection from the characters table with a WHERE filter that is needlessly complicated, and I did that intentionally, because by the end of this lecture you should have no trouble interpreting this statement and figuring out for which rows it will be true, and likewise no problem writing complex statements yourself or deciphering them when you encounter them in the wild. The way these logical statements work is through something called Boolean algebra, which is an essential theory for working with SQL, and indeed with any other programming language, and is fundamental to the way computers work; the name may sound a bit scary, but the fundamentals are really easy to understand. Think back to so-called normal algebra, the kind taught in school. In that algebra you have a set of elements, of which I'm showing just a few positive numbers here, such as 0, 25 and 100; you have operators that act on these elements, such as the square root symbol, the plus sign, the minus sign, the division sign and the multiplication sign; and finally you have operations, where you apply the operators to your elements and get new elements out of them. Here are two different kinds of operation: in one case we take the square root operator and apply it to a single element, and out of this we get another element; in the second kind of operation we use the plus sign to combine two elements, and again we get another element in return. Boolean algebra is very similar, except that in a sense it is simpler, because there are only two elements, true and false; those are all the elements you work with, and that is why a boolean field in SQL is a column that can only hold those two values. Just like normal algebra, Boolean algebra has operators that transform the elements, and for now we will focus on the three most important ones, which are NOT, AND and OR; and Boolean algebra also has operations, in which we combine operators and elements and get elements back. Now we need to understand how these operators work, starting with NOT. To figure out how a Boolean operator works we look at something called a truth table, so let me look up the truth table for the NOT operator, which in this Wikipedia article is listed under logical negation. First of all, logical negation is an operation on one logical value: the NOT operator works on a single element, as in NOT TRUE or NOT FALSE, which makes it similar to the square root operator in algebra, which works on a single number. Next we can see exactly how it works: given an element P, which can only be true or false, the negation of P is simply the opposite value, so NOT TRUE is false and NOT FALSE is true. We can easily test this in our SQL code: if I run SELECT NOT TRUE, I get false, and SELECT NOT FALSE of course gives true. Next, the AND operator. Where NOT works on a single element, AND connects two elements, as in TRUE AND FALSE, and in that sense it is more like the plus sign, which connects two elements. So what is the result of TRUE AND FALSE? Back to the truth tables, this time in the logical conjunction section, which is another name for the AND operator. AND combines two elements, each of which can be true or false, which gives the four combinations you see in the table, and only when both elements are true does AND return true; in every other case it returns false. So if I select TRUE AND FALSE I get false, and only TRUE AND TRUE gives true. Finally the OR operator, also known as logical disjunction: it also combines two elements and also has four combinations, but in this case, if at least one of the two elements is true you get true, and only when both are false do you get false. So TRUE OR TRUE is of course true, and even if one of the two is false we still get true; only when both are false do we get false.
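These truth-table rules are easy to verify with bare SELECT statements, for example:

```sql
-- Each column shows one rule from the truth tables.
SELECT
  NOT TRUE        AS not_true,         -- false
  NOT FALSE       AS not_false,        -- true
  TRUE AND FALSE  AS true_and_false,   -- false
  TRUE AND TRUE   AS true_and_true,    -- true
  TRUE OR FALSE   AS true_or_false,    -- true
  FALSE OR FALSE  AS false_or_false;   -- false
```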
So now you know how the three most important operators in Boolean algebra work. The next step is to be able to solve long and complex expressions such as this one, and since you already know how the operators behave, the only information you are missing is the order of operations; just like in arithmetic, there is an agreed-upon order of operations that helps us solve complex expressions, and it is written here: first you solve NOT, then AND, and finally OR, and, as in arithmetic, brackets are solved before everything else. Let's see how that works in practice and simplify this expression. The first thing I want to do is deal with the brackets, so I copy this part as a comment, so that it doesn't run as code. This is the most deeply nested bracket in the expression, the innermost one, and we solve it first: TRUE OR TRUE is true. Now I copy the rest of the expression up to here and solve the innermost bracket on this side as well: I write TRUE AND, and what I have here is FALSE AND TRUE, which is false, because with AND both elements need to be true for it to return true, otherwise it is false, so I write false. Moving to the next line, I need to solve what is in the bracket, so I copy the NOT; there are several operators inside this bracket, but we have seen that NOT takes precedence, so I copy TRUE, and NOT FALSE becomes true, and then I copy the rest of the bracket, not doing any more at this step to avoid confusion. Then I have the OR, and I can solve this bracket over here: TRUE AND FALSE is false. Moving on, I keep working on my bracket. There are a lot of operations here, but I need to give precedence to the ANDs. The first AND that occurs is this one, so I start with that expression: TRUE AND TRUE results in true; then I copy the OR, and now there is another AND, so I isolate that expression: FALSE AND TRUE results in false; and finally I copy the last AND, which I cannot compute yet because I first needed its left side, along with the remaining part. On the next line I still do the AND, because AND takes precedence, so this is the expression I have to compute: I write TRUE OR, and then this expression, FALSE AND TRUE, computes to false, and I copy the rest. Let me make some room and go to the next line: I can finally compute this bracket, TRUE OR FALSE, which we know is true; next I need to invert that value, because I have NOT TRUE, which is false; then I have OR FALSE, and finally the whole thing computes to false. And now, for the moment of truth, pun intended, I run my code to see whether the result matches what we worked out, and the result is indeed false. In short, this is how you solve complex expressions in Boolean algebra: you just need to understand how the three operators work, you can use truth tables like the one we looked at to help you, you need to respect the order of operations, and if you proceed step by step you will have no problem solving them.
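Since the expression itself is only visible on screen, here is a smaller, purely illustrative expression (not the lecture's exact one) worked through with the same precedence rules, which you can paste into BigQuery to check:

```sql
-- Step by step, NOT first, then AND, then OR, brackets before everything:
--   NOT (TRUE AND (FALSE OR NOT FALSE)) OR FALSE
--   = NOT (TRUE AND (FALSE OR TRUE)) OR FALSE    -- NOT FALSE -> TRUE
--   = NOT (TRUE AND TRUE) OR FALSE               -- innermost bracket
--   = NOT TRUE OR FALSE                          -- remaining bracket
--   = FALSE OR FALSE                             -- then the NOT
--   = FALSE                                      -- finally the OR
SELECT NOT (TRUE AND (FALSE OR NOT FALSE)) OR FALSE AS result;
```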
But now let's go back to the query we started with, because what we have there is a complex logical statement plugged into the WHERE filter, and it isolates only certain rows; we want to understand exactly how that statement works, so let us apply what we've just learned about Boolean algebra to decipher it. What I've done here is take the first row of our results and copy its values into a comment, and then copy our logical statement here as well, so that we can see what SQL does when it checks this row. The first thing we need to do is take every component of the WHERE filter and convert it to true or false, and to do that we look at our data. Start with the first component, level > 20: for the row we are considering, level is 12, so this comes out as false. I copy the AND, and next we have is_alive = true: for our row is_alive is false, so this statement computes as false. Then mentor_id IS NOT NULL, with NULL representing the absence of data: in our case mentor_id is 1, so it is indeed not null, and this is true. And finally we have class IN ('Mage', 'Archer'): we haven't seen this before, but it should be pretty intuitive; it is a membership test that looks at class, which in this case is Hobbit, and checks whether it can be found in this list, and for our row that is false. Now that we've plugged in all the values for our row, what we have is a classic Boolean algebra expression, and we can solve it with what we've learned. First I deal with the brackets: inside I have an AND and an OR, and the AND takes precedence, so FALSE AND FALSE is false, and I copy the rest; then I have NOT FALSE, which is true; next, FALSE OR TRUE is true; then AND TRUE, and in the end the whole thing computes to true. In this case we more or less knew the result had to come out as true, because we started from a row that survived the WHERE filter, which means that for this particular row the statement had to compute as true, but it is still good to know exactly how SQL computed it and to understand exactly what is going on. And this is how SQL deals with complex logical statements: for each row it looks at the relevant values in the row so that it can convert the statement into a Boolean algebra expression, then it uses the rules of Boolean algebra to compute a final result, which is either true or false, and if it computes as true the row is kept, otherwise the row is discarded. This is great to know, because this way of resolving logical statements applies not only to the WHERE component but to every component in SQL that uses logical statements, as we shall see in this course. Let us now look at the DISTINCT clause, which allows me to remove duplicate rows. Say I wanted to examine the class column in my data: I could simply select it and look at the results. But what if I wanted to see only the unique classes that appear in my data? This is where DISTINCT comes in handy: if I write DISTINCT here, I can see that there are only four unique classes in my data. Now what if I was interested in the combinations of class and guild? Let me remove DISTINCT for a moment, add guild, and, to make the results easier to read, add an ordering. Here are the combinations of class and guild in my data: there is a character who is an Archer and belongs to Gondor, there are two characters who are Archers and belong to Mirkwood, there are many Hobbits from the Shire, and so on. But again, what if I only wanted the unique combinations of class and guild in my data? I can add the DISTINCT keyword, and as you can see there are no more repetitions: Archer and Mirkwood occurs only once, Hobbit and the Shire occurs only once, because I am only looking at unique combinations.
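A sketch of the DISTINCT queries just described:

```sql
-- All unique classes in the data.
SELECT DISTINCT class
FROM fantasy.characters;

-- All unique class / guild combinations, ordered for readability.
SELECT DISTINCT class, guild
FROM fantasy.characters
ORDER BY class, guild;
```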
Of course I could go on adding more columns and expand the results to show the unique combinations across those columns: here Hobbit and the Shire has expanded again, because some Hobbits are alive and others unfortunately are not. At the limit I could put a star here, and what I would get back is my whole data set, all 15 rows, because in that case we are looking for rows that have the same value in every column, rows that are complete duplicates, and there are no such rows in the data, so with SELECT DISTINCT * the DISTINCT has no effect. So, in short, this is how DISTINCT works: it looks only at the columns you have selected, then it looks at all the rows; two rows are duplicates if they have exactly the same values in every selected column, and duplicate rows are removed so that only unique combinations are kept. Just like the WHERE filter, DISTINCT is a clause that removes certain rows, but it is stricter and less flexible in a sense: it does only one job, and that job is to remove duplicate rows based on your selection. If we look at our map of SQL operations, we can place DISTINCT right after SELECT, and this makes sense, because we have seen that DISTINCT works only on the columns you have selected, so it has to wait for SELECT to choose the columns we are interested in before it can deduplicate based on them. For the following lecture on unions I wanted a very clear example, so I decided to take the characters table, split it in two, and create two new tables out of it, and I thought I should show you how I'm doing this, because it is a neat thing to know and it will help you when you are working with SQL in BigQuery. So here is a short primer on yet another way to create a table in BigQuery: you can use your newly acquired power of writing SQL queries and turn those queries into permanent tables. Here is how. First, I've written a simple query that you should have no trouble understanding by now: go to the fantasy.characters table, keep only the rows where is_alive is true, and get all the columns. Next, we choose where the new table will live and what it will be called: I am placing it in the fantasy dataset as well, and I am calling it characters_alive. Finally, there is a simple command, CREATE TABLE. What you see here is a single statement in SQL, a single command that creates the table, and you can in fact have multiple statements in the same piece of code and run them all together when you hit run; the trick is to separate them with a semicolon, which tells SQL that this command is over and that another one may follow. So here is the second statement we are going to run, and it looks just like the one above, except that the query has changed, because we are keeping the rows where is_alive is false, and we are calling this table characters_dead. I have my two statements, separated by semicolons, and I can just hit run; BigQuery shows the two statements on two separate rows, and both of them are now done. If I open the Explorer I can see that I have two new tables, characters_alive and characters_dead, and if I look at characters_alive, is_alive is of course true on every row.
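Here is a sketch of that two-statement script, assuming the new tables are named characters_alive and characters_dead in the fantasy dataset:

```sql
-- Statement 1: save the living characters as a new table. The semicolon ends it.
CREATE TABLE fantasy.characters_alive AS
SELECT *
FROM fantasy.characters
WHERE is_alive = TRUE;

-- Statement 2: save the dead characters as another table.
CREATE TABLE fantasy.characters_dead AS
SELECT *
FROM fantasy.characters
WHERE is_alive = FALSE;
```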
Now, what do you think would happen if I ran this script again? Let's try it: I get an error, and the error says that the table already exists. This makes sense, because I have told SQL to create a table, and SQL is saying that the table already exists, so it cannot create it again. There are ways to tell SQL what to do when the table already exists, so that we specify the behaviour we want instead of just getting an error. One way is to write CREATE OR REPLACE TABLE fantasy.characters_alive, and what this does is that, if the table already exists, BigQuery will delete it and then create it again, or in other words it will overwrite the data. Let's write that down too and make sure the query actually works: when I run it I get no errors, even though the table already existed, because BigQuery was able to remove the previous table and create a new one. Alternatively, we may want to create the table only if it does not exist yet, and leave it untouched otherwise; in that case we can write CREATE TABLE IF NOT EXISTS, which means that if the table already exists, BigQuery will not touch it and will not throw an error, while if it does not exist, BigQuery will create it. Let's write that down too and make sure this query also runs without errors, and indeed it does. And that, in short, is how you can save the results of your queries in BigQuery and turn them into full-fledged tables that you can keep and query at will. I think this is a really useful feature when you are analysing data in BigQuery, because any results of your queries that you would like to keep, you can simply save and then come back to later. Let's learn about unions now. To show you how this works, I have taken our characters table and split it into two parts, and I believe the names are quite self-descriptive: there is now a separate table for characters who are alive and a separate table for characters who are dead, and you can look at the previous lecture to see how I used a query to create these two new tables. This is exactly the characters table, with the same schema and the same columns, just split in two based on the is_alive column. Now let us imagine that we do not have the fantasy.
characters table anymore we do not have the table with all the characters because it was deleted or we never had it in the first place and let’s pretend that we only have these two tables now characters alive and characters dead and we want to reconstruct the characters table out of it we want to create a table with all the characters how can we do that now what I have here are two simple queries select star from fantasy characters alive and select star from fantasy characters dead so these are two separate queries but actually in big query there are ways to run multiple queries at the same time so I’m going to show you first how to do that now an easy way to do that is to write your queries and then add a semicolon at the end and so what you have here is basically a SQL script which contains multiple SQL statements in this case two and if you hit run all of these will be executed sequentially and when you look at the results so you’re not just getting a table anymore because it’s not just a single query that has been executed but you can see that there have been two commands uh that have been executed which are here and then for each of those two you can simply click View results and you will get to the familiar results tab for that and if I want to see the other one I will click on the back arrow here and click on the other view results and then I can see the other one another way to handle this is that I can select the query that I’m interested in and then click run and here I see the results so big query has only executed the part that I have selected or I can decide to run the other query in my script select it click run and then I will see the results for that query and this is a pretty handy functionality in big query it’s also functionality that might give you some headaches if you don’t know about it because if for some reason you selected a part of the code uh during your work and then you just want to run everything you might hit run and get an error here because B queer is only seeing the part that you selected and cannot make sense of it so it’s good to know about this but our problem has not been solved yet because remember we want to reconstruct the characters table and what we have here are two queries and we can run them separately and we can look at the results separately but we still don’t have a single table with all the results and this is where Union comes into play Union allows me to stack the results from these two tables so so if I take first I will take off the semic columns because this will become a single statement and then in between these two queries I will write Union distinct and when I run this you can verify for yourself we have 15 rows and we have indeed reconstructed the characters table so what exactly is going on here well it’s actually pretty simple SQL is taking all of the rows from this first query and then all of the rows for the second query and then it’s stacking them on top of each other so you can really imagine the act of horizontally stacking a table on top of the other to create a new table which contains all of the rows of these two queries combined and that in short is what union does now there are a few details that we need to know when working with Union and to figure them out let us look at a toy example so I’ve created two very simple tables toy one and toy two and you can see how they look in these comments over here they all have three columns which are called imaginatively call One Call two call three and then this is the uh Toy 
Now, there are a few details we need to know when working with UNION, and to figure them out, let us look at a toy example. I've created two very simple tables, toy_1 and toy_2, and you can see how they look in these comments over here. They both have three columns, imaginatively called col_1, col_2, and col_3; this is the toy_1 table and this is the toy_2 table. Just like before, we can combine these tables by selecting everything from each of them and writing a UNION in between. Now, in BigQuery you're not allowed to write UNION without a further qualifying keyword: it has to be either ALL or DISTINCT, and you have to choose one of the two. What is the choice about? With UNION ALL you get all of the rows that are in the first table and all of the rows that are in the second table, regardless of whether they are duplicates. With UNION DISTINCT you again get the rows from both tables, but only the unique rows are kept; you will not get any duplicates. We can see that these two tables share a row, which is identical: 1, true, yes over here, and the same row over here. If I write UNION ALL, I expect the result to include that row twice, and we can verify that: you see 1, true, yes here and at the end you also have 1, true, yes, and in total you get four rows, which are all the rows in the two tables. If I write UNION DISTINCT instead, I expect to get three rows, with the shared row appearing only once and not duplicated. (Again, make sure you haven't selected any little part of your script before you run it, so the whole script is executed.) And indeed, we get three rows and no duplicates.

It's interesting that BigQuery forces you to choose between ALL and DISTINCT, because in many SQL systems, for your information, you can write UNION without any qualifier, and in that case it means UNION DISTINCT. So in other SQL systems, when you write UNION it is understood that you want UNION DISTINCT, and if you actually want to keep the duplicate rows you explicitly write UNION ALL; in BigQuery you always have to say explicitly whether you want UNION ALL or UNION DISTINCT. Now, the reason this command is called UNION, and not "stack" or something else, is that this is set terminology: it comes from the mathematical theory of sets, which you might remember from school. The idea is that a table is simply a set of rows, so this table over here is a set of two rows and this table over here is a set of two rows, and once you have two sets you can perform various set operations between them. The most common operation we do in SQL is the union, and unioning means combining the elements of two sets. You might remember the Venn diagram from school, a typical way to visualize the relations between sets. In this simple Venn diagram we have two circles, A and B, which represent two sets; in our case A represents the collection of rows in the first table and B represents all the rows in the second table. To union these sets means taking all of the elements that appear in either set, so taking all of the rows from both tables. And what is the difference here between UNION DISTINCT and UNION ALL? The rows of A are this part over here plus this part over here, and the rows of B are this part over here plus this part over here, so when we combine them we are counting the intersection twice. What do you do with that double counting: keep it or discard it? If you do UNION ALL you keep it, so rows that A and B have in common are duplicated and you see them twice; if you do UNION DISTINCT you discard it, and you won't have any duplicates in the result.
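A sketch on the toy tables; the exact spellings toy_1, toy_2 and col_1, col_2, col_3 are my rendering of the names mentioned in the lecture:

```sql
-- UNION ALL keeps duplicates: the row shared by both tables shows up twice (4 rows in total).
SELECT * FROM fantasy.toy_1
UNION ALL
SELECT * FROM fantasy.toy_2;

-- UNION DISTINCT removes duplicates: the shared row shows up once (3 rows in total).
SELECT * FROM fantasy.toy_1
UNION DISTINCT
SELECT * FROM fantasy.toy_2;
```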
That's one way to think about it in terms of sets, but we also know that union is not the only set operation: there are others, and a very popular one is the intersection. The intersection says: take only the elements that are in common between the two sets. Can we do that in SQL, can we say "give me only the rows that are in common between the two tables"? The answer is yes. If we go back here, we can write INTERSECT DISTINCT instead of UNION. What do you expect to see after I run this? Take a minute to think about it. What I expect to see is only the rows shared between the two tables, and there is one such row, the 1, true, yes row we saw earlier; if I run this, that is exactly the row I get. So INTERSECT DISTINCT gives me the rows shared between the two tables, and I have to write INTERSECT DISTINCT; INTERSECT ALL is not supported here, so it's not going to work. Here's another set operation you might consider, which is subtraction: what if I told you, give me all of the elements of A except the ones that A shares with B? On the drawing, that means taking all of the elements that are in A except the ones in the overlapping region, because those are in A but also in B, and I don't want the elements shared with B. Yes, I can also do that in SQL: I can come here and say give me everything from toy_1 EXCEPT DISTINCT everything from toy_2, which means I want all of the rows from toy_1 except the rows shared with toy_2. What do you expect to see when I run this? I expect to see only this row over here, because the other row is shared with B, and that is what I get. Again, you have to write EXCEPT DISTINCT; EXCEPT ALL is not supported. And keep in mind that, unlike the previous two operations, union and intersection, the EXCEPT operation is not symmetric: if I swap the two tables over here, I expect a different result, namely this other row, because now I'm saying give me everything from toy_2 except the rows shared with toy_1. Let us run this to make sure, and in fact I get the 3, true, maybe row. So be careful: EXCEPT is not symmetric, and the order in which you put the two tables matters.

That was a short overview of UNION, INTERSECT, and EXCEPT; I will link the BigQuery documentation here, where you can see that they are actually called set operators. In real life you almost always see UNION; very rarely will you see somebody using INTERSECT or EXCEPT, and a lot of people don't even know about them, but I think it was worth briefly looking at all three. It's especially good for you to get used to thinking about tables as sets of rows and thinking about SQL operations in terms of set operations, and that will also come in handy when we study joins.
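Sketches of the other two set operators on the same toy tables:

```sql
-- Only the rows that appear in both tables (here, the single shared row):
SELECT * FROM fantasy.toy_1
INTERSECT DISTINCT
SELECT * FROM fantasy.toy_2;

-- Rows of toy_1 that are not shared with toy_2; swapping the two tables gives a different result:
SELECT * FROM fantasy.toy_1
EXCEPT DISTINCT
SELECT * FROM fantasy.toy_2;
```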
But let us quickly go back to our toy example, because there are two essential prerequisites for you to be able to do a union, or any of these set operations: number one, the tables must have the same number of columns, and number two, the columns must have the same data types. As you can see here, we are combining toy_2 and toy_1, and both of them have three columns; the first column is an integer, the second is a boolean, and the third is a string in both tables, and this is why we are able to combine them. What would happen if I went to the first table, selected only the first two columns, and then tried to combine it? You guessed it: I would get an error, because I have a mismatched column count. If I want to select only the first two columns in one table, I need to select only the first two columns in the other table as well, and then the union will work. Now, what would happen if I messed up the order of the columns? Let's say that here I select col_1 and col_3, and here I select col_1 and col_2. When I run this, I get an error about incompatible types, string and bool. What's happening is that SQL is trying to take the values of col_3 over here and put them into col_2 over here, so it's trying to put a string into a boolean column, and that simply doesn't work because, as you know, SQL enforces strict types on columns. Of course, I could select col_3 here as well, and then a string column goes into a string column, and this works. So, to summarize: you can UNION, INTERSECT, or EXCEPT any two tables as long as they have the same number of columns and the columns have the same data types.
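A sketch of the failing and working variants just described (the failing ones are left commented out because they would raise the errors mentioned above):

```sql
-- Mismatched column count: two columns stacked onto three -> error.
-- SELECT col_1, col_2 FROM fantasy.toy_1
-- UNION DISTINCT
-- SELECT *            FROM fantasy.toy_2;

-- Mismatched types: col_3 (string) stacked onto col_2 (boolean) -> error.
-- SELECT col_1, col_3 FROM fantasy.toy_1
-- UNION DISTINCT
-- SELECT col_1, col_2 FROM fantasy.toy_2;

-- Same number of columns and same types, position by position -> works.
SELECT col_1, col_3 FROM fantasy.toy_1
UNION DISTINCT
SELECT col_1, col_3 FROM fantasy.toy_2;
```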
Let us now illustrate a union with a more concrete example. We have our items table here and our characters table here: the items table represents magical items, while the characters table, which we're familiar with, represents the actual characters. Let's say that you are managing a video game and someone asks you for a single table that contains all of the entities in that game, where the entities include both characters and items, so you want to create a table that combines these two tables into one. We know we can use UNION to stack all the rows, but we cannot directly union these two tables, because they have different schemas: a different number of columns, and columns with different data types. So let's analyze what the two tables have in common and how we might combine them. First of all, they both have an id, and in both cases it's an integer, so that's already pretty good. They both have a name, and in both cases it's a string, so we can combine that as well. The item_type can be thought of as similar to the class; each item has a level of power expressed as an integer, and each character has a level of experience expressed as an integer, so you can think of those as kind of similar; and finally both have a timestamp field representing a moment in time, date_added and last_active. Looking at the columns the two tables have more or less in common, we can find a way to combine them, and here's how that translates into SQL: I went to the fantasy.items table and selected the columns I wanted, then I went to the characters table and selected the columns I wanted to combine with those, in the right order. We have id with id, name with name, class with item_type, level with power, and last_active with date_added. My columns are in the right order, I wrote UNION DISTINCT, and if I run this you will see that I have successfully combined the rows from these two tables, by finding out which columns they have in common, writing them in the right order, and adding UNION DISTINCT.

Now, all the columns we've chosen for the combination have the same types, but what would happen if I wanted to combine two columns that don't? Say we wanted to combine rarity, which is a string, with experience, which is an integer. As you know, I cannot do this directly, but I can get around it by either converting rarity to an integer or converting experience to a string; I just have to make sure they end up with the same data type. The easiest route is usually to take whatever other data type you have and turn it into a string, so for the sake of this demonstration we will take experience, which is an integer, turn it into a string, and combine that with rarity. I go back to my code, make some room, add rarity here in items and experience here in characters, and you can see that I already get an error saying the UNION DISTINCT has incompatible types, just as expected. What I want to do is take experience and turn it into a string, and I can do that with the CAST function: CAST(experience AS STRING) takes those values and converts them to strings. If I run this, you can see that it has worked: we have combined the two tables into one, and the result has a column called rarity. The reason it's called rarity is that the name is taken from the first table in the operation, but we could of course rename it to whatever we need; and it is now a text column, because, thanks to the cast, we combined a text column with another text column. What we see here are a bunch of numbers that originally came from the experience column of the characters table but are now text, and if I scroll down, I also see the original rarity values from the items table.
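A sketch of the combined "entities" query, including the cast; the column names (item_type, power, rarity, date_added on the items side, and class, level, experience, last_active on the characters side) follow the ones discussed in this lecture:

```sql
-- Result column names come from the first query, so the fifth column is called rarity.
SELECT id, name, item_type, power,
       rarity,                      -- already a string
       date_added
FROM fantasy.items
UNION DISTINCT
SELECT id, name, class, level,
       CAST(experience AS STRING),  -- cast the integer so the types line up with rarity
       last_active
FROM fantasy.characters;
```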
Finally, let us examine UNION in the context of the logical order of SQL operations. You can see our logical map here, but it looks a bit different than usual, and the reason is that we are considering what happens when you union two tables: the blue represents one table and the red represents the other. I wanted to show you that all of the ordering we have seen until now (first get the table, then filter it with WHERE, then select the columns you want, and, if you want, use DISTINCT to remove duplicates) happens in the same order, separately, for each of the two tables you are unioning, and this also applies to the other operations, like joining and grouping, which we will see later in the course. At first the two tables proceed on two separate tracks, with SQL performing all of these operations on each of them in this specific order, and only after all of that has run do we get to the UNION, where the two tables are combined into one; only after they have been combined do the last two operations, ORDER BY and LIMIT, apply. And actually, nothing forces you to combine only two tables: you can union any number of tables, and the logic doesn't change at all; all of these operations happen separately for each table, and only when all of the tables are ready are they combined into one. If you think about it, this makes a lot of sense: first of all, you need the SELECT to have run in order to know the schema of the tables you are combining, and you also need DISTINCT to have run on each table, because you need to know which rows you are combining in the union. And that is all you need to get started with UNION, this very powerful statement that allows us to combine rows from different tables.

Let us now look at ORDER BY. I'm looking at the characters table here, and as you can see we have an id column that goes from 1 to 15 and assigns an ID to every character, but you will notice that the IDs don't appear in any particular order. In fact, this is a general rule for SQL: there is absolutely no order guarantee for your data. Your data is not stored in any specific order, and it is not going to be returned in any specific order. The reason for this is fundamentally one of efficiency: if the engine always had to make sure your data was perfectly ordered, that would add a lot of work and a lot of overhead to running queries, and there's really no reason to do it. However, we do often want to order our data when querying it, we want to order the way it is displayed, and this is why the ORDER BY clause exists. Let us see how it works. I am selecting everything from fantasy.characters, and again I get the results in no particular order, but say I wanted to see them ordered by name: I would write ORDER BY name, and as you can see the rows are now ordered alphabetically by name. I could also invert the order by writing DESC, which stands for descending, and here means descending alphabetical order, from the last letter of the alphabet to the first. I can of course also order by numeric columns such as level, and we see the level increasing; that could also be descending, to go in the opposite direction, and the corresponding keyword is ASC, which stands for ascending and is the default behavior, so even if you omit it you get the same result, from the smallest value to the largest. I can also order by multiple columns, so I could say ORDER BY class and then level: first the rows are ordered by class, alphabetically, so Archer comes first and Warrior last, and then within each class the rows are ordered by level, from the smallest to the largest. I can invert the order of just one of them, for example class, and in that case we start with the warriors, and within the warrior class the levels are still in ascending order; so for every column in the ordering I can decide whether it is ascending or descending. Now let us remove this and select just the name and the class; once again I get my rows in no particular order. I wanted to show you that you can also order by columns you have not selected, so I could order these rows by level even though I'm not displaying level, and it works all the same. And finally, I can also order by calculations: I could say take level, divide it by experience, and multiply that by two, for whatever reason, and that expression would be used for the ordering even though I never see it; the calculation is done in the background and used only to order the rows.
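A few sketches of those variations on the characters table:

```sql
-- Alphabetical by name; add DESC for reverse alphabetical order:
SELECT * FROM fantasy.characters ORDER BY name;

-- Warriors first (class descending), then increasing level within each class:
SELECT * FROM fantasy.characters ORDER BY class DESC, level ASC;

-- Ordering by a calculation that is not in the SELECT list:
SELECT name, class
FROM fantasy.characters
ORDER BY level / experience * 2;
```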
In fact, I could take that level / experience * 2 expression, copy it, and create a new column out of it, calling it calc for calculation. If I show you this, you will see that the results are not very meaningful, but they are in ascending order, so we have indeed ordered by that calculation. Sometimes you will also see a notation like ORDER BY 2, 1. As you can see, what we've done here is order by class first of all, because we start with the archers and go down to the warriors, and then within each class we order by name, also in ascending order. The numbers refer to the columns referenced in the SELECT: 2 means order by the second column you referenced, which in this case is class, and 1 means order by the first column you referenced, which is name. It is basically a shortcut people sometimes use to avoid rewriting the names of the columns they have already selected. And finally, going back to the order of operations, we can see that ORDER BY happens at the very end of the whole process. As you will recall, I created this slightly more complex diagram to show what happens when we union different tables together: all the operations run independently on each table, and then the tables get unioned together. Only after all of this is done does SQL know the final list of rows that will be included in the results, and that is the right moment to order those rows; it would not be possible to do it before, so it makes sense that ORDER BY is located here.

Let us now look at the LIMIT clause. What I have here is a simple query: it goes to the characters table, filters for the rows where the character is alive, and then selects three columns. Let's run it, and you can see that it returns 11 rows. Now let's say I only wanted to see five of those rows; this is where LIMIT comes into play. LIMIT looks at the final result and picks five rows out of it, reducing the size of my output, and here you can see that we get five rows. As we said in the lecture on ordering, by default there is no guarantee of order in a SQL system, so when you get all your data with a query and then run LIMIT 5 on top of it, you have no way of knowing which rows will be selected to fit among those five; you're basically saying you're okay with getting any five rows of your result. Because of this, people often use LIMIT in combination with ORDER BY: for example, I could say ORDER BY level and then LIMIT 5, and what I get is essentially the five least experienced characters in my data set. Now suppose your task is to find the least experienced character in your data, the character with the lowest level. Of course, you could say ORDER BY level and then LIMIT 1, and you would get a character with the lowest level, and this works. However, it is not ideal; there is a problem with this solution, so can you figure out what it is? The problem becomes obvious once I go back to LIMIT 5 and look here: I actually have two characters that share the lowest level in my data set, so in theory I should be able to return both of them, because they both have the lowest level. However, when I write LIMIT 1, it simply cuts the rows in my output, and it is unaware of the extra information sitting in that second row.
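Sketches of the positional shortcut and of LIMIT combined with ORDER BY:

```sql
-- ORDER BY 2, 1 refers to positions in the SELECT list: class first, then name.
SELECT name, class
FROM fantasy.characters
ORDER BY 2, 1;

-- The five least experienced living characters; note that ties at the cut-off are simply cut.
SELECT name, class, level
FROM fantasy.characters
WHERE is_alive = TRUE
ORDER BY level
LIMIT 5;
```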
In later lectures we will see how to solve this in a better way and get more precise results. If we look at the logical order of operations, we can see that LIMIT is the very last operation: all of the logic of the query is executed, all the data is computed, and then, based on that final result, we sometimes decide not to output all of it but only a limited number of rows. A common mistake for someone starting out with SQL is thinking that LIMIT can be used to make a query cheaper. For example, you could say: this is a really large table, it has two terabytes of data, it would cost a lot to scan the whole thing, so I will write SELECT * but add LIMIT 20, because I only want to see the first 20 rows; that means I will only scan 20 rows and my query will be very cheap, right? No, that is actually wrong; it doesn't save you anything, and you can understand why by looking at the map: all of the logic executes before you get to LIMIT. When you say SELECT *, you scan the whole table and apply all of the logic, and LIMIT only changes the way your result is displayed, not the way it is computed. If you do want to write your query so that it scans fewer rows, the thing to focus on is actually the WHERE statement, because WHERE runs at the beginning, right after getting the table, and it can actually eliminate rows, which usually saves you computation and money. I do need to say that there are systems where writing LIMIT may actually turn into savings, because different systems are optimized in different ways and allow you to do different things with the commands, but as a rule, in SQL, LIMIT just changes how your result is displayed and doesn't change anything in the logic of execution.

Let us now look at the CASE clause, which allows us to apply conditional logic in SQL. You can see a simple query here: I am getting the data from the characters table, filtering it so that we only look at characters who are alive, and then, for each character, getting the name and the level. Now, when you have a column that contains numbers, such as level, one typical thing you do in data analysis is bucketing. Bucketing basically means that I look at all the different values that level can take and reduce them to a smaller number of values, so that whoever looks at the data can make sense of it more easily. The simplest form of bucketing has only two buckets: looking at level, for example, one bucket could hold the values that are greater than or equal to 20, so characters whose level is at least 20, and the other bucket all the characters whose level is less than 20. How could I define those two buckets? We know that we can define new columns in the SELECT statement, and that we can use calculations and logical statements to define those columns, so one thing I could do is write level >= 20 and call this new column level_at_least_20, for example. When I run this, I get my column. Of course, this is a logical statement, so for each row it is either true or false, and you can see that our new column gives us TRUE or FALSE for every row.
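A sketch of that two-bucket boolean column (the alias level_at_least_20 is my naming for the column described in the lecture):

```sql
SELECT name, level,
       level >= 20 AS level_at_least_20   -- TRUE or FALSE for every row
FROM fantasy.characters
WHERE is_alive = TRUE;
```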
This is a really basic form of bucketing: level takes 11 different values in our data, and it can be complicated to look at that many values at once, so we've reduced those 11 values to two buckets, which organizes the data better and makes it easier to read. But there are two limitations with this approach. One, I might not want to call my buckets TRUE and FALSE; I might want to give them more informative names, such as "experienced" and "inexperienced". The other limitation is that with this approach I can effectively only divide my data into two buckets, because a logical statement is either true or false, and often I want to use more than two buckets for my use case. Bucketing is a typical use case for the CASE WHEN statement, so let's see it in action. Let me first write a comment, not actual code, where I define what I want to do, and then I will implement it. I have written here the buckets I want to use to classify the characters' levels: up to 15 they are considered low experience, between 15 and 25 they are considered mid, and anything from 25 up we will classify as super. Now let us apply the CASE clause to make this work. The CASE clause is always bookended by two parts, CASE and END: it starts with CASE and it ends with END, and a typical error when you're getting started is to forget the END part, so my recommendation is to always write both of these first and then go in the middle to write the rest. In the middle we define all the conditions we're interested in. Each condition starts with the keyword WHEN, followed by a logical condition; our first logical condition here is level < 15. Then we define what to do when this condition is true, which follows the keyword THEN: when this condition is true, we want to return the value 'low', which is a string, a piece of text that says low. Next we proceed with the following condition: WHEN level >= 15 AND level < 25. If you have trouble understanding this logical statement, I suggest you go back to the lecture on Boolean algebra, but what we have here are two smaller statements, level greater than or equal to 15 and level less than 25, connected by AND, which means both of them have to be true for the whole statement to be true, which is what we want; in that case we return the value 'mid'. And the last condition: WHEN level >= 25, THEN we return 'super'. All of this together is the CASE clause, or the CASE statement, and it basically defines a new column in my table; given that it's a new column, I can use the alias syntax to give it a name, and I will call it level_bucket. Let's run this and see what we get: we have our level bucket, the characters above 25 are super, then we have a few mids, and everyone under 15 is low, so we got the results we wanted.
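Putting the pieces together, the query might look like this sketch:

```sql
SELECT name, level,
       CASE
         WHEN level < 15 THEN 'low'
         WHEN level >= 15 AND level < 25 THEN 'mid'
         WHEN level >= 25 THEN 'super'
       END AS level_bucket
FROM fantasy.characters
WHERE is_alive = TRUE;
```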
Now let us see exactly how the CASE statement works. I'm going to take Gandalf over here, who has level 30, and write level = 30, because we're looking at the first row and that is its value of level; then I'm going to take the conditions of the CASE statement we are examining and add them here as a comment. Because in our first row level equals 30, I substitute that value for level. What we now have is a sequence of logical statements, and we have seen how to work with these in the lecture on Boolean algebra. Our job is to go through each statement in turn and evaluate it, and as soon as we find one that's true, we stop. The first one is 30 < 15, which is false, so we continue. The second one is a more complex statement: we have 30 >= 15, which is true, AND 30 < 25 (oops, I had not substituted it there, but I will do it now), which is false, and we know from Boolean algebra that true AND false evaluates to false, so the second statement is also false and we continue. Now we have 30 >= 25, which is true, so we have finally found a line that evaluates to true, and that means we return the value 'super'; and as you can see, for Gandalf we have indeed gotten the value super. Let's look very quickly at one more example: Legolas, who is level 22. I once again copy the whole thing as a comment and substitute 22 for every occurrence of level, because that's the row we're looking at. Looking at the first line, 22 < 15 is false, so we proceed; looking at the second line, 22 >= 15 is true and 22 < 25 is also true, so we get true AND true, which evaluates to true, and we return 'mid'; and looking at Legolas, we indeed get mid. This is how the CASE WHEN statement works, in short: for each row you insert the values that correspond to that row, in this case the value of level, then you evaluate each logical condition in turn, and as soon as one of them returns true, you return the value that corresponds to that condition and move on to the next row.

Now let me clean this up a bit. Looking at this statement again, and knowing what we now know about how it works, can we think of a way to optimize it, to make it nicer and remove redundancies? Think about it for a minute. One thing we can do is remove this little bit over here, the level >= 15 part, because it is making sure the character is not under 15 before classifying it as mid; but we already have the first condition, which makes sure that if the character is under 15, the statement outputs 'low' and moves on. So if the character is under 15, we never reach the second condition, and if we do reach the second condition, we already know the character is not under 15. This is due to the fact that CASE WHEN proceeds condition by condition and exits as soon as a condition is true. So I can effectively remove that part and, in the second condition, only check that the level is below 25, and you will see, if you run this, that our bucketing works just the same. The other improvement I can make is to replace the last line with an ELSE clause. The ELSE clause takes care of all the cases that did not meet any of the conditions we specified: the CASE statement goes condition by condition looking for one that's true, but if, in the end, none of the conditions were true, it returns whatever the ELSE clause says. It's like a fallback for the cases when none of our conditions turned out to be true.
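The simplified version might look like this sketch:

```sql
SELECT name, level,
       CASE
         WHEN level < 15 THEN 'low'
         WHEN level < 25 THEN 'mid'   -- rows reaching this line are already known to be >= 15
         ELSE 'super'                 -- fallback for rows that matched no condition
       END AS level_bucket
FROM fantasy.characters
WHERE is_alive = TRUE;
```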
If you look at our logic, you will see that if the first condition has returned false and the second has returned false, all that's left are characters whose level is 25 or greater, so it is sufficient to use an ELSE and call those super. If I run this, you will see that our bucketing works just the same: for example, Gandalf is still marked as super, because in his case the first condition returned false and the second returned false, so the ELSE output has been written there. Now, what do you think would happen if I completely removed the ELSE, if I only had the two conditions but it could be the case that neither of them is true? What will SQL do? Let us try it and see. The typical response in SQL when it doesn't know what to do is to use the NULL value, and if you think about it, it makes sense: we have specified what happens when level is below 15 and when level is below 25, but neither of those is true here, and we haven't specified what we want when neither is true. Because we have been silent on the issue, SQL has no choice but to put a NULL value there. This is practically equivalent to writing ELSE NULL; it is the default behavior when you don't specify an ELSE clause. Now, like every other piece of SQL, the CASE statement is quite flexible. For instance, you are not forced to create a text column out of it; you can also create an integer column, so you could define a simpler leveling system for your characters by using 1 and 2 and ELSE 3 for the higher-level characters, and this of course also works, as you can see here. However, one thing you cannot do is mix types, because the whole expression results in one new column, and, as you know, in SQL you're not allowed to mix types within a column, so always keep the types consistent. And when it comes to writing the WHEN condition, all the computational power of SQL is available: you can reference columns you are not selecting, you can run calculations, as I am doing here, and you can chain logical (Boolean) statements in complex ways. You can really do anything you want, although I generally suggest keeping it as simple as possible, for your sake and the sake of the people who use your code. And that is really all you need to know to get started with the CASE statement. To summarize: the CASE statement allows us to define a new column whose values change conditionally on the other values in the row. This is also called conditional logic, which means that we consider several conditions and behave differently based on which condition is true. The way it works is that in the SELECT statement, where you mention all your columns, you create a new column (in our case this one), bookend it with CASE and END, and between those you write your conditions: every condition starts with WHEN, is followed by a logical statement that must evaluate to true or false, and then has the keyword THEN and a value. The CASE WHEN statement goes through each condition in turn, and as soon as one evaluates to true, it outputs the value you specified; if none of the conditions evaluate to true, it outputs the value specified by the ELSE keyword, and if the ELSE keyword is missing, it outputs NULL. That is what you need in order to use the CASE statement, and then experience, exercises, and coding challenges will teach you when it's the case to use it, pun intended.
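For instance, a numeric variant might look like this sketch (level_tier is my name for the column); removing the ELSE line would leave NULL in the new column for characters of level 25 and above:

```sql
SELECT name, level,
       CASE
         WHEN level < 15 THEN 1
         WHEN level < 25 THEN 2
         ELSE 3                -- all branches must return the same type (here, integers)
       END AS level_tier
FROM fantasy.characters;
```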
Now, where does the CASE statement fit in our logical order of SQL operations? The short answer is that it is defined here, at the step where you select your columns: that's when you can use the CASE WHEN statement to create a new column that applies your conditional logic. This is the same as what we showed in the lecture on SQL calculations: you can use the SELECT statement not only to get columns that already exist but also to define new columns based on calculations and logic.

Now let us talk about aggregations, which are really a staple of any sort of data analysis. An aggregation is a function that takes any number of values and compresses them down to a single informative value. I'm looking here at my usual characters table, but in the version I keep in Google Sheets, and as you know we have this level column, which contains the level of each character. If I select this column in Google Sheets, you will see in the bottom-right corner a number of aggregations of this column, and, as I said, no matter how many values there are in the level column, I can use aggregations to compress them into one value. Here you see some of the most important aggregations you will work with: the sum, which simply adds up all the values; the average, which takes the sum and divides it by the number of values; the minimum; the maximum; the count; and the count of numbers, which here is the same. These are basically summaries of my column, and you can imagine, in cases where you have thousands or millions of values, how useful these aggregations can be for understanding your data. Here's how I can get the exact same results in SQL: I simply use the functions SQL provides for this purpose. As you can see here, I am asking for the sum, average, minimum, maximum, and count of the column level, and you can see the same results down here. Of course, I can also give names to these columns: for example, I could take this one and call it max_level, and in the result I get a more informative column name; I can do the same for all the columns. I can run aggregations on any columns I want, for example the maximum of experience, which I'll call max_experience, and I can also run aggregations on calculations that involve multiple columns as well as constants, so everything we've seen about applying arithmetic and logic in SQL applies here too.
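A sketch of those summaries in BigQuery, with aliases and an aggregation over a calculation:

```sql
SELECT SUM(level)          AS total_level,
       AVG(level)          AS avg_level,
       MIN(level)          AS min_level,
       MAX(level)          AS max_level,
       COUNT(level)        AS count_level,
       MAX(experience)     AS max_experience,
       AVG(level * 2 + 1)  AS avg_of_a_calculation   -- aggregates also accept expressions
FROM fantasy.characters;
```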
Of course, looking at the characters table, we know that our columns have different data types, and the behavior of the aggregate functions is sensitive to the data types of the columns. For example, let us look at the text columns we have, such as class. Clearly, not all of the aggregate functions we've seen will work on class: how would you take the average of these values? It's not possible. However, some aggregate functions also work on strings, and here's an example of aggregate functions we can run on a string column such as class. First we have COUNT, which simply counts the total number of non-NULL values (I will give you a bit more detail about the count functions soon). Then we have minimum and maximum: the way strings are ordered in SQL is something called lexicographic order, which is basically a fancy word for alphabetical order, so for the minimum we get the text value that occurs earliest in alphabetical order, whereas 'Warrior' occurs last. And finally, here's an interesting one called STRING_AGG. This function actually takes two arguments: the first, as usual, is the name of the column, and the second is a separator. What it outputs is a single string, a single piece of text, in which all of the individual pieces of text have been glued together, separated by the character we specified, which in our case is a comma.

Now, if you go to the Google documentation, you will find an extensive list of all the aggregate functions you can use in GoogleSQL; this includes the ones we've just seen, such as AVG and MAX, as well as a few others we will not explore in detail here. Let us select one of them, such as AVG, and see what the description looks like. You can see that this function returns the average of all values that are not NULL; don't worry about the expression "in an aggregated group" for now, just read it as "all the values you provide to the function, all the values in the column". There is also a bit about window functions, which we will see later, and in the caveats section there are some interesting edge cases: for example, what happens if you use AVG on an empty group, or if all the values are NULL? In that case it returns NULL, and so on; you can see what the function does when it hits these edge cases. And here is perhaps the most important section, "supported argument types", which tells you which types of columns you can use this aggregate function on. You can see that you can use AVG on any numeric input type, any column that contains some kind of number, and also on INTERVAL. We haven't examined INTERVAL in detail, but it is a data type that represents a span of time, so an interval could express something like 2 hours, or 4 days, or 3 months: a quantity of time. Finally, in the "returned data types" table you can see what the AVG function gives you based on the data type you put in. If you give it an integer column, it returns a float column, which makes sense because the average involves a division, and that division will usually give you floating-point values; but for the other allowed input types, such as NUMERIC, BIGNUMERIC, and so on, which are all data types that represent numbers in BigQuery, the AVG function will preserve the input data type. And finally we have some examples. So whenever you need an aggregate function, that is, whenever you need to take a sequence of many values and compress them down to one value, but you're not sure which function to use or what its behavior is, you can come to this page, look up the functions that interest you, and read the documentation to see how they work.
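A sketch of the string-friendly aggregates run on the class column:

```sql
SELECT COUNT(class)           AS non_null_values,
       MIN(class)             AS first_alphabetically,  -- e.g. 'Archer'
       MAX(class)             AS last_alphabetically,   -- e.g. 'Warrior'
       STRING_AGG(class, ',') AS all_classes_glued      -- one comma-separated string
FROM fantasy.characters;
```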
Now, here's an error that typically occurs when starting out with aggregations. You might say: I want to get the name of each character and their level, but I also want to see the average of all levels, because I want to compare each character's level with the overall average. So I write a query that goes to the fantasy.characters table and selects name, level, and AVG(level). But as you can already see, this query is not functioning: it gives an error, and the error says that the SELECT list expression references column name, which is neither grouped nor aggregated. So what does this actually mean? To show you, I've gone back to my Google Sheet, where I have the same data for my characters table, and I've copy-pasted our query over here. What this query does is take the name column (I'll copy-paste it over here), take the level column (copy-paste it here as well), and then compute the average of level. I can easily compute that with a Sheets formula by writing = and calling the function, which is actually called AVERAGE, selecting all these values inside it, and I get the average. This is the result SQL computes, but SQL is not able to return it, and the reason is that there are three columns with a mismatched number of values: these two columns have 15 values each, whereas this column has a single value, and SQL cannot handle that mismatch. As a rule, every SQL query must return a table, and a table is a series of columns where each column has the same number of values; if that constraint is not respected, you get an error. We will come back to this limitation when we examine advanced aggregation techniques, but for now just remember that you can mix non-aggregated columns with other non-aggregated columns, such as name and level, and you can mix aggregated columns with aggregated columns, such as AVG(level) with SUM(level), for example. I could simply do this, and I would be able to return it as a table, because there are two columns, both have a single row, the number of rows matches, and this is valid. But you might ask: can't I simply take this average value and copy it into every row, so that avg_level has the same number of values as name and level, and so return a table that respects the constraint? Indeed, this is possible, you can totally do it, and then the whole thing becomes a single valid table that you can return; however, it requires the use of window functions, a feature we will see in later lectures. But yes, it is totally possible, and it does solve the problem. Now, here's a special aggregation expression you should know about, because it is used often: COUNT(*). COUNT(*) simply counts the total number of rows in a table, and as you can see, if I say SELECT COUNT(*) FROM fantasy.characters, I get the total count of rows in my result. This is a common expression used across all SQL systems to figure out how many rows a table has, and of course you can combine it with filters, with the WHERE clause, to get other kinds of measures: for example, I could say WHERE is_alive = TRUE, and then the count becomes the count of characters who are alive in my data. So this is a universal way to count rows in SQL, although you should know that if you're simply interested in the total number of rows of a table and you are working with BigQuery, an easy and totally free way to get it is to go to the Details tab and look at the number of rows there. That was all I wanted to tell you about simple aggregations for now. One last question: why do I call them simple, simple as opposed to what? I call them simple because, the way we've seen them so far, the aggregations take all of the values of a column and simply return one summary value: for example, the SUM aggregation takes all of the values of the level column and returns a single number, which is the sum of all levels.
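A sketch contrasting the invalid mix with the valid alternatives, plus the COUNT(*) pattern:

```sql
-- Not valid: a non-aggregated column next to an aggregate (without GROUP BY):
-- SELECT name, level, AVG(level) FROM fantasy.characters;

-- Valid: aggregates only, returning a single row:
SELECT AVG(level) AS avg_level, SUM(level) AS total_level
FROM fantasy.characters;

-- COUNT(*) counts rows; add a WHERE filter to count only a subset:
SELECT COUNT(*) AS living_characters
FROM fantasy.characters
WHERE is_alive = TRUE;
```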
More advanced aggregations involve grouping our data. For example, a question we might ask is: what is the average level for mages, as opposed to the average level for archers, for hobbits, for warriors, and so on? Now you're computing aggregations not over your whole data but over groups found in your data, and we will see how to do that in the lecture on GROUP BY; but for now, you can already find out a lot of interesting things about your data by running simple aggregations.

Let us now look at subqueries and common table expressions, two fundamental functionalities in SQL. They solve a very specific problem, and the problem is the following: sometimes you just cannot get the result you need with a single query; sometimes you have to combine multiple SQL queries to get where you need to go. Here's a fun problem that will illustrate the point. We're looking at the characters table, and we have this requirement: we want to find all the characters whose experience is between the minimum and the maximum value of experience. Another way to say this: we want the characters who are more experienced than the least experienced character but less experienced than the most experienced character; in other words, we want that middle ground between the least and the most experienced characters. Let us see how we could do that. I have a simple start here, where I am getting the name and experience columns from the characters table. Let us focus on the first half of the problem: find the characters who have more experience than the least experienced character. Because this is a toy data set, I can sort of eyeball it: I can scroll down and see that the lowest value of experience is Pippin's, with 2100, so what I need to do is filter out of this table all the rows with that level of experience. But apart from eyeballing, how would we find the lowest value of experience in our data? If you thought of aggregate functions, you are right: we saw in a previous lecture that aggregate functions take any number of values and spit out a single summary value, for example the minimum and the maximum, and indeed we need a function like that for this problem. Your first instinct might be to take this table and filter the rows this way: WHERE experience > MIN(experience). On the surface this makes sense: I am using an aggregation to get the smallest value of experience and then keeping only the rows with a higher value. However, as you see from this red line, it does not actually work; it tells us that an aggregate function is not allowed in the WHERE clause. So what is going on here? If you followed the lecture on aggregation, you might have a clue as to why this doesn't work, but it is good to go back to it and understand exactly what the problem is. I'm going back to my Google Sheet over here, where I have the exact same data, and I've copied our current query down here; now let's see what happens when SQL tries to run it. SQL goes to the fantasy.characters table, and the second step in the logical order, as you remember, is to filter it. For the filter it has to take the experience column, so let me copy that column down here, and then it has to compute MIN(experience), so I will define that column here and use a Google Sheets function to get the result.
So, with =MIN and then selecting those numbers, I get the minimum value of experience. Now SQL has to compare these columns, but this comparison doesn't work, because these are two columns with a different number of rows, a different number of values. SQL is not able to do this comparison: you cannot do an element-by-element comparison between a column that has 15 values and a column that has a single value, so SQL throws an error. You might say: wait, there is a simple solution, just take this value and copy it all the way down until you have two columns of the same size, and then you can do the comparison. Indeed, that would work, that's a solution, but SQL doesn't do it automatically. If you work with other analytics tools, such as pandas in Python or NumPy, you will find that in a situation like this it would be done automatically: the value would be copied all the way down, through a process called broadcasting. But SQL does not take that many assumptions and that many risks with your data: if it literally doesn't work, SQL will not do it. So hopefully you now have a better understanding of why this attempt does not work.

How could we actually approach this problem? An insight is that I can run a different query (I will open it on the right) to find the minimum experience: I go back to the characters table and select MIN(experience), which is simply what we learned in the lecture on aggregations, and I get the value here, the minimum value of experience. Now that I know the minimum value of experience, I could simply copy this value and insert it into a WHERE filter, and if I run this, it will actually work; it solves my problem. The issue, of course, is that I do not want to hard-code this value: first of all, it is not very practical to run a separate query and copy-paste the value into the code, and second, the minimum value might change someday, I might not remember to update it in my code, and then this whole query would become invalid. To solve this problem I will use a subquery. I simply delete the hard-coded value, open round brackets, which is how you get started on a subquery, take the query I have over here and put it inside the brackets, and when I run this, I get the result I need. So what exactly is going on here? We are using a subquery, in other words a query within a query. When SQL looks at this code, it says: all right, this is the outer query, and it has an inner query inside it, a nested query; I have to start with the innermost, nested query. So SQL runs that query first and gets a value out of it, which in our case we know is 2100, and after that SQL substitutes this piece of code with the value that was computed, and we know from before that this works as expected. To handle the other half of our problem, we want our characters to have less experience than the most experienced character; this is just another condition in the WHERE filter, so I add an AND here and copy this code over, except that now I want experience to be smaller than the maximum of experience in my table. You might know this trick: if you select only part of your code, like this, and click Run, SQL executes only that part, and so here we get the actual maximum of experience, and we can write it here in a comment.
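A sketch of the finished query with both uncorrelated subqueries (2100 and 15,000 are the minimum and maximum quoted for this lecture's data):

```sql
-- Each subquery runs first, produces a single value, and is then used by the outer WHERE.
SELECT name, experience
FROM fantasy.characters
WHERE experience > (SELECT MIN(experience) FROM fantasy.characters)   -- evaluates to 2100
  AND experience < (SELECT MAX(experience) FROM fantasy.characters);  -- evaluates to 15000
```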
Now we know that when SQL runs this query, that whole subquery will be computed to 15,000, experience will be compared against that value, and the query will work as intended; and here is the solution to our problem.

Now, here's a second problem, which shows another side of subqueries: we want to find the difference between a character's experience and their mentor's. Let us solve it manually for one case in the characters table. Let us look at this character over here, Saruman, with ID 11, whose experience is 8500; Saruman has character ID 6 as their mentor, and if I look for ID 6 we find Gandalf (not very canon compared to the story, but let's just roll with it), and Gandalf has 10,000 experience. Now, if we compute the experience of Gandalf minus the experience of Saruman, we can see there is a 1,500 difference between them, and this is what I want to find with my query. Back to my query: I will first alias my columns in order to make them more informative, and this is a great trick for making problems clearer in your head, assigning the right names to things. Here, instead of id, I will call this mentee_id; here I have mentor_id; and here, instead of experience, I will call this mentee_experience; I have just renamed my columns. The missing piece of the puzzle is the mentor's experience: how can I get it? For example, in the first row I know that character 11 is mentored by character 6; how can I get the experience of character 6? Of course, I can open a new tab over here, split it to the right, go to fantasy.characters, filter for ID equal to 6, which is the ID of our mentor, and get their experience, which in this case is 10,000. This is the same example we saw before, but now I would have to write this separate query for each of my rows: six I've already checked, but then I would need to check two, and seven, and one, and this is really not feasible. The solution, of course, is a subquery. What I'm going to do here is open round brackets, and inside them write the code I need; I can simply copy the code I wrote over here: get experience from the characters table where ID equals 6. The 6 is still hard-coded, because in the first row mentor_id is 6. To avoid hard-coding it, there are two parts to the solution. The first is noticing that I am referencing the same table, fantasy.characters, in two different places in my code, which could get buggy and confusing, and the fix is to give separate names to these two instances.
Now, what are the right names to give? If we look at this outer query, it is really information about the mentee, because we have the mentee's ID, the ID of their mentor, and the mentee's experience, so I can simply call this the mentee table. As you can see, I can alias my table by simply writing the alias after it, or I could add the AS keyword; it works just the same. On the other hand, this table will give us the experience of the mentor, it is really information about the mentor, so we can call it the mentor table. Now we're not going to get confused anymore, because the two instances have different names. And what do we want this ID to be, if we're not going to hard-code it? We want it to be this value over here, the mentor_id value from the mentee table; we want it to be the mentee's mentor. To refer to that column, I write the table name, a dot, and the column name: this says get the mentor_id value from the mentee table. Now that I have the subquery, which defines a column with these two brackets, I can alias the result, just like I always do, and run this; and you will see, after I make some room here, that we have successfully retrieved the experience value of the mentor.

Now, I realize this is not the simplest process, so let us go back to our query over here and make sure we understand exactly what is happening. First of all, we are going to the characters table, which contains information about our mentee, the person being mentored, and we label the table so that we remember what it is about. We filter it, because we are not interested in characters that do not have a mentor, and then we get a few pieces of data: the ID, which in this case represents the ID of the mentee, their mentor_id, and their experience, which, again, since this is the table about the mentee, represents the mentee's experience. Our goal is to also get the experience of their mentor: we see a mentor_id of 6 and we want to know that that mentor's experience is 10,000, and we do that with a subquery, a query within a query. In this subquery, which is an independent piece of SQL code, we go back to the characters table, but this is another instance of the table, so to make sure we remember that, we call it the mentor table, because it contains information about the mentor. And how do we make sure we get the right value here, that we don't mix up different mentors? We make sure that, for each row, the ID of the character in this table is equal to the mentor_id value in the mentee table; in other words, we plug this value over here, in this case 6, into the table to get the right row, and from that row we take the experience value. All of this code defines a new column, which we call mentor_experience, and it is basically the same thing we did manually when we opened a table on the right, queried it, and copy-pasted a hard-coded value; this is just the way to do it dynamically, with a subquery. Now, we are not fully done with the problem, because we wanted to see the difference between a character's experience and their mentor's, so let's see how to do that. The way to do it is with a column calculation, just like the ones we've seen before.
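A sketch of the full query, including the difference computed in the next step (the aliases mentee, mentor, and experience_difference follow the ones mentioned in this lecture; the IS NOT NULL filter is my reading of "characters that do not have a mentor"):

```sql
SELECT mentee.id         AS mentee_id,
       mentee.mentor_id  AS mentor_id,
       mentee.experience AS mentee_experience,
       -- correlated subquery: re-evaluated for every row of the outer (mentee) table
       (SELECT mentor.experience
        FROM fantasy.characters AS mentor
        WHERE mentor.id = mentee.mentor_id)
       - mentee.experience AS experience_difference
FROM fantasy.characters AS mentee
WHERE mentee.mentor_id IS NOT NULL;   -- keep only characters that actually have a mentor
```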
We are not fully done with the problem, because we wanted the difference between a character's experience and their mentor's, and the way to get it is a column calculation just like the ones we have seen before. Given that the subquery column represents the mentor's experience, I can remove the alias here and here, and subtract the character's experience from it. A column minus a column gives another column, which I can then alias as the experience difference, and if I run this I see the value we originally computed manually: the difference between the mentor's and the mentee's experience. There is nothing really new about this, as long as you realize that this expression defines a column and this is a reference to a column, so you can subtract them and give the result an alias.

Now we can look at our two examples of nested queries side by side and ask what they have in common and where they differ. What they have in common is that both are problems you cannot solve with a simple query, because you need values that have to be computed separately, values you cannot simply refer to by name the way we usually do with columns. On the left you need the minimum and maximum values of experience; on the right you need the experience of a character's mentor. We solve that by writing a new, nested query, letting SQL solve it first, and plugging the result back into the original query to get the data we need.

There is, however, a subtle difference between these two queries that turns out to be pretty important in practice, and I can give you a clue by telling you that on the right we have what is called a correlated subquery, while on the left we have an uncorrelated subquery. What does that mean? On the left, the subqueries compute the minimum and maximum experience, and these are fixed values for the whole data set: it does not matter which character you are looking at. You could even imagine computing those two values first, before running your query, and substituting them in. That will not literally work, because you cannot define variables like that in SQL, but on a logical level you only need to compute them once. (I will revert this here so we do not get confused.) On the right, by contrast, the value returned by the subquery has to be computed dynamically for every row; as you can see in the results, it is different for every row, because every row references a different mentor ID. SQL cannot compute one value for all rows at once; it has to recompute it for each row. That is why we call it correlated: it is tied to the value in each row, so it must run for each row. An important reason to distinguish uncorrelated from correlated subqueries is that correlated subqueries can be slower and more expensive, since, at least at the logical level, you are running a SQL query for every row. So that was our introduction to subqueries: they let you implement more complex logic, and as long as you understand them logically, you are off to a great start; by doing exercises and solving problems, you will learn with experience when to use them.
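For comparison, here are the two patterns side by side as sketches (column names as assumed above; the uncorrelated version reproduces the min/max example the lecture refers to):

```sql
-- Correlated: the subquery logically runs once per row of the outer query.
SELECT
  mentee.id,
  (SELECT mentor.experience
   FROM fantasy.characters AS mentor
   WHERE mentor.id = mentee.mentor_id) - mentee.experience AS experience_difference
FROM fantasy.characters AS mentee
WHERE mentee.mentor_id IS NOT NULL;

-- Uncorrelated: MIN and MAX are the same fixed values for every row.
SELECT name, experience
FROM fantasy.characters
WHERE experience > (SELECT MIN(experience) FROM fantasy.characters)
  AND experience < (SELECT MAX(experience) FROM fantasy.characters);
```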
In the last lecture we saw that we can use subqueries to retrieve single values, for example the minimum value of experience in the data set. But we can also use subqueries, and Common Table Expressions as well, to create entirely new tables. Here is a motivating example. In this query I am scaling the value of level based on the character's class, which you might need to create some balance in your game or for whatever other reason. If the character is a Mage, the level is halved (multiplied by 0.5); if the character is an Archer or a Warrior, we take 75% of it; and in all other cases the level gains 50%. The details are not important, it is just an example; the point is that we modify the value of level based on the character's class, using the CASE WHEN statement we saw in a previous lecture, and as you can see in the results, we get a new power level value for each character.

Now suppose I wanted to filter my characters based on this new power level column, keeping only characters with a power level of at least 15. How would I do that? We know the WHERE filter is used to filter rows, so you might just add WHERE power_level >= 15. But this is not going to work, and we know it cannot work because of the logical order of SQL operations: the CASE WHEN column power_level is created at the SELECT stage, but the WHERE filter runs at the beginning, right after we source the table. By our rules, the WHERE clause cannot know about a column that only gets created later, so the query we just wrote violates the logical order of SQL operations, and this is why we cannot filter here.

There is actually one thing I could do to avoid a subquery and get around this error: drop the power_level alias, which the WHERE clause cannot know about, and replace it with the whole CASE WHEN logic. This looks pretty ugly, but if I run it, we do get the result we wanted. In the WHERE lecture we saw that the WHERE clause does not just accept simple logical statements; you can use all the calculations and techniques available at the SELECT stage, including CASE WHEN statements, which is why this solution works. However, it is obviously ugly and impractical, and you should never duplicate code like this, so I will remove this WHERE clause and show you how to achieve the same result with a subquery. Let me first rerun the original query so you can see the results. Now I will select this whole block of logic, wrap it in round brackets, and above it write SELECT * FROM. When I run this new query, the data should be unchanged, and indeed it has not changed at all. But what is actually happening? Usually we say SELECT * FROM fantasy characters, giving the name of a table that our system can access; now, instead of a table name, we supply a subquery, an independent piece of SQL logic.
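Here is a sketch of the wrapping step just described, with the CASE WHEN thresholds as stated in the lecture (the column spellings class and level, and the alias power_level, are assumptions); the WHERE filter is added in a moment:

```sql
-- The subquery in FROM builds the power_level column; the outer query just selects from it.
SELECT *
FROM (
  SELECT
    name,
    CASE
      WHEN class = 'Mage' THEN level * 0.5
      WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
      ELSE level * 1.5
    END AS power_level
  FROM fantasy.characters
);
```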
This subquery is a piece of SQL logic that returns a table, so SQL will look at this whole code and say: there is an outer query, this one, and there is an inner, nested query, this one; I will compute the inner one first and then treat its result as just another table to select from. And because it is just another table, we can now apply a WHERE filter on top of it: WHERE power_level >= 15. We get the result we wanted, just like before, but now our code looks better and the CASE WHEN logic is not duplicated.

If you wanted to visualize this in our schema, the flow of data is the following. First we run the inner query, which works just like every query we have seen until now: it starts with the FROM clause, which gets the table from the database, then goes through the usual pipeline of SQL logic and eventually produces a result, which is a table. Next, that table gets piped into the outer query. The outer query also starts with its FROM clause, but now the FROM is not reading directly from the database; it is reading the result of the inner query. The outer query then goes through the usual pipeline of components and finally produces a table, and that table is our result. This process can have many levels of nesting, because the inner query could reference another query, which references another query; eventually we would get to the database, but it could take many steps to get there.

To demonstrate how multiple levels of nesting work, I will go back into my inner query, which currently references the table in the database, and reference yet another subquery instead, something like SELECT * FROM fantasy characters WHERE is_alive = TRUE. Running this, we have added yet another subquery to our code. This was not necessary at all (you could add the WHERE filter up here); it is just to demonstrate that you can nest many queries within each other. The other reason I wanted to show you this code is that I hope you recognize this is also not a great way of writing code. It gets quite confusing and is not easily read and understood. One major issue is that it interrupts the natural flow of reading, because you constantly have to pause a query while another nested query begins inside it: you read SELECT * FROM, then another query starts, which itself queries yet another subquery, and only after reading all of those lines do you find the WHERE filter that actually refers to the outer query that started many lines back. If you find this confusing, I think you are right, because it is. The truth is that when you read code on the job, in the wild, or in solutions people propose to coding challenges, this occurs a lot: subqueries within subqueries within subqueries, and the code quickly becomes impossible to read. Fortunately there is a better way to handle this, one I definitely recommend, which is to use Common Table Expressions, which we shall see shortly. It is still very important that you understand this way of writing subqueries and familiarize yourself with it, because, whether we like it or not, a lot of code out there is written like this.
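Purely as an illustration of nesting (and of why it gets hard to read), here is the doubly nested version sketched above, now with the power_level filter included; column spellings remain assumptions:

```sql
SELECT *
FROM (
  SELECT
    name,
    CASE
      WHEN class = 'Mage' THEN level * 0.5
      WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
      ELSE level * 1.5
    END AS power_level
  FROM (
    -- Innermost subquery: the extra, strictly unnecessary level of nesting.
    SELECT * FROM fantasy.characters WHERE is_alive = TRUE
  )
)
WHERE power_level >= 15;
```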
We have seen that we can use the subquery functionality to define a new table on the fly, just by writing some code, a new table that we can then query like any other SQL table. What this allows us to do is run jobs that are too complex for a single query, and do so without defining and storing new tables in our database. It is essentially a tool to manage complexity. For subqueries it works like this: instead of writing FROM and then the name of a table, we open round brackets and write an independent SQL query inside. Every SQL query returns a table, and that is the table we then work on; here we SELECT * FROM this table and apply a filter on the new power_level column that we created in the subquery.

Now I will show you another way to achieve the same result, through a feature called Common Table Expressions. To build a Common Table Expression, I take the logic of this query, move it up, and give the table a name; I will call it the power level table. Then all I need to write is WITH power_level_table AS, followed by the logic, and now this is just another table available in my query, defined by whatever occurs within the round brackets. I can refer to it down here and query it however I need, and when I run this, we get the same results as before. This is how a Common Table Expression works: you start with the keyword WITH, you give an alias to the table you are going to create, you write AS, open round brackets, and write an independent query whose result will be available under that alias; then, in your code, you can query the alias just as you have done until now for any SQL table. Although the result has not changed, I would argue this is a better and more elegant way to achieve it, because we have separated the logic of the two different tables in the code. Instead of putting one query inside another and breaking its flow, we now have a much cleaner solution: first we define the virtual table we need (virtual meaning we treat it like a table, but it is not actually saved in our database; it is defined by our code), and then below that we have the logic that uses this virtual table.
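A sketch of the CTE version (the alias power_level_table, spelled with underscores, is my assumption about how the lecture names it):

```sql
WITH power_level_table AS (
  SELECT
    name,
    CASE
      WHEN class = 'Mage' THEN level * 0.5
      WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
      ELSE level * 1.5
    END AS power_level
  FROM fantasy.characters
)
SELECT *
FROM power_level_table
WHERE power_level >= 15;
```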
We can also have multiple Common Table Expressions in a query; let me show you what that looks like. In our previous example on subqueries, we added another part where, instead of querying the fantasy characters table directly, we queried a filter on it: SELECT * ... WHERE is_alive = TRUE. I am just reproducing what we did in the previous lecture on subqueries. You will notice this is really not necessary, because all we are doing is adding a WHERE filter, which we could do in this query directly, but please bear with me, because I want to show you how to handle multiple queries. The second thing I want to say is that although this code actually works (you can verify it for yourself), I do not recommend doing this, meaning mixing Common Table Expressions and subqueries; it adds unnecessary complexity to your code. Here we have a Common Table Expression that contains a subquery, and I would rather turn this into a situation where we have two Common Table Expressions and no subqueries at all.

To do that, I take this logic, paste it at the top, and give it an alias; I will call it characters_alive, but you can call it whatever works best for you. Then comes the AS keyword, and I add some line breaks to make it more readable. Once we are defining multiple Common Table Expressions, we only write the WITH keyword once, at the beginning; after that we simply add a comma (please remember this, the comma is very important), then the alias of the new table, the AS keyword, and the logic for that table. All that is left is to fill in the FROM, since we took away the subquery: we now query the characters_alive virtual table. This is what it looks like, and if you run it, you get your result.

So this is the syntax when you have multiple Common Table Expressions: the keyword WITH, which you only need once; the alias of your first table, the AS keyword, and its logic between round brackets; then, for every extra virtual table, every extra Common Table Expression, you only need a comma, another alias, the AS keyword, and the logic between round brackets. When you are done listing your Common Table Expressions, omit the comma; a trailing comma will break your code. Finally, you run your main query. In each of these queries you are totally free to query real, materialized tables that exist in your database as well as Common Table Expressions you have defined in this code; in fact, you can see that our second virtual table here queries the first one. Be advised, however, that the order in which you write these Common Table Expressions matters: a Common Table Expression can only reference Common Table Expressions that came before it, not ones that come after it. If, instead of FROM fantasy characters, I try to query power_level_table in the first one, BigQuery gives an error because it does not recognize it, since the code defining it is below. So the order in which you write them matters.
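A sketch of the two-CTE version described above (the aliases characters_alive and power_level_table are spelled as assumptions):

```sql
WITH characters_alive AS (
  SELECT *
  FROM fantasy.characters
  WHERE is_alive = TRUE
),
power_level_table AS (
  SELECT
    name,
    CASE
      WHEN class = 'Mage' THEN level * 0.5
      WHEN class IN ('Archer', 'Warrior') THEN level * 0.75
      ELSE level * 1.5
    END AS power_level
  FROM characters_alive   -- a CTE may reference CTEs defined before it, never after
)
SELECT *
FROM power_level_table
WHERE power_level >= 15;
```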
An important question to ask is: when should I use subqueries and when should I use Common Table Expressions? The truth is that their functionality is basically equivalent; what you can do with a subquery you can do with a Common Table Expression. My very opinionated advice is that every time you need to define a new table in your code, you should use a Common Table Expression: they are simpler, easier to understand, cleaner, and they will make your code more professional. In fact, I can tell you that in the industry it is a best practice to use Common Table Expressions instead of subqueries, and if I were interviewing you for a data job, I would definitely pay attention to this. There is an exception, though, and it is the reason I am showing you this query, which we wrote in a previous lecture on subqueries: a query where you need a single, specific value. If you remember, we wanted characters whose experience is above the minimum experience in the data and below the maximum, characters in the middle. To do this we need to find dynamically, at the moment the query runs, the minimum experience and the maximum experience, and a subquery is great for that. You will notice that here we do not really need to define a whole new table; we just need a specific value, and that is where a subquery works well, because it implements very simple logic and does not break the flow of the query. For something more complex, like the power_level_table query we are using here, which takes the name, takes the level, and applies CASE WHEN logic to create a new power_level column, you could do it with a subquery, but I actually recommend a Common Table Expression.

There is a cool blog post on this topic by the company dbt. It talks about Common Table Expressions in SQL, why they are so useful for writing complex SQL code, and best practices for using them, and towards the end of the article there is an interesting comparison between Common Table Expressions and subqueries. You can see that CTEs are more readable, whereas subqueries are less readable, especially if there are many nested ones; a subquery within a subquery within a subquery quickly becomes unreadable. The article also lists recursion as an advantage of CTEs, and although we will not examine that in detail, the related point is reusability: once you define a Common Table Expression in your code, you can reuse it in any part of it, in multiple places, in other CTEs, in your main query, and so on, whereas once you define a subquery, you can only use it in the query in which you defined it; you cannot use it in other parts of your code. A less important factor is that a CTE always needs a name, whereas subqueries can be anonymous; you can see that very well here, where we had to name both of our CTEs but the subqueries are anonymous. I would not call that a huge difference. Finally, CTEs cannot be used in a WHERE clause, whereas subqueries can, and that is exactly the example I have shown you: a simple value used in a WHERE clause to filter our table. Subqueries are the perfect use case for that, whereas CTEs are suited to more complex use cases where you need to define entire tables. In conclusion, the article says CTEs are essentially temporary views; I have used the term virtual table, but temporary view works just as well and conveys the same idea. They are great for giving your SQL more structure and readability, and they also allow reusability.

Before we move on to other topics, I want to show you what an amazing tool Common Table Expressions are for creating complex data workflows, because they are not just a trick for executing certain SQL queries; they are a tool that lets us build data pipelines within our SQL code, and that can really give us data superpowers. Here I have drawn a typical workflow that you will see in complex SQL queries that make use of Common Table Expressions. What we are looking at is a single SQL query, though a complex one because it uses CTEs, and it is represented graphically here and as a simple code reference here. The blue rectangles represent the Common Table Expressions, the virtual tables you can define with the CTE syntax, whereas the red square represents the base query, the query at the bottom of your code that ultimately returns the result.
A typical flow will look like this. You have a first Common Table Expression, call it t1, which is a query that references a real table, a table that actually exists in your data set, such as fantasy characters. This query does some work: it can apply filters, calculate new columns, and so on, everything we have seen until now. Then the result of this query gets piped into another Common Table Expression, t2, which takes whatever happened in t1 and applies some further logic, some more transformations. Then the result gets piped into yet another table where more transformations run, and this can go on for any number of steps until you reach the final query, the base query, where we finally compute the end result that will be returned to the user. This is effectively a data pipeline that gets data from the source and applies a series of complex transformations. It is similar to the logical schema of SQL that we have been looking at, except one level further up: in our usual schema the steps are performed by clauses, the components of a SQL query, but here every step is a query in itself. This is a very powerful feature: the pipeline applies many queries sequentially until it produces the final result, and you can do a lot with that capability.

You should now also be able to understand how this is implemented in code. We have the usual CTE syntax: WITH, then the first table, which we call t1, and its logic within round brackets, where you can see that the FROM references a table in the data set. Then, for every successive Common Table Expression, we just add a comma, a new alias, and the logic, and finally, when we are done, we write our base query. You can see that the base query selects from t3, t3 selects from t2, t2 selects from t1, and t1 selects from the database.

You are not limited to this type of workflow, though. Here is another, slightly more complex workflow that you will also see in the wild. At the top there are two Common Table Expressions that reference the database: the first gets data from table one and transforms it, the second gets data from table two and transforms it. Next there is a third CTE that combines data from those two tables. We have not yet seen how to combine data except through UNION; I wrote "join" here, which we are going to see shortly, but all you need to know is that t3 combines data from these two parent tables. Finally, the base query uses not only the data from t3 but also goes back to t1 and uses that data as well. Remember we said that a great thing about CTEs is that the tables are reusable: you define them once and can then use them anywhere. Here is an example with t1: it is defined at the top of the code, referenced by t3, and also referenced by the base query. This is another example of a workflow you could have, and really the limit is your imagination and the complexity of your needs; complex workflows such as this one can implement very complex data requirements. This has been a short overview of the power of CTEs, and I hope you are excited to learn about them and to use them in your SQL challenges.
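To make the chained pattern concrete, here is a minimal skeleton of such a pipeline; t1, t2, t3 and the transformations inside them are placeholders of my own, not code from the course:

```sql
WITH t1 AS (
  -- Step 1: read from a real table in the data set and filter it.
  SELECT * FROM fantasy.characters WHERE is_alive = TRUE
),
t2 AS (
  -- Step 2: transform the result of t1.
  SELECT name, class, level * 2 AS boosted_level FROM t1
),
t3 AS (
  -- Step 3: aggregate the result of t2.
  SELECT class, AVG(boosted_level) AS avg_boosted_level FROM t2 GROUP BY class
)
-- Base query: produces the final result returned to the user.
SELECT * FROM t3 ORDER BY avg_boosted_level DESC;
```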
are a powerful way to bring many different tables together and combine their information, and I am going to start us off with a little motivating example. On the left I have my characters table, which by now we are familiar with. Say I wanted to know, for each character, how many items they are carrying in their inventory. You will notice this information is not available in the characters table; however, it is available in the inventory table. So how exactly does the inventory table work? When you look at a table for the first time and want to understand how it works, the best question you can ask is: what does each row represent? Looking at the columns, every row of this table has a specific character ID and an item ID, as well as a quantity and some other information, such as whether the item is equipped and when it was purchased. Looking at this, I realize that each row in this table represents a fact: the fact that a character has an item. I can see that character ID 2 has item 101, character ID 3 has item 6, and so on, and clearly I can use this to answer my question.

So how many items is Gandalf carrying? To find out, I have to look up the ID of Gandalf, which, as you can see, is 6, then go to the inventory table and look for that ID in the character_id column. Unfortunately it is not ordered, but looking for myself I can see that at least this row relates to Gandalf, because it has character ID 6, and it tells me Gandalf has item 16 in his inventory; I also see another one, item 11, and no other items at the moment. So, based on my imperfect visual analysis, I can say that Gandalf has two items in his inventory. Of course, our analysis skills are not limited to eyeballing: we have learned that we can query a table for the information we need. I could open a new tab and query the inventory table: give me all the columns from the inventory table where character_id equals 6; that should give me all the information for Gandalf. Running that, I indeed see two rows, and I know Gandalf has items 16 and 11 in his inventory. We do not know exactly what those items are, but we know he is carrying two items, which is a good start. And what if I wanted to know which items Frodo is carrying? Again, I can go to the characters table, look up the name Frodo, and find that Frodo is ID 4; plugging that number into my WHERE filter, I find that Frodo is carrying a single type of item, with ID 9, although in a quantity of two.
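For reference, the manual lookup just described is a one-line filter (a sketch, assuming the table path fantasy.inventory and the column spelling character_id):

```sql
SELECT *
FROM fantasy.inventory
WHERE character_id = 6;   -- Gandalf's ID; use 4 to see Frodo's items instead
```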
Of course, I could go on and do this for every character, but it is quite impractical to change the filter every time, and what if I wanted to know, all at once, how many items each character is carrying, or at least which items? This is where joins come into play. What I really want is to combine these two tables into one, and by bringing them together create a new table that will have all of the information I need. So let us see how to do this. The first question we must answer is: what unites these two tables, what connects them, what can we use to combine them? We have actually already seen it in our example: the inventory table has a character_id field which refers to the ID of the character in the characters table. So we have two columns, the character_id column in inventory and the id column in characters, which actually represent the same thing, the identifier of a character, and this logical connection, the fact that these columns represent the same thing, can be used to combine the tables.

Let me start a fresh query. As usual, I will start with the FROM part. Where do I want to get my data from? From the characters table, just as we have been doing until now; however, the characters table alone is no longer enough for me. I need to join this table on the fantasy.inventory table. And how do I want to join these two tables? We know the inventory table has a character_id column which is the same as the characters table's id column; as we said before, these two columns from the different tables represent the same thing, so there is a logical connection between them, and we will use it for the join. I want to draw your attention to the notation we are using here: because this query has two tables present, it is not enough to simply write the names of columns, it is also necessary to specify to which table each column belongs. We do that with this dot notation: inventory.character_id says we are talking about the character_id column in the inventory table, and characters.id is the id column in the characters table. It is important to write columns with this notation to avoid ambiguity whenever you have more than one table in your query.

Until now we have used the FROM clause to specify where we want to get data from, and normally that was simply the name of a table. Here we are doing something very similar, except that we are creating a new table, obtained by combining two pre-existing tables. We are not getting our data from the characters table, and we are not getting it from the inventory table, but from a brand-new table that we have created by combining these two, and this is where our data lives. To complete the query for now, we can simply add SELECT *, and you will see the result. Let me make some room and expand these results so I can show you what we got: we have a brand-new table, and if you check the columns you will notice that it includes all of the columns from the characters table and also all of the columns from the inventory table, combined by our join statement. To get a better sense of what is happening, let us get rid of the star and select the columns we are interested in, once again written with the dot notation to avoid ambiguity. Remember that we have all of the columns from the characters table and all of the columns from the inventory table to choose from: I will take the id column and the name column from characters, then, because I want to see the ID of each item, the item_id column from the inventory table, and from the inventory table I will also take the quantity of each item.
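A sketch of the join as built so far, using the fully qualified column notation described above (column spellings are assumptions based on the lecture):

```sql
SELECT
  characters.id,
  characters.name,
  inventory.item_id,
  inventory.quantity
FROM fantasy.characters
JOIN fantasy.inventory
  ON inventory.character_id = characters.id;
```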
To make our results clearer, I will order them by the character's id and by item_id, and you can see that we get the result we needed: all of our characters with their IDs and their names, and for each character we can tell which items are in their inventory. You can see that Aragorn has item ID 4 in his inventory, in a quantity of two, and he also has item 99, which is why Aragorn has two rows; looking back at Frodo, we see the information we retrieved before, and the same for Gandalf, who has his two items. So we have combined the characters table and the inventory table to get the information we needed. What does each row represent in our result? The same as in the inventory table: each row is a fact, namely that a certain character possesses a certain item. But unlike the inventory table, we now have all the information we want for a character, not just the ID; here we are showing the name of each character, but we could of course select more columns and get more information for each character as needed.

Now a short note on notation. When you see SQL code in the wild and a query joins two or more tables, programmers are usually quite lazy and do not feel like writing the table name all the time, the way we are doing here with characters. So what we usually do is add an alias to each table: FROM fantasy characters, call it c, join on inventory, call it i, and then use these aliases everywhere in the query, both in the join condition and in the column names. I will substitute everything here; yes, maybe it is a bit less readable, but it is faster to write, and we programmers are quite lazy, so you will often see this notation. You will also often see the AS keyword omitted for table aliases, since it is implicit in SQL code.
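With the table aliases and the ordering added, the same query reads like this (a sketch):

```sql
SELECT
  c.id,
  c.name,
  i.item_id,
  i.quantity
FROM fantasy.characters c      -- the AS keyword is implicit and often omitted
JOIN fantasy.inventory i
  ON i.character_id = c.id
ORDER BY c.id, i.item_id;
```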
So we write it as FROM fantasy.characters c JOIN fantasy.inventory i, where c and i refer to the two tables we are joining, and I can run this and show you that the query works just as well.

Now that we have seen why a join is useful and what it looks like, I want you to get a detailed understanding of how exactly the logic of a join works, and for this I am going to go back to my spreadsheet. What I have here are my characters table and my inventory table, just as you have seen them in BigQuery, except that I am only taking four rows of each to keep the example simple, and here is the same query I just ran in BigQuery: a query that takes the characters table, joins it on the inventory table on this particular condition, and then picks a few columns. Let us see how to simulate this query in Google Sheets.

The first thing I need to do is build the table my query will run on, because, as we said, the FROM part is now referencing neither the characters table nor the inventory table but the new table built by combining the two, so our first job is to build that new table. The first step is to take all of the columns from characters and put them in the new table, then take all of the columns from inventory and put them in as well. What we have obtained is the structure of our new table, created simply by taking all of the columns of the table on the left along with all of the columns of the table on the right. Next, I go through each character in turn and consider the join condition: that the id of a character is present in the character_id column of inventory. My first character is Aragorn, and he has ID 1. Is this ID present in the character_id column? Yes, I see it in the first row, so we have a match. Given the match, I take all of the data I have in the characters table for Aragorn, then all of the data from the matching inventory row, and I have built my first result row. Is there any other row in the inventory table that matches? Yes, the second row also has a character_id of 1, so I repeat the operation: all of Aragorn's data from the left table plus all of the data from the matching row on the right. There are no more matches for ID 1, so I proceed with Legolas, who has character ID 2. Is there any row with the value 2 in the character_id column? Yes, I can see it here, so another match: I take the information for Legolas and paste it here, then take the matching row, which is this one, and paste it alongside. We move on to Gimli, because there are no other matches for Legolas. Gimli has ID 3, and I can see a match over here, so I take the row for Gimli, paste it here, and then take the matching row with character_id 3 and paste it here. Finally we come to Frodo, character ID 4. Is there any match for this character? I can find no match at all, so I do nothing: this row does not come into the resulting table, because there is no match. That completes the job of this part of the query, building the table that comes from joining these two tables. This is my resulting table, and now, to complete the query, I simply have to pick the columns the query asks for.
The first column is characters.id, which is this column over here, so I take it and put it in my result. The second column I want is characters.name, which is this column over here. The third is the item_id column, which is this one right here, and finally I have quantity, which is this one right here, and this is the final result of my query.

Of course, this is just like any other SQL table, so I can use everything else I have learned to run logic on it. For example, I might only want to keep items that are present in a quantity of at least two. To do that, I simply add a WHERE filter here, and I refer to the inventory table, because that is the parent table of the quantity column: WHERE i.quantity >= 2. The query then works like this: first it builds the joined table as we have just seen, so it does this stage first, and then it runs the WHERE filter on that table and keeps only the rows where quantity is at least two, so as a result we only get this row over here instead of the full result; and of course we also keep only the columns specified in the SELECT statement, so we get id, name, item_id, and quantity. This is the result of my query after adding the WHERE filter. Let us take it to BigQuery and make sure it works; I have to add it after the FROM and JOIN part and before the ORDER BY part, that is the order, and after running it I see that I indeed get Aragorn and Frodo. It is not exactly the same as in our sheet, but that is because our sheet has less data; this is what we wanted to achieve.

Now let us go back to our super important diagram of the logical order of SQL operations and ask ourselves: where does the join fit into this schema? As you can see, I have placed JOIN at the very beginning of our flow, together with FROM, because the truth is that the JOIN clause is not really separate from the FROM clause; they are one and the same component in the logical order of operations. As you remember, the first stage specifies where our data lives, where we want to get our data from, and until now we were content to answer that question with a single table name, the address of a single table, because all the data we needed was in just one table. Now we are taking it a step further: we are saying our data lives in a particular combination of two or more tables, so let me tell you which tables I want to combine and how I want to combine them. The result of this is, of course, yet another table, and that table is the beginning of my flow; after that, I can apply all the other operations I have come to know, and it works just like all our previous examples, because the result of a join is just another table. So when you look at a SQL query that includes a join, you really have to see the join as one and the same with the FROM part: it defines the source of your data by combining tables, and everything else you do will be applied not to a single table, not to any of the tables you are combining, but to the resulting table that comes from the combination. That is why FROM and JOIN are really the same component, and why they are the first step in the logical order of SQL operations.
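For reference, the example with the quantity filter added, with the clauses in the order just discussed (a sketch):

```sql
SELECT
  c.id,
  c.name,
  i.item_id,
  i.quantity
FROM fantasy.characters c
JOIN fantasy.inventory i
  ON i.character_id = c.id
WHERE i.quantity >= 2          -- the filter runs on the already-joined table
ORDER BY c.id, i.item_id;
```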
Let us now briefly look at multiple joins, because sometimes the data you need is spread across three or four tables, and you can actually join as many tables as you want, or at least as many as your system allows before it becomes too slow. We have our example from before: each character, their name, and which items are in their inventory, but we do not actually know what the items are, only their IDs. So if Aragorn has item 4, how can I know what item that actually is, what its name is? Obviously this information is available in the items table, here on the right, which as you can see has a name column, and just like before I can eyeball it: I know I am looking for item ID 4, and going down to 4 I can see that this item is a healing potion. Now let us see how we can add this with a join. I go to my query, and after joining characters with inventory, I take that result and simply join it on a third table: JOIN fantasy.items, which I can call it as a brief form, because I am lazy, as all programmers are. Then I need to specify the condition on which to join: the item_id column, which actually came from the inventory table (that is its parent), so inventory.item_id, except that I am referring to inventory by its brief form i, equals the id column in the items table. Now that I have added my condition, the data I am sourcing is a combination of these three tables, and in my result I have access to the columns of the items table, which I can use simply by referring to them; so I will select it.name, and something else, say it.power.
After I run this query, I can see the name and the power for each item: Aragorn has a healing potion with a power of 50, Legolas has an Elven bow with a power of 85, and so on. Now, you may have noticed something a bit curious: the item's name column is actually written as name_1. Can you figure out why this is happening? It is happening because there is an ambiguity: the characters table has a column called name, and the items table also has a column called name, and because BigQuery does not refer to the columns in the output the way we do, with the parent table and then the column name, it would find itself with two identically named columns, so it distinguishes the second one by appending _1. The remedy is to rename the column to something more meaningful, for example item_name, which is a lot clearer for whoever looks at the result of our query, and as you can see, the name now makes more sense.

So the multiple join is actually nothing new: when we join the first time, as we did before, we combine two tables into a new one, and then this new table gets joined to a third table. It is simply the join operation repeated twice.
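Putting it together, the three-table query described above would look roughly like this (a sketch; the aliases c, i, and it follow the lecture, the rename to item_name avoids the ambiguous name_1, and column spellings are assumptions):

```sql
SELECT
  c.id,
  c.name,
  it.name AS item_name,   -- renamed so it does not clash with the character's name column
  it.power,
  i.quantity
FROM fantasy.characters c
JOIN fantasy.inventory i
  ON i.character_id = c.id
JOIN fantasy.items it
  ON i.item_id = it.id
ORDER BY c.id, i.item_id;
```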
But let us simulate a multiple join in our spreadsheet to make sure we understand it and see that it really is nothing new. Again I have our tables here, plus the items table, which we will combine, and I have written out our query: take the characters table and join it with inventory, as we did before, then take the result of that and join it to items, with the condition written here. The first thing to do is process our first join, which is exactly what we did before: the structure of the combined characters-and-inventory table is obtained by taking all the columns of characters and all the columns of inventory and putting them side by side. As for the logic, I will do it faster this time because we have done it already: the first character, ID 1, has two matches, so I take these values into two rows and copy the two matching inventory rows to complete them; Legolas has one match, so I take the left side and the row with ID 2, that is all we have; Gimli also has one match, so the same again; and finally Frodo has no match, so I do not add him to my result. Exactly what we did before. Now that we have this new table, we can proceed with our next join, which is with items. The resulting table will be the result of our first join combined with items, and to show that the first join has already been computed and is now one table, I have added round brackets. The rules for joining are just the same: take all of the columns of the left-side table, then all of the columns of the right-side table, and that is the structure of the result; then go through every row. In the first row, the join condition says the item_id needs to appear in the id column of items; I can see a match, so I simply take this row on the left side and the matching row on the right side and add them to the result. In the second row, the item ID is 4; do I have a match? Yes, I can see that I do, so I paste the row on the left and the matching row on the right. In the third row, item ID 2, I have no match, so I do not need to do anything, and in the final row, item ID 101, I do not see a match either, so again nothing. This is my final result. In short, a multiple join works just like a normal join: combine the first two tables, get the resulting table, and then keep doing this until you run out of joins.

Now, there is another special case of join, the self join, and this is something that people who are getting started with SQL tend to find confusing. I want to show you that there is nothing confusing about it, because it really is just a regular join that works like all the other joins we have seen; there is nothing actually special about it. Here is the characters table, and you might remember that each character has a mentor_id column. In a lot of cases this column is NULL, meaning there is nothing there, but in some cases there is a value, and what that means is that this particular character (we are looking at number 3, Saruman) has a mentor, and all we know is that the mentor's ID is 6. It turns out that the ID in this column refers to the id column of the characters table itself, so to find out who 6 is, I just look for the character whose ID is 6, and I can see that it is Gandalf. By eyeballing it, I know that Saruman has a mentor and that mentor is Gandalf, and Elrond also has the same mentor, Gandalf. So I can solve this by eyeballing the table, but how can I get a table that shows, for each character who has a mentor, who their mentor is? It turns out I have to take the characters table and join it on itself; let us see how that works in practice. I start a new query here on the right; my goal is to list every character in the table and also show their mentor, if they have one. I will of course need the characters table for this, and the first time I take this table it is simply to list all of the characters, so to remind myself of that I give it the label chars. As you know, each character has a mentor_id value, but to find out the name of that mentor I actually need to look it up in the characters table, so I join on another instance of the characters table. This is another copy of the same data, but now I am going to use it for a different purpose: not to list my characters, but to get the name of the mentor, so I call it mentors to reflect that use. What is the logical connection between these two copies of the characters table? Each character in my list of characters has a mentor_id field, and I want to match it on the id field of my mentors table; that is the logical connection I am looking for. I can now add SELECT * to quickly complete my query and see the results over here. The resulting table has all of the columns of the left table and all of the columns of the right table, which means the columns of the characters table are repeated twice in the result, as you can see, but on the left I simply have my list of characters, the first one being Saruman, and on the right I have the data about their mentor: Saruman has a mentor_id of 6.
Next to that begins the data about the mentor: he has an ID of 6 and his name is Gandalf, so you can see that our self join has worked as intended. It is a bit messy, though; we do not need all of these columns, so let us now select only the ones we need: from my list of characters I want the name, and from the corresponding mentor I also want the name, and I will label these columns so they make sense to whoever looks at my data, calling the first one character_name and the second mentor_name. When I run this query, you can see that, quite simply, we get what we wanted: the list of all our characters, at least the ones who have a mentor, and for each character the name of their mentor.

A self join works just like any other join, and the key to avoiding confusion is to realize that you are joining two different copies of the same data; you are not actually joining on the same exact table. One copy of fantasy characters we call chars and use for one purpose, and a second copy we call mentors and use for another purpose. When you realize this, you see that you are simply joining two tables, and all the rules you have learned about normal joins apply; it just so happens that in this case the two tables are identical, because you are getting the data from the same source. To drive the point home, let us quickly simulate this in our trusty spreadsheet. As you can see, I have the query I ran in BigQuery, and we are now going to simulate it. The important thing to see is that we are not actually joining one table to itself, although that is what it looks like: we are joining two tables which just happen to look the same, one called chars and one called mentors, based on the labels we gave them. Once we join them, the rules are the same as we have seen until now: to create the structure of the resulting table, take all the columns from the left and then all the columns from the right, then go row by row and look for matches based on the condition, which is that mentor_id in chars needs to appear in the id column of mentors. First row: Aragorn has mentor 2; is 2 in the id column? Yes, I can see a match, so let me take all the values from this row and all the values from the matching row and paste them together. Are there any other matches? No. Second row: we are looking for mentor_id 4; do we have a match? Yes, I can see it here, so again I take all of the values from the left and all of the values from the matching row on the right. We have two more rows, but as you can see, in both cases mentor_id is NULL, which means they have no mentor, and basically, for the purposes of the join, we can ignore these rows; we are not going to find a match for them. In fact, as an aside, even if there were a character whose id was NULL, a NULL mentor_id would not match it, because in SQL, in a sense, NULL does not equal NULL: NULL is not a specific value, it represents the absence of data. So, in short, when mentor_id is NULL we can be sure there will be no match, and the row will not appear in the join. Now that we have our result, we simply need to select the columns we want: the name that comes from the chars table, which is this one over here, and the name that comes from the mentors table, which is this one over here, and here is our result.
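A sketch of the self join as described, with the aliases chars and mentors and the output labels used in the lecture (column spellings assumed):

```sql
SELECT
  chars.name   AS character_name,
  mentors.name AS mentor_name
FROM fantasy.characters chars      -- one copy of the data: the list of characters
JOIN fantasy.characters mentors    -- a second copy: used only to look up the mentor's name
  ON chars.mentor_id = mentors.id;
```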
So that is how a self join works. Until now we have seen join conditions that are pretty strict and straightforward: there is a column in the left table and a column in the right table, they represent the same thing, and you look for an exact match between them, typically on an ID number; one table has the item ID, the other table also has the item ID, and if there is an exact match you include the row in the join, otherwise you do not. What I want to show you here is that the join is actually much more flexible and powerful than that. You do not always need two columns that represent the exact same thing, or an exact match, in order to write a joining condition; you can create your own complex conditions and combinations that decide how to join two tables, and for this you can simply use the Boolean algebra magic we have learned about in this course and have been using, for example, in the WHERE filter. Let us see how this works in practice.

I have tried to come up with an example that illustrates this. Say we have a game, a board game or video game or whatever, with our characters and our items, and in our game a character cannot simply use every item in the world; there is a limit to which items a character can use, based on the following rule, which I will write here as a comment and then use in our logic: a character can use any item whose power does not exceed the character's experience divided by 100. This is just a rule that exists in our game. Now let us say we wanted a list of all characters and the items they can use. This is clearly a case where we need a join, so let us write the query.
I start by getting my data from fantasy.characters, which I call c as a shorthand, and I need to join on the items table. What is the condition of the join? The condition is that the character's experience divided by 100 is greater than or equal to the item's power; I forgot to add the shorthand i for the items table, so let me do that. This is the condition that reflects our rule. Out of the table I have created, I would like to see the character's name, the character's experience divided by 100, and then the item's name and the item's power, to make sure the join is working as intended. Let us run this and look at the result. It looks a bit odd because we have not given a label to the calculated column, but I can see Gandalf, whose experience divided by 100 is 100, and he can use the item Excalibur, which has a power of 100, satisfying our condition. Let me order by character name so I can see in one place all of the items a character can use: Aragorn comes first, his experience divided by 100 is 90 (the same in all of these rows), and then we see all of the items Aragorn is allowed to use and their power; in each case the power does not exceed this value on the left, so the condition we wrote works as intended.

As you can see, what we have here is a Boolean expression just like the ones we have seen before: a logical statement that, when evaluated, comes out either true or false, and all the rules we have seen for Boolean expressions apply here as well. For example, I can decide that this rule does not apply to Mages, because Mages are special, and say that if a character is a Mage, I want them to be able to use all of the items. How can I do this in this query? Pause the video and try to figure it out. What I can do is simply expand my Boolean expression by adding an OR, and what I want to test for is that the character's class equals Mage (let me check for a second that the column is class and the value is Mage, so this should work). If I run this, I will not go through the whole result, but you can verify for yourself that if a character is a Mage, they can now use all of the items. This is just a Boolean expression with two statements connected by an OR: if at least one of the two is true, the whole statement evaluates to true, and so the row matches. If you have trouble seeing this, go back to the video on Boolean algebra, where everything is explained. And this is just what we did before when we simulated the join in the spreadsheet: imagine taking the left-side table, characters, going row by row, and for each row checking all of the rows in the right-side table, items; but this time, instead of checking whether an ID corresponds, you run this expression to see whether there is a match. When the expression evaluates to true, you consider it a match and include the row in the join; when it does not, it is not a match and the row is not included. This is simply a generalization of the exact match, which shows that you can use any condition to join two tables.
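A sketch of the non-equi join with the Mage exception added (the alias experience_over_100 is mine; the column spellings class, experience, and power are assumptions based on the lecture):

```sql
SELECT
  c.name,
  c.experience / 100 AS experience_over_100,
  i.name AS item_name,
  i.power
FROM fantasy.characters c
JOIN fantasy.items i
  ON c.experience / 100 >= i.power   -- the game rule
  OR c.class = 'Mage'                -- Mages are special: they can use any item
ORDER BY c.name;
```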
So far I've been pretending that there is only one type of join in SQL, but that is not true: there are a few different types we need to know, so let's see what they are and how they work.

Here is the query we wrote before, exactly as we wrote it, where we simply specified JOIN. It turns out that what we were doing all along is called an INNER JOIN. Now that I've written it explicitly, you can see that rerunning the query gives exactly the same results. The inner join is by far the most common type of join in SQL, so many dialects, including the one used by BigQuery, let you skip the keyword and write just JOIN, which is then treated as an inner join. When you want an inner join, you can either specify it explicitly or simply write JOIN.

What I want to show you now is another type, the LEFT JOIN, and to see how it works, let's simulate it in the spreadsheet. The setup is very similar to before: I have the query I want to simulate (note the LEFT JOIN) and my two tables. What is the purpose of a left join? With an inner join, the resulting table contains only rows that have a match in both tables: we went through every row of the characters table, kept it if it had a match in the inventory table, and discarded it completely if it did not. But what if we want the result to contain all of the characters, so our list of characters is complete regardless of whether they have a match in the inventory table? That is exactly what the left join is for: it keeps every row of the left table, whether it has a match or not. Let's see that in practice.
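As a sketch, the left-join query being simulated might look like this, assuming the lecture's fantasy.characters and fantasy.inventory tables and their id / character_id columns:

```sql
SELECT
  c.id,
  c.name,
  inv.item_id,
  inv.quantity
FROM fantasy.characters AS c
LEFT JOIN fantasy.inventory AS inv
  ON inv.character_id = c.id
ORDER BY c.id;
-- Characters with no inventory rows (e.g. Frodo) still appear,
-- with NULL in item_id and quantity.
```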
So let's do the left join between characters and inventory. First I determine the structure of the resulting table: all of the columns from the left table plus all of the columns from the right table, nothing new there. Next I go row by row in the left table and look for matches. Aragorn has two matching rows (their character_id equals his id), so both go into the result. Legolas has one match, so that row goes in. Gimli also has a single match, so I create that row as well; I can check that I'm doing this correctly by comparing the id column with the character_id column, which must be identical, otherwise I've made a mistake. Finally we come to Frodo, who has no match in the inventory table. Before, with the inner join, we discarded this row; with a left join, every row of the characters table must be included, so I have no choice: Frodo's row goes into the result. What values do I put in the columns that come from the inventory table? I cannot take them from inventory, because there is no match, so the only thing I can do is put NULLs there. NULL represents the absence of data, so it is perfect for this use case, and that completes the sourcing part of our left join.

You may have noticed there is an extra row in inventory that has no match either: it refers to character_id 10, but there is no character with id 10. Frodo's row had no match and we included it, so should we include this one too? No. This is a left join: we include every row of the left table even without a match, but we do not include unmatched rows from the right table. That is exactly why it is called a left join; if this is still confusing, don't worry, it will become clearer once we see the other join types. For completeness, I can finish the query by selecting my columns, the character id, the character name, the item id, and the item quantity, and that is the final result. For Frodo we get NULL values, which tells us this row found no match in the right table, in other words that Frodo has no items.

Once you understand the left join, the right join is easy: it is simply the symmetrical operation. Whether you write characters LEFT JOIN inventory or inventory RIGHT JOIN characters, the result is identical; that is why I wrote that table A LEFT JOIN table B equals table B RIGHT JOIN table A. Hopefully that is intuitive. Of course, if I wrote characters RIGHT JOIN inventory, the roles would be reversed: I would keep all rows of inventory, matched or not, and only the rows of characters that have a match. Experiment on the data and you will easily convince yourself of this.

Let's now see the left join in practice with the query from before, where we take each character and look up their mentor; this is the code exactly as we wrote it. You now know this is an inner join, because when you don't specify a join type, SQL assumes INNER JOIN, at least in BigQuery, and if I write INNER JOIN explicitly (fixing a small typo) the result is absolutely identical. In this version we only include characters who have a mentor; characters without a mentor, whose mentor_id is NULL, have no match in the inner join, so they are discarded. What happens if I turn this into a LEFT JOIN instead? I expect to keep all of my characters, all the rows of the left-side table, regardless of whether they have a mentor. Running it confirms this: I now have a row for every character, including Gandalf, who has no mentor and therefore shows a NULL. The left join lets me keep all the rows of the left table.
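A sketch of that mentor lookup written as a left join, assuming a self-referencing mentor_id column on the characters table (column names follow the lecture and may differ):

```sql
SELECT
  c.name AS character_name,
  m.name AS mentor_name       -- NULL when the character has no mentor (e.g. Gandalf)
FROM fantasy.characters AS c
LEFT JOIN fantasy.characters AS m
  ON m.id = c.mentor_id;
```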
We have now seen the inner join, the left join, and the right join, which are really the same thing mirrored. Finally, I want to show you the FULL OUTER JOIN, the last type of join I want to cover. A full outer join is like a combination of all the joins we've seen so far: it gives us every row that has a match in both tables, plus every row of the left table that has no match in the right table, plus every row of the right table that has no match in the left table. Let's see how that works in practice. Here is our usual query, but now specifying FULL OUTER JOIN, and let's simulate it between the two tables.

The first step, as usual, is to take all the columns from the left table and all the columns from the right table to get the structure of the result. Now I go row by row in the left table. Aragorn I can copy into the result straight away, because even without a match I would keep the row; this is a full outer join and I am not discarding anything. Is there a match? Yes, we already know two rows of inventory have character_id 1, so I copy them alongside and replicate Aragorn's values on the second row. Legolas I can also paste immediately, and he has one match; Gimli, moving quickly because we've seen this, also has a match. Then comes Frodo: again I keep the row, but he has no match, so exactly as with the left join I fill the columns that come from the inventory table with NULLs.

I've now been through all the rows of the left table, but I'm not done: in a full outer join I must also include every row of the right table. Are there inventory rows I have not considered yet? Comparing the inventory ids already in my result (1, 2, 3, 4) with the ids in the table (1 through 5), I see that row 5 was never selected by any match. Because this is a full outer join, I add it to the result, and since it has no counterpart in the left table I once again insert NULLs in the left-side columns. That completes the first phase. The last phase is always the same: pick the columns listed in the SELECT, the id, the name, the item id, and the quantity, and the full outer join is complete.
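The full outer join being simulated, as a sketch over the same assumed tables:

```sql
SELECT
  c.id,
  c.name,
  inv.item_id,
  inv.quantity
FROM fantasy.characters AS c
FULL OUTER JOIN fantasy.inventory AS inv
  ON inv.character_id = c.id;
-- Unmatched characters get NULL item_id / quantity;
-- unmatched inventory rows get NULL id / name.
```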
Remember how I said a full outer join is like an inner join plus a left join plus a right join? Here is a visualization that demonstrates it. In the result, the green rows are the rows that have a match in both the left and the right table; they correspond to the inner join, and an inner join would return only those rows. The purple row is present in the left table but has no match in the right table, so a left join would return all the green rows plus the purple row, because a left join keeps every row from the left. If instead you ran a right join, without swapping the table names (still characters RIGHT JOIN inventory), you would of course get all the green rows, because they are matches, plus the blue row at the end, because that row exists in the right table even though it has no match, and a right join keeps every row of the right table. Finally, a full outer join includes all of these rows: all the matches, then all the unmatched rows from the left table, then all the unmatched rows from the right table. These are the three (or four) types of join you need to know and that you will find useful when solving problems.

Here is yet another way to think about and visualize joins that you might find helpful. One way to think about SQL tables is that a table is a set of rows, and joins correspond to different ways of combining sets. You may remember this from school: it is a Venn diagram, showing the relation between two sets and the elements inside them. Take set A to be our left table, containing all of its rows, and set B to be our right table. In the middle is the intersection of the sets, which represents the rows that have a match, the rows I coloured green in our example. If I select only the intersection, only the rows that belong in both tables, that corresponds to an inner join. If I want all the rows of the left table, matched or not, that corresponds to a left join: the left join produces the complete set of records from table A together with the matching records from table B, and where there is no match the right side contains NULL. Likewise, keeping all the rows of table B, including the ones that match A, gives a right join, which is simply the mirror image of a left join; and keeping all the rows of both tables, matched or not, gives a full outer join. So this is just another way to visualize what we have already seen.

There is one more thing you can read off this diagram. In some cases you might want all the records that are in A except those that have a match in B, everything A does not have in common with B. You can do this with a left join plus a filter: WHERE the B key IS NULL. To see what that means, go back to our left-join example: because Frodo had no match in the right table, the inventory id column in his row is NULL, so if I take that result and filter WHERE the inventory id IS NULL, I get exactly the one row of the left table that has no match in the right table. This is a special case and you don't see it a lot in practice, but I wanted to show it briefly in case you try it and get curious about it.
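A sketch of that "in A but not in B" pattern (sometimes called an anti-join), using the same assumed tables:

```sql
SELECT
  c.id,
  c.name
FROM fantasy.characters AS c
LEFT JOIN fantasy.inventory AS inv
  ON inv.character_id = c.id
WHERE inv.character_id IS NULL;   -- keep only characters with no inventory match
```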
Likewise, the last thing you can do is get all the rows from A and from B that have no match at all, the set of records unique to table A plus the set unique to table B. This is very similar: you do a full outer join and check that either key is NULL, so either the inventory id is NULL or the character id is NULL. Applying that filter gives the rows that are only in A plus the rows that are only in B. Again, I have honestly never used this in practice; I mention it only for completeness in case you get curious.

Now, a brief but very important note on how SQL organizes data. You may remember from the start of the course that SQL tables are in some ways similar to spreadsheet tables, with two fundamental differences. One is that each SQL table has a fixed schema, so we always know what the columns are and what type of data they contain, and we have seen extensively how that works. The second is that SQL tables are connected with each other, which is what makes SQL so powerful, and we are finally in a position to understand exactly how, which will let you understand how SQL represents data.

I've come to dbdiagram.io, a very nice website for building representations of SQL data. The kind of chart you see here is known as an ER diagram, which stands for entity-relationship diagram: a diagram that shows how your data is organized in your SQL system. You can see a representation of each table; this is the example shown on the website, with three tables: users, follows, and posts. For each table you can see the schema: the users table has four columns, a user id which is an integer, a username which is VARCHAR (another way of saying string, a piece of text), a role which is also text, and a timestamp showing when the user was created.

The important thing to notice is that these tables do not exist in isolation: they are connected to each other through the arrows you see here. What do the arrows represent? Look at the follows table: each row records the fact that one user follows another, so each row contains the id of the user who follows, the id of the user who is followed, and the time when this happened. The arrows are telling us that the ids in this table are the same thing as the user id column in the users table, which means you can join the follows table with the users table to get information about both users, the one who is following and the one who is followed. As we've seen before, one table has a column that is the same thing as another table's column, which means you can join them and combine their data. This is how tables in SQL are connected: by logical correspondences that allow you to join them and combine their data.
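For instance, reading the two arrows on the follows table, it could be joined to users twice, once per side of the relationship. This is only a sketch based on the sample diagram described above; the column names (user_id, following_user_id, followed_user_id) are assumptions:

```sql
SELECT
  follower.username AS follower_name,
  followed.username AS followed_name,
  f.created_at      AS followed_at
FROM follows AS f
JOIN users AS follower
  ON follower.user_id = f.following_user_id   -- the user who follows
JOIN users AS followed
  ON followed.user_id = f.followed_user_id;   -- the user being followed
```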
Likewise there is the posts table, where each row represents a post and each post carries a user id, and the arrow tells you that you can join it to the users table on that id to get all the information you need about the user who created the post. Of course, as we have seen, you are not limited to joining tables along these lines: you can join on any condition you can think of. But these lines are a guarantee of consistency between the tables, coming from how the data was distributed, a promise that you can get the data you need by joining on these specific columns. And that is really all you need to know to get started with joins and use them to explore your data and solve SQL problems.

To conclude this section, let me go back to our diagram and remind you that FROM and JOIN are really one and the same: they are how you get the data you need in order to answer your question. When the data lives in a single table, you can get away with just FROM and the table name, but often your data is distributed across many tables, so you can look at an ER diagram like this one (if you have it) to figure out how your data is laid out, decide which tables to combine, and write a FROM combined with JOINs. That creates a new table which is a combination of two or more tables, and all the other operations you've learned run on top of it.

We are finally ready for an in-depth discussion of grouping and aggregations in SQL. Why is this important? I asked ChatGPT to show me some typical business questions that can be answered by data aggregation: What is the total revenue by quarter? How many units did each product sell last month? What is the average customer spend per transaction? Which region has the highest number of sales? These are some of the most common and fundamental questions you ask when doing analytics, which is why grouping and aggregation are so important in SQL.

Let's open our data in the spreadsheet once again and see what we can achieve through aggregation. I've copied four columns from my characters table: guild, class, level, and experience, and I'm going to ask a few questions. The first: what are the level measures by class? Earlier in the course we looked at aggregations, and we called them simple aggregations because we ran them over the whole table. If I select the values in the level column, the lower right of my screen shows a few different aggregations: a count of 15, meaning there are 15 rows of level, a maximum level of 40, a minimum of 11, an average of roughly 21.3, and a sum of 319. That is already some useful information.
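In SQL terms, those whole-table "simple aggregations" would look something like this sketch, assuming the fantasy.characters table and its level column from the lecture:

```sql
SELECT
  COUNT(*)   AS n_rows,       -- 15 rows in the example
  MAX(level) AS max_level,
  MIN(level) AS min_level,
  AVG(level) AS avg_level,
  SUM(level) AS total_level
FROM fantasy.characters;
```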
Now I'd like to take it a step further and compute these aggregate values within each class: for example, what is the maximum level for Warriors, what is the maximum level for Hobbits, are they different, how do they compare? This is where grouping comes into play, so let's find the maximum level within each class and see how we might achieve it. To make things quicker, I sort the data to fit my purpose: I select the range, go to Data, Sort range, and in the advanced options I sort by column B, because that is my class column. The data is now ordered by class and I can see the different values for each class. Next I take the distinct values of class and separate them: first Archer, then Hobbit, then Mage, and finally Warrior, each in its own block. Finally I compress each of these ranges down to a single row. For Archer I keep the class value and compress the numbers to a single number using the MAX function; this is the aggregation function we are using, and quite intuitively it looks at the list of values, picks the biggest one, and reduces everything to that value (you can also see this in the tooltip). Doing the same for Hobbit, Mage, and Warrior, and bringing these rows together, gives my result.

This is exactly what I asked for: I was looking for the maximum level within each class, so I took all the unique values of class, and within each class I compressed all the level values to a single number by taking the maximum. The summary shows that Mages are much more powerful than everyone else and Hobbits much weaker, at least by this measure, so I've learned something new about my data.

Crucially, and this is very important, in my result class is a grouping field and level is an aggregate field. Class is a grouping field because it divides my data into several groups: based on the value of class, Archer has three values, Hobbit has four, and so on. Level is an aggregate field because it was obtained by taking a list of several values (three here, four there, and in the wild it could be thousands or millions) and compressing, or aggregating, them down to one value. Whenever you work with groups and aggregations you have this division: some fields are used for grouping, for subdividing the data, and some fields have aggregations run on them, such as the maximum, the average, or the minimum of a list of values. Aggregations are what allow you to understand the differences between groups: after aggregating, you can say that Mages are certainly much more powerful than Hobbits, and so on. If you work with dashboards like Tableau or other analytical tools, you will see the grouping fields referred to as dimensions and the aggregate fields as measures; the two pairs of terms refer to the same idea.

Now let's see how to achieve the same result in SQL.
I start a new query: I want data from fantasy.characters, and after sourcing the table I want to define my groups. For that I use GROUP BY, which is my new clause, followed by the grouping field, the field I want to use to subdivide the data, which in this case is class. After that I define the columns I want in my result: SELECT the class, and then the maximum level within each class. Running this gives exactly the same result I had in Google Sheets.

We have seen this before: MAX is an aggregation function that takes a list of values and compresses them down to a single value, except that before we ran it at the level of the whole table. If I select just that part of the query and run it alone, what do you expect to see? A single value, because it looks at every level in the table and reduces them all to the biggest one. But if I run it after defining a GROUP BY, it no longer runs over the whole table at once: it runs within each group identified by my grouping field and computes the maximum within that group, so the result shows the maximum level for each group.

I am not limited to a single aggregation; I can write as many as I wish. Let me give this one a label so it makes sense, and then add a few more: COUNT(*), which is simply the number of rows within each class, the minimum level, and the average level. Running this, I get the unique values of class as usual, and for each class as many aggregated values as I want: the maximum level, the minimum level, the average (which we can label avg_level, since we forgot to), and the number of values. Note that n_values does not refer to level at all; it is a more general aggregation that simply counts how many examples of each class there are, so I can see I have four Mages, three Archers, four Hobbits, and four Warriors just by looking at that value.
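Putting those pieces together, a sketch of the grouped query described above (table and column names as used in the lecture):

```sql
SELECT
  class,
  MAX(level) AS max_level,
  MIN(level) AS min_level,
  AVG(level) AS avg_level,
  COUNT(*)   AS n_values    -- number of characters in each class
FROM fantasy.characters
GROUP BY class;
```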
And here is another thing: I am absolutely not limited to the level column. I also have the experience column, another integer, and the health column, a floating-point number, so I can get the maximum health and the minimum experience, and it all works the same way: every aggregation is computed within each class. One thing I need to be really careful about is the match between the type of aggregation I want to run and the data type of the field I run it on. All the columns above are numeric, integers or floats. What would happen if I ran the AVG aggregation on the name column, which is a string? You can already see it is an error: no matching signature for aggregate function AVG for argument type STRING. The function accepts integers, floats, and other numeric types, but if you ask it to average a bunch of strings, it has no idea how to do that. So I can add as many aggregations as I want within my grouping, but the aggregations need to make sense for the data types involved.

The expressions themselves can be as complex as I like, though. Instead of taking the average of name, which makes no sense, I can wrap another function inside, LENGTH, which for each name counts how long it is, and then aggregate those counts by taking their average. What I get back is the average name length within each class. It doesn't sound very helpful as a thing to calculate, but it shows that these expressions can get quite complex.

Whatever system you work with, its documentation will list all the aggregate functions at your disposal. Here is that page for BigQuery, and going through the list you will see some of the functions I've shown you, such as COUNT and MAX, and others I haven't used in this example, such as SUM, which adds up all the values, ANY_VALUE, which simply picks one value (I believe at random), ARRAY_AGG, which builds a list out of the values, and so on. When you need to do an analysis, you can start by asking yourself how you want to subdivide the data, what groups you want to find in it, and then what aggregations you need within each group, what you want to know about each group. Then you can look here for the aggregate function that works best, open its documentation, and read the description; for AVG, for example: returns the average of non-NULL values in an aggregated group, supporting any numeric input type as well as INTERVAL, which represents a span of time.

In the previous example we used a single grouping field, class, to subdivide the data, but you can use multiple grouping fields, so let's see how that works. Here I have my items table: for each item we have an item type and a rarity, and we know the item's power. What if we wanted to see the average power by item type and rarity combination? One reason to want this is to ask: within every item type, is it always true that power increases as you go from common to rare to legendary, or does it only hold for certain item types? Let's find out. I'm going to use two fields, item type and rarity, to subdivide my data. As a first step I sort the data to make this convenient: Data, Sort range, advanced sorting options, sort by column A, which is item type, and add a second sort column, column B, which is rarity; you can see the data is now sorted. Next I take each unique combination of the values of my two grouping fields: the first combination is armor and common, which has only one value, 40.
Next comes armor and legendary, which also has a single value, 90; then armor and rare, which has two values, so I write both down; then potion and common has three values, and so on. I've gone ahead and done this for every combination, so for each unique combination of item type and rarity I've copied the relevant power values. Now I need the average power within each combination. For armor/common this is easy, there is a single value, so I simply write 40; armor/legendary is likewise a single value; for armor/rare I have two values, so I press equals, call the spreadsheet's AVERAGE function, and select the two values; for potion/common I take the average of its three values; potion/legendary is a single value; and so on. Completing this gives me the result of my query: all the different combinations of item type and rarity, and within each combination the average power.

So, to answer my question: does power grow with rarity within each item type? For armor it goes from 40 to 74 to 90, so yes; for potion there is no rare one, but it also grows from common to legendary; and for weapon we have 74, 87, and 98. So I would say yes: within each item type, power grows with the level of rarity.

What are these three fields in the context of my grouping? Item type is a grouping field, rarity is also a grouping field, and the average power within each group is an aggregate field. I am now using two grouping fields to subdivide my data and computing the aggregation within those groups.

Let's figure out how to write this in SQL; it is quite similar to what we've seen before. We take our data FROM the items table, then GROUP BY our grouping fields, which are item_type and rarity; this defines the groups. Then in the SELECT I list my grouping fields and, within each group, the average of power. And here are the results, just like in the sheet. One tiny detail: you may notice that power is coloured blue in the editor. That is because POWER is actually a BigQuery function (POWER(2, 3) returns 8, two to the power of three), so it can be confusing when power is also the name of a column and BigQuery might think you mean the function. There is an easy remedy: wrap it in backticks, which is your way of telling BigQuery not to get confused, this is the name of a column, not a function; as you can see, it works and creates no issues. And just as before, we could add as many aggregations as we want, for example the sum of power, or aggregations on other fields, and everything would be computed within the groups defined by the two grouping fields I have chosen, exactly as expected.
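A sketch of that two-field grouping (fantasy.items with item_type, rarity, and power assumed from the lecture; backticks around power because POWER is also a BigQuery function name):

```sql
SELECT
  item_type,
  rarity,
  AVG(`power`) AS avg_power,
  SUM(`power`) AS total_power
FROM fantasy.items
GROUP BY item_type, rarity
ORDER BY item_type, rarity;
```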
Now let's see where GROUP BY fits in the logical order of SQL operations. As you know, a SQL query starts with FROM and JOIN: this is where we source the data we need. As we learned in the join section, we can specify a single table in the FROM clause or a join of two or more tables; either way the result is the same, we have assembled the table where our data lives, and we will run the rest of the pipeline, all the next operations, on that data. Next the WHERE clause comes into play, which we use to filter out rows we don't need, and then the GROUP BY executes. So the GROUP BY works on the data we sourced, minus the rows we excluded, and it fundamentally alters the structure of our table, because, as you have seen in our examples, it compresses, or squishes, the values: in the grouping field you get a single row for each distinct value, and in the aggregate field you get one aggregated value per group. After the GROUP BY I can compute my aggregations (minimum, maximum, average, sum, count, and so on), which has to happen after the grouping has been applied, and after that I can SELECT them, choosing which columns to see, which will include the grouping fields and the aggregated fields; we shall look at this more closely in a second. Then, finally, come all the other operations we have seen in this course. That is where GROUP BY and aggregations fit in our order of SQL operations.
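A small annotated sketch of that logical order, reusing the grouped query from above; the WHERE filter is hypothetical, added only for illustration, and the comment numbers show the order in which the clauses are logically evaluated, not the order they are written:

```sql
SELECT                          -- 5. pick the columns: grouping fields + aggregations
  item_type,
  AVG(`power`) AS avg_power     -- 4. aggregations computed within each group
FROM fantasy.items              -- 1. source the data (FROM / JOIN)
WHERE rarity != 'common'        -- 2. drop unneeded rows before grouping (hypothetical filter)
GROUP BY item_type              -- 3. define the groups; the table is "squished"
ORDER BY avg_power DESC;        -- 6. the remaining operations run on the grouped result
```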
Now I want to show you an error that is extremely common when starting to work with GROUP BY, and if you understand it, I promise you will avoid a lot of headaches when solving SQL problems. I have my items table here again, the preview on the right, and a simple SQL query: take the items table, GROUP BY item_type, then show me the item type and the average power within that item type. So far so good. But what if I wanted to see what I'm sketching in the comments: each specific item, with the name of that item, the type of that item, and then the average power for that type? Look at the first item, Chain Mail Armor: it is an armor-type item, and we know the average power for armors is 69.5, so I would like to see that row. Take Elven Bow: it is a weapon, the average power for weapons is 85.58, and I would like to see that too. Stop for a second and think: how might I modify my SQL query to achieve this? (There is also a typo in the column name in my comment; I meant to write name.) You might be tempted to simply add the name field to the SELECT in order to reproduce the sketch, but if I do that and run it, I get an error: the SELECT expression references column name, which is neither grouped nor aggregated. Understanding this error is what I want to achieve now, because it is very important, so try to figure out on your own why this query is failing and what exactly the message means.
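For reference, a sketch of the failing query and the kind of error it triggers (error wording paraphrased from the lecture):

```sql
SELECT
  item_type,
  name,                        -- neither grouped nor aggregated -> error
  AVG(`power`) AS avg_power
FROM fantasy.items
GROUP BY item_type;
-- Error: SELECT list expression references column name
-- which is neither grouped nor aggregated
```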
Let me go back to my spreadsheet with a copy of the items table and the query that doesn't work, and reproduce it by hand. I take the items table (already sorted by item type to make our life easier) and group by item type. For the armor group I select the item type, armor, and the average power, which I compute with the spreadsheet's AVERAGE function over the power values. Then I'm asked for name, so I take the names that belong to armor and put them next to it, and here you can already see the problem: for this particular group, armor, there is a mismatch in the number of rows that each column provides. As an effect of grouping by item type, there is now only one row in which the item type is armor; as an effect of applying AVERAGE to power within the armor group, there is only one power value corresponding to armor; but name appears neither in the GROUP BY nor inside an aggregate function, so for name we still have four values. Four values instead of one, and that mismatch is the issue: SQL cannot accept it, because it does not know how to combine columns that have different numbers of rows. In a way, SQL is telling us: you told me to group the data by item type, and I did, I found all the rows that correspond to armor; you told me to take the average power for those rows, and I did; but then you asked me for name, and the item type armor has four names in it. What am I supposed to do with them? How am I supposed to squish them into a single value? You haven't explained how, so I cannot do it.

This brings us to a fundamental rule of SQL, something I like to call the law of grouping. It is quite simple but essential: it tells you what kinds of columns you can select after a GROUP BY, and there are exactly two. One, grouping fields: the columns that appear after the GROUP BY clause, the columns you are using to group the data. Two, aggregations of other fields: fields that go inside a MAX, MIN, AVG, SUM, COUNT, and so on. Those are the only two kinds of columns you can select; try to select anything else and you get an error. The reason is exactly what we illustrated: after a GROUP BY, each value of the grouping fields appears exactly once, and the aggregation makes sure there is only one corresponding value in the aggregated field (here, one average power per item type); any other field, one that is neither grouped nor aggregated, still contributes all of its values, and then there is a mismatch. The law of grouping exists to prevent that.

Going back to our SQL, hopefully you now see why this error is happening; in fact the error message makes much more sense once you know the law of grouping: you are referencing a column, name, which is neither grouped nor aggregated. How could we change the code so that we include name without triggering the error? We have two options: turn it into a grouping field, or turn it into an aggregation.

Let's try the aggregation first, say MIN(name). What do you expect to happen? Running it, I still have my grouping by item type and the average power within each type, and then one name: when you run MIN on a sequence of text values, it gives you the first value in alphabetical order, so we are seeing the alphabetically first name within each item type. We've overcome the error, even if the field is not very useful (we don't really care which name comes first alphabetically), but at least the aggregation guarantees a single name value per item type, so the law of grouping is respected and the error is gone.

The second alternative is to add name as a grouping field, which simply means listing it after item_type in the GROUP BY. What do you expect if I run that? The results as shown are a bit misleading because the name column is hidden, so let me also add it to the SELECT; I can now reference name in the SELECT without an aggregation, because it is a grouping field. But what do we actually see? We've seen what happens when you group by multiple columns: the unique combinations of those columns end up subdividing the data. So our average power values are no longer divided by item type: we no longer have the average power for armor, potion, and weapon, but the average power for the one item that is of type armor and is called Chain Mail Armor (a single row, with power 70), the average power for the one item called Cloak of Invisibility, which is also an armor, and so on. We've overcome the error by adding name as a grouping field, but we have lost the original division by item type and subdivided the data to the point where it no longer means anything.
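The two workarounds just described, as sketches; note that neither actually answers the original question:

```sql
-- Option 1: aggregate the extra column (first name alphabetically per type)
SELECT
  item_type,
  MIN(name)    AS first_name_alphabetically,
  AVG(`power`) AS avg_power
FROM fantasy.items
GROUP BY item_type;

-- Option 2: make the extra column a grouping field
-- (groups become one per item, so the per-type average is lost)
SELECT
  item_type,
  name,
  AVG(`power`) AS avg_power
FROM fantasy.items
GROUP BY item_type, name;
```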
As you have surely noticed by now, we made the error disappear by including name, but we haven't achieved our original objective, which was to show the name of each item, its item type, and the average power within that type. To be honest, my real objective was to teach you to spot this error and understand the law of grouping, but now you might rightfully ask: how do I actually achieve this? The answer, unfortunately, is that you cannot achieve it with GROUP BY, not in a direct, simple way. This is a limitation of GROUP BY, which is a very powerful feature but does not satisfy every requirement for aggregating data. The good news is that it can easily be achieved with another feature, called window functions. Window functions are the subject of another section of this course, so I'm not going to go into depth now, but let me write the window function for you just to demonstrate that it can be done easily. I take the items table and select the name and the item type, and then the average of power (again with backticks so BigQuery doesn't confuse the column with the function of the same name) OVER, PARTITION BY item_type; this is like saying the average of power based on this row's item type, and I'll call it avg_power_by_type. Selecting and running this gives exactly what we need: Chain Mail Armor, which is an armor, and 69.5, the average power for an armor. So the original objective can be achieved, unfortunately not with grouping, but with window functions.
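A sketch of that window-function query (window functions are explained in detail later in the course):

```sql
SELECT
  name,
  item_type,
  AVG(`power`) OVER (PARTITION BY item_type) AS avg_power_by_type
FROM fantasy.items;
```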
Now I want to show you how to filter on aggregated values after a GROUP BY. Here is a basic grouped query: go to the fantasy characters table, group it by class, and show the class and, within each class, the average experience of all the characters in it; you can see the results. What if I only want to keep the classes where the average experience is at least 7,000? One instinct you might have is to add a WHERE filter, for example WHERE avg_experience >= 7000, but running that gives an error: unrecognized name avg_experience. Maybe it's a labelling problem, so what if I write the logic instead of the label, WHERE AVG(experience) >= 7000? Also no: an aggregate function is not allowed in the WHERE clause. What is happening? If we look at the order of SQL operations, the WHERE clause runs right after the data is sourced, and by our rules an operation can only use data produced before it and knows nothing about data produced after it. The WHERE operation therefore has no way of knowing about aggregations, which are computed later, after it runs and after the GROUP BY, and that is why aggregations are not allowed inside a WHERE filter.

Luckily, SQL gives us the HAVING operation, which works just like the WHERE filter except that it works on aggregations, and it can do so because it happens after the GROUP BY and after the aggregations. To summarize: you source the table, you can drop rows before grouping (that is what the WHERE filter is for), you do your grouping and compute your aggregations, and after that you have another chance to drop rows, with a filter that runs on your aggregations. Let's see it in practice: instead of those WHERE attempts, I go back to our actual result, and after the GROUP BY I write HAVING AVG(experience) >= 7000; removing the broken filter and running the query gives exactly what we need.

You might be thinking: why do I have to write the function out again, can't I just use the label I assigned? Let's try it, and the answer is that yes, this works in BigQuery. Be aware, however, that BigQuery is an especially user-friendly and fun-to-use product; in many databases this is not allowed, in the sense that the database will not be kind enough to recognize your label inside the HAVING operation, and you will have to repeat the logic, as I am doing now. That is why I write it this way: I want you to be aware of this limitation.

Another thing you might not realize immediately is that you can also filter by aggregated columns that you are not selecting. Say I want to group by class and get the average experience for each class, but only keep classes with a high enough average level. I am perfectly able to do that: I just write HAVING AVG(level) >= 20, and after running it I get three values instead of four, so one class has been dropped. The average level is not shown in the results, but I can of course add it and verify that the classes that remain all respect the condition, an average level of at least 20. So in HAVING you are free to write filters on aggregated values regardless of which columns you select.

To summarize once more: you get the data you need, you drop the rows that are not needed, you can then GROUP BY to subdivide the data and compute aggregations within those groups, after which you have the option to filter on the results of those aggregations, and finally you pick which columns you want to see and apply all the other operations we have seen in the course.
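Sketches of the two HAVING filters described above (fantasy.characters with class, experience, and level assumed from the lecture):

```sql
-- Keep only classes whose average experience is at least 7,000
SELECT
  class,
  AVG(experience) AS avg_experience
FROM fantasy.characters
GROUP BY class
HAVING AVG(experience) >= 7000;

-- Filter on an aggregation that is not even selected
SELECT
  class,
  AVG(experience) AS avg_experience
FROM fantasy.characters
GROUP BY class
HAVING AVG(level) >= 20;
```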
We are now ready to learn about window functions, a very powerful tool in SQL. Window functions let us do computations and aggregations over multiple rows; in that sense they are similar to what we've seen with aggregations and GROUP BY. The fundamental difference is that grouping fundamentally alters the structure of the table. If I take this items table, which has about 20 rows, and group it by item type, the resulting table will have only three rows, because there are only three types of items; the table is significantly compressed, its structure changes. And we have seen, with the basic law of grouping, that you have to work around this alteration: if you group by item type you cannot just select power as-is, because the result has three rows but there are 20 power values, so you have to select an aggregation of power that compresses those values to one per item type; and the same goes for name, which you would also have to aggregate somehow, for example by collecting the names into a list, an array. Window functions are different: they let us do aggregations, let us work on multiple values, without altering the structure of the table, without changing its number of rows.

Let's see how this works in practice. Imagine I want the sum of all the power values of my items, the total power of all of them. You should already know how to get just that number in SQL: take the fantasy items table and SELECT the SUM of power. Pasting that into BigQuery gives exactly that, and it is a typical aggregation: SUM has taken 20 different values of power and compressed them down to one, and it has done the same to my table, squishing 20 rows down to one row. That is how aggregations work, as we've seen in the course. But what if I want to show the total power without altering the structure of the table, in other words to show the total power on every row? In the spreadsheet I can take the sum of all the power values (the same number we saw in BigQuery), paste it into a new column, and expand it onto every row.

Why would I want this? There are several things you can do with this setup. For example, Phoenix Feather has power 100, so I can take that 100, divide it by the total power in the same row, and turn it into a percentage, roughly 6.5%, and now I can say that the Phoenix Feather accounts for about 6 or 7 percent of all the power in my items, of all the power in my game, which might be useful information. A more mundane example: imagine this is your budget, the things you spend money on, with prices instead of power; the total is what you spent in a month, and you want to know what percentage of your budget going to the movies covered, and so on. Now let me delete that pasted value, because we're going to obtain the result in SQL instead.

Once again we go to the fantasy items table, and we select the SUM of power just as before, except that now I add OVER followed by an opening and a closing round bracket, and this is enough to obtain the result. To be precise, when I write it in BigQuery I also want to see a few columns, the name, the item type, and the power, then a comma, then SUM(power) OVER (), and I'll give it a label just like in the spreadsheet; that is the query that reproduces what we built in the sheet. The way it works is that the OVER keyword signals to SQL that you want a window function: you will compute an aggregation, a calculation, but you are not going to alter the structure of the table; you are simply going to take the value and put it on each row. Because this is a window function, we also need to define a window. What exactly is a window? A window is the part of the table that each row is able to see. We will understand what that means in much more detail by the end of this lecture, so don't worry about it yet.
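A sketch of the query just described, with the total shown on every row (same assumed items table):

```sql
SELECT
  name,
  item_type,
  `power`,
  SUM(`power`) OVER () AS total_power   -- empty OVER (): the window is the whole table
FROM fantasy.items;
```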
For now, note that the place where we usually specify the window is inside the brackets after OVER. Here there is nothing between them, and that means the window for each row is the entire table: pretty simple, each row sees the whole table. To understand how a window function works, we always have to think row by row, because the result can in principle differ from row to row, so let's go row by row and figure out what this one is doing. For the first row, what is the window, meaning what part of the table does this row see? The answer is all of it, and given that it sees all of the table, it has to compute the sum of power over all of it and put the result in its cell. Moving to the second row: what is the window here, what part of the table does this row see? Once again, all of it, so it takes power, computes the sum over it, and puts the result in its cell. I hope you can see that the result has to be identical in every cell, in every row, because every row sees the same thing and computes the same thing; this is why every row here gets the same value, and it is probably the simplest possible use of a window function.

Let's take this code to BigQuery and make sure it runs as intended. As I said in the lecture on grouping, power shows up in blue because BigQuery is confusing it with its built-in function, so it is best practice to wrap it in backticks and be very explicit that you are referring to a column. What you see is exactly what we had in the sheet, and we now have this new field showing the total power on every row. As I said, we can use it for several purposes: for example, to show, for each item, what percentage of the total power it covers, just as I did before in the sheet, I take power and divide it by this window expression, which gives me the total power, and call the result percent_total_power. Since this is just a division, to see a percentage I also multiply by 100, but we know how to do that. Looking at the result, when an item has power 100 it covers almost 6.5% of the total power, the same thing we computed before, which goes to show that you can use these fields in your calculations; and as I said, if this were your budget you could use it to calculate what percentage of your total spending each item covers, a pretty handy thing to know.

Now, why do I have to repeat all of that logic; why can't I just write power divided by total_power? As you know from other parts of the course, the SELECT is not aware of these aliases, these labels we are providing within it, so it won't recognize the label, and unfortunately if I want to show both I have to repeat the logic. And of course I'm not limited to taking the sum: what I have here is an aggregation function just like the ones we used with simple aggregations and with grouping, so instead of SUM I could use something like AVG over the same empty window.
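A sketch of the percentage-of-total idea; the alias cannot be reused inside the same SELECT, so the window expression is repeated:

```sql
SELECT
  name,
  `power`,
  SUM(`power`) OVER ()                  AS total_power,
  `power` / SUM(`power`) OVER () * 100  AS percent_total_power
FROM fantasy.items;
```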
And of course I'm not limited to taking the sum. What I have here is an aggregation function, just like the ones we've seen with simple aggregations and grouping, so instead of SUM I could use something like AVG — with power in backticks and, importantly, followed by OVER, otherwise it won't work because BigQuery won't know it's a window function. I give it a label, and now every row shows the same value: the average power over the whole data set. You can use basically any aggregation function you need and it will work all the same; a few more backticks to be precise, and the result is what we expect.

Let's proceed with our explorations. I would now like to see a total of power on each row, but not the total of the whole data set: the total power by item type. If my item is an armor I want to see the total power of all armors, if it's a potion the total power of all potions, and so on, because I want to compare items within their category, not every item against every item. How can I achieve this in the spreadsheet? Let's start with the first row. I need to check which item type I have — conveniently I've sorted the data, so this is quick — and it's an armor, so I want the total power for armor. I use the SUM function, being careful to select only the rows where the item type is armor, and that gives me the result. The next step is simply to copy this value and fill it into all of the rows that are armor — again being careful, because the spreadsheet wants to continue the pattern, while what I want is the exact same number — so that every armor row shows this value, because I'm looking within the item type. Then I do it for potion: the sum of power for all potions is 239, and I copy that exact value to all potion rows. Next we have weapons: the sum of power for weapons is here, copy it, copy it — let's see if it tries to continue the pattern, it does — so I'll just paste it, tidy this up, and now I have what I wanted: each row shows the total power within the item type of that row.

Now, how do I write this in SQL? Two parts of the query stay the same, because we still want the items table and these columns, but we need to change how we write the window function. Once again I want the sum of power, but now I need to define a specific window. Remember, the window defines what each row sees. So what do I want each row to see when it takes the sum of power? Take the first row: I want it to see only rows whose item type is armor — in other words, all the rows with the same item type. I can achieve this by writing PARTITION BY item type inside the window. Defining the window as a partition by item type means each row looks at its own item type and the table is partitioned so that the row only sees rows sharing that item type. So this first row will see only these four rows, take the sum of power over them, and put it in its cell; for the second, third, and fourth rows the result is the same, because they each see this same part of the table. When we come to a potion row, it says: my item type is potion, so I will only look at rows with item type potion — that is the window for these four rows — and within those rows I take power and sum it.
Finally, when we come to these rows over here, starting with this one: it looks at its item type, sees that it is a weapon, and looks at all the rows that share that item type — so its window looks like this (let me color it properly). It takes the sum of the power values that fall inside the window and puts it in the cell; the second cell sees the same window, sums the same values, and puts the result in its cell. This is how we get the required result, and this is how we use partitioning in window functions. Let's go to BigQuery and make sure this actually works. When I run it — I didn't add a label — you can see that I'm getting the same result: weapons show one value, potions another, and armors a third. So for each item I'm now seeing the total power not over the whole table, but within its item type.
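Here is a sketch of the partitioned version, again assuming table and column names like `fantasy.items`, `item_name`, `item_type`, and `power`:

```sql
-- Each row only sees rows that share its item_type, so every armor row shows
-- the armor total, every potion row the potion total, and so on.
SELECT
  item_name,
  item_type,
  `power`,
  SUM(`power`) OVER (PARTITION BY item_type) AS total_power_by_type
FROM `fantasy.items`;
```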
Next task: find the cumulative sum of power, which is this column over here. What is a cumulative sum? It is the sum of the power of this item plus the power of all the items that are less powerful. To do this in the spreadsheet I first want to reorder my data, because I want to see it simply in order of power: I take the whole range, go to Data, Sort range, Advanced options, say the data has a header row so I can see the column names, and order by power ascending. As you can see, the records are now sorted by ascending power. How do I compute the cumulative sum? In the first row all we have is 30, so the sum is 30. In the second row I have 40 plus the 30 before it, so 70. In the next row I have 50, and the sum up to now was 70 — which I can see by looking at the previous two cells, or more simply at the last cumulative cell — so 50 + 70 is 120. Proceeding like this I could compute the cumulative power over the whole column. For your reference, I've figured out the Google Sheets formula that computes this cumulative sum for our example, and I've gone ahead and filled it in for all our data. This is the formula right here; I won't go in depth, because this is not a course on spreadsheets, but I'll show it in case you're curious. The SUMIF function takes a sum over a range, but only considers values that satisfy a logical condition: the first argument is the range to sum over, which is power, and the criterion — what must be true for a value to be considered — is that the value is less than or equal to the level of power in this row. So the formula says: take the level of power in this row, take all the values of power that are less than or equal to it, and sum them up. That is exactly what our window function does, so the formula reproduces it. If you look up how to do a cumulative sum or running total in Google Sheets you'll find other solutions, but they come with some pitfalls and corner cases; this formula actually reproduces the behaviour of SQL.

Now let's go back to SQL and see how to write this. I take the fantasy items table, I still select the columns, and now I have to write my window function. The aggregation is the same — take the sum of power — but the window is no longer defined by a partition; it is defined by an ordering: ORDER BY power. When I write ORDER BY power in a window function, the keyword ASC, for ascending, is implicit: the window orders power from smallest to biggest. I can choose to write the keyword or not, because just like ORDER BY elsewhere in SQL, the default direction is ascending. How does this window work? Let's start with the first row and say we need to fill in this value. I look at my power level, which is 30, and the window says I can only see rows where the power level is equal or smaller. Which rows have power less than or equal to 30? These rows over here — effectively this is the only part of the table this row sees — so the sum of power is 30. Second row: power level 40; the window only includes rows where power is smaller or equal, which is these two rows; the sum is 70; put it in the cell. Third row: power level 50; I only see these rows; the sum is 120; put it in the cell. I can continue like this until I reach the highest value in the data set, which is 100 — never mind that it is not literally the last row, since both of the last two rows share the highest value. When you come to such a row and ask what the window is, you can see all rows where power is 100 or less, which is basically the whole table, so the sum of power is the total sum. In fact, in this case the cumulative power equals the total power we computed before, just as we would expect. That's easy to see here because we have ordered the data conveniently, but it works in any case. So what ORDER BY does in a window function is make sure that each row only sees rows that come before it in the given ordering: if I order from smallest to biggest power, each row only sees rows with the same or lower level of power, never a higher one. Let's take it to BigQuery and make sure it works as intended; I'll add an ordering by power so I see the same thing I showed you in the spreadsheet. I notice that some numbers are different — these two items have 90 instead of 100 — but never mind that: the logic is the same and the numbers make sense.
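A sketch of the running-total query, with the same assumed names as before (both directions are shown, since we look at descending next):

```sql
-- ORDER BY inside the window: each row only sees rows whose power is equal
-- to or smaller than its own (ASC is the default), which yields a running total.
-- Rows that tie on power are included together, matching the behaviour above.
SELECT
  item_name,
  `power`,
  SUM(`power`) OVER (ORDER BY `power` ASC)  AS cumulative_power,
  SUM(`power`) OVER (ORDER BY `power` DESC) AS reverse_cumulative_power
FROM `fantasy.items`
ORDER BY `power`;
```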
Now, I'm also able to change the direction of the ordering. Let's say I take this field and copy it exactly, except that instead of ordering by power ascending I order by power descending. What do you expect to see? Let's take a look. What I see here is that each item looks at its own level of power and then only considers items that are just as powerful or more powerful — the exact same logic, but reversed. So when you look at the weakest item, a potion with power 30, it is looking at all the items, because there is no weaker item, and it finds the total power of the data set. But if you go to the strongest item, like Excalibur, it has a power level of 100, and there are only two items in the whole data set at that level — itself and the phoenix feather — so summing the power over those gives 200. It's the exact same logic, but now each row only sees items with the same level of power or higher. So when you order inside a window function you can decide the direction of the ordering with DESC or ASC, or, if you are a lazy programmer, you can omit the ASC keyword and it works just the same, because that's the default.

Finally, we want to compute the cumulative sum of power by type, and you might notice that this is, in a way, the combination of the two previous requirements. Let's see how to do that. First I want to sort the data to help us: I take the whole range, Sort range, Advanced options, there's a header row, and I order first by type and then, within each type, by power. This is our data now. For each item I want to show the cumulative sum of power, just like before, except that now I only want to do it within the same item type. If we look at armor, it is already sorted: the first armor has power 40, the smallest, so I just put 40 over here. The next row is still armor, so I look at these two values and sum them; I keep adding each armor's power to the running total, and the last armor row ends up at 90, the sum of all the armor values. Now I'm done with armor: a new item type begins, so I have to start all over again. Looking at potions: we start with 30, the smallest value; then we move to 50, which now sees 30 and 50, giving 80; add 60 to 80, that's 140; and finally we add 99 to 140 — which is another way of saying we add up all the values for potion. This is what we want: a cumulative sum of power within the item type, starting over whenever we reach a new type. To calculate it for weapon I could copy my formula from here, paste it at the weapons, and modify it: the range has to include only weapons, starting from C10, and the value I compare against also has to start at C10, because I want to begin with the power level of the first weapon — for some reason it shows purple, but it should be correct; it should always be the sum of the previous values. So we start with 65, then 65 plus the next weapon's power, and so on. This is our result: cumulative power within the item type.

To write this in SQL I take my previous query, and when we define the window we simply combine what we've done before: the PARTITION BY with the ORDER BY. You need to write them in this order — first the partition, then the order — so I will PARTITION BY item type and ORDER BY power ascending, and this achieves the required result. For each row in this field, the window is defined as follows: first, partition by item type, so you can only see rows that have the same item type as you; then, within that partition, keep only rows whose power is equal to or smaller than yours. In the case of the first item, you only get this row.
Likewise, in the case of the first potion item, you only get this row. If you look at the second armor item, again it partitions: it looks at all the items that are armor, but then it discards those with a bigger power than itself, so it will be looking at these two rows. And if, for example, we look at the last row over here, it will say: I'm a weapon, so I can only see weapons, and among those only the ones whose level of power is equal to or smaller than mine — and that checks out, those are all the weapon rows. In fact the sum here equals the sum of power by type, which is exactly what we would expect. Once again, let's verify that this works in BigQuery. I'll also order by item type and power, just so I have the same ordering as in my sheet, and I should see that within armor there is this growing cumulative sum; then, once the item type changes, it starts all over — it starts again at the first value, it grows, it accumulates — then we're done with potions, then we have weapons, and again it starts, grows, and goes all the way up to the total sum of all power in the weapon item type.
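A sketch of the combined version, with the same assumed names:

```sql
-- Partition first, then order: a cumulative sum of power that restarts
-- for every item_type.
SELECT
  item_name,
  item_type,
  `power`,
  SUM(`power`) OVER (
    PARTITION BY item_type
    ORDER BY `power` ASC
  ) AS cumulative_power_by_type
FROM `fantasy.items`
ORDER BY item_type, `power`;
```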
So here's a summary of all the variants of windows we've seen — four of them. In all of them, for clarity, we've kept the aggregation identical: we take the sum of the power field, but of course you can use any aggregate function on any column that is compatible with it. We've defined four different windows. The first is the simplest: there is nothing in the definition, just OVER, which means every row sees the whole table, so every row shows the total power of the whole table — simple as that. The second introduces PARTITION BY item type, which in practice means each row looks at its own item type and only considers rows that share that exact item type; within those rows it calculates the sum of power. In the third window we have an ordering field: each row looks at its own level of power, because we're ordering by power, and only sees rows whose power is equal or smaller. The reason we look in that direction is that ORDER BY power implicitly means ORDER BY power ascending; if instead we ordered by power descending, it would be the same thing in the opposite direction — each row would only consider rows whose power is equal or bigger. And finally the fourth is a combination of the two: a window with both a partition and an order, meaning each row looks at its item type and discards all rows with a different item type, and then, within the rows that remain, applies the ordering and only considers rows with the same level of power or less. It is simply a combination of the two conditions.

And this is the gist of how window functions work. First thing to remember: window functions provide aggregation, but they don't change the structure of the table — they just insert a specific value on each row, so after applying a window function the number of rows in your table is the same. Second thing to remember: in the window definition you get to define what each row is able to see when computing the aggregation, so when you think about a window function you should ask yourself: what part of the table does each row see, what is each row's perspective? There are two dimensions you can work with to define these windows: the partition dimension and the ordering dimension. The partition dimension cuts up the table based on the value of a column, so a row only keeps rows with the same value. The order dimension cuts up the table based on the ordering of a field: depending on the direction you choose, ascending or descending, a row looks at rows that come before it or after it in that ordering. You can pick either of these — partitioning or ordering — or combine them, and with this you can define all the windows you might need for your data.

As a quick extension, I want to show you that you're not limited to defining windows on single columns: you can list as many columns as you want. In this example I go to the fantasy characters table, get a few columns, and define an aggregation in a window function: I take the level field and sum it, and I partition by two fields, guild and is_alive. What do you expect to happen? This is the exact same logic as grouping by multiple fields, which we saw with GROUP BY: the data is divided not by guild alone, nor by whether the character is alive or not alone, but by all the mutual combinations of these fields. So Mirkwood and true is one combination, and the characters in it fit together — in fact we have two characters here, levels 22 and 26, whose sum is 48, and you can see they both get 48 for the sum of level. Likewise, the three characters with Shire Folk and true all end up in the same group and share the same sum of level, 35, whereas Shire Folk and false is another group with a single character at level 12, so the sum is 12. Again: when you partition by multiple fields, the data is divided into groups given by all the combinations of values those fields can take, and if you experiment a bit by yourself you should have an easier time convincing yourself of this.
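Here is a sketch of the two-field partition just described — the table and column names (`fantasy.characters`, `name`, `guild`, `is_alive`, `level`) are assumptions based on what the lecture shows:

```sql
-- Partitioning by two fields: each row only sees rows that match it on BOTH
-- guild and is_alive, i.e. every distinct combination of the two is its own group.
SELECT
  name,
  guild,
  is_alive,
  level,
  SUM(level) OVER (PARTITION BY guild, is_alive) AS sum_level_by_group
FROM `fantasy.characters`;
```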
The same idea applies to the ORDER BY part of a window. Until now, for simplicity, we've ordered by one field — and to be honest, most of the time you will only need one — but sometimes you may want to order by several fields. In this example we define the ordering on two fields, power and then weight, and based on that ordering we calculate the sum of power. This is again a cumulative sum, but the ordering is different, and you can see it if you look at the most powerful items in our data, the last two, both at 100. When we ordered by power alone, those two rows had the same value in the window function, because ordered just by power they are tied — both have 100. But now we also order by weight, ascending, from the lightest to the heaviest, so the phoenix feather comes first: it has the same power as Excalibur, but it is lighter, and because it comes first it gets a different value for this aggregation. And of course we can say ascending or descending for each of the fields we order by: if I wanted to reverse this I could simply write DESC after weight — and be careful, in that case DESC refers only to weight, not to power; it is just as if I had written ASC on power explicitly, which can be omitted because it's the default, but I would write both to be clear. If I run this, you will see the result is reversed: Excalibur comes first, because with weight descending it is heavier, and the phoenix feather, which is lighter, comes last. Understanding this in theory is one thing, but I encourage you to experiment with it, with your own data and with exercises, and then you will be able to internalize it.
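A sketch of a two-field ordering with a direction per field (same assumed names, plus an assumed `weight` column as shown in the lecture):

```sql
-- Ordering by two fields: ties on power are broken by weight.
-- The direction applies per field: power ascending, weight descending.
SELECT
  item_name,
  `power`,
  weight,
  SUM(`power`) OVER (ORDER BY `power` ASC, weight DESC) AS cumulative_power
FROM `fantasy.items`;
```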
Now we are back to our schema for the logical order of SQL operations, and it is finally complete, because we have now seen all of the components we can use to assemble a SQL query. The question is: where do window functions fit? As you can see, we've placed them right here. What happens is this: you get your data, then the WHERE filter runs, dropping rows you don't need; then you have a choice of whether to GROUP BY. If you group, you change the structure of your table: it no longer has the same number of rows, but a number of rows that depends on the unique values of your grouping field, or the unique combinations of values if you grouped by more than one field. If you group, you will probably compute some aggregations, and then you may want to filter on those aggregations with HAVING, dropping rows based on their values. And here is where window functions come into play: they work on this result. If you haven't done a GROUP BY, window functions work on your data after the WHERE filter has run; if you have done a GROUP BY, they work on the result of your aggregation. After applying the window function you can select which columns to show and give them labels, and then all the other parts run: you can drop duplicates from your result (rows with the same value in every column), you can stack different tables on top of each other, and finally, when you have your result, you can apply an ordering and cut the result with a limit so you only show a few rows. That is where window functions fit in the big scheme of things. There are some other implications of this ordering: one interesting one is that if you have computed aggregations, such as the sum of a value within a class, you can use those aggregations inside the window function — a sort of aggregation of an aggregation — but that is, in my opinion, an advanced topic that doesn't fit in this fundamentals course; it may fit someday in a later, more advanced one.

I want to show you another type of window function that is very commonly used and very useful in SQL challenges and SQL interviews: numbering functions. Numbering functions are the functions we use to number the rows in our data according to our needs. There are several of them, but the three most important ones are, without any doubt, ROW_NUMBER, DENSE_RANK, and RANK. Let's see how they work in practice. What I have here is part of my inventory table: I'm showing the item ID and the value of each item, and conveniently I have ordered the rows by value ascending. We are going to number the rows according to their value using these window functions. I've already written the query I want to reproduce: I go to the fantasy inventory table, select the item ID and the item value as you see here, and use three window functions. The syntax is the same as in the previous exercises, except that I'm no longer applying an aggregation function to a field, as I did when taking the sum of power; I'm using another type of function, a numbering function. These functions don't take a parameter — there is nothing between the round brackets — because you don't need to give them an argument; all you need to do is call the function and, crucially, define the correct window. As you can see, in these three examples the windows are all the same: I am simply ordering the rows by value ascending, which means that when the window function is computed, every row looks at its own value and says: I only see rows whose value is the same or smaller; I cannot see rows whose value is bigger than mine. So the first row only sees the value 30, the second row sees these, the third row sees these, and so on, up to the last row, which sees itself and all the other rows.
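Here is roughly what that query looks like — the table name `fantasy.inventory` and columns `item_id` and `value` are assumptions based on what is shown on screen:

```sql
-- Numbering functions take no argument; the window's ORDER BY does all the work.
SELECT
  item_id,
  value,
  ROW_NUMBER() OVER (ORDER BY value ASC) AS row_num,         -- unique 1, 2, 3, ...
  DENSE_RANK() OVER (ORDER BY value ASC) AS dense_rank_num,  -- ties share a rank, no gaps
  RANK()       OVER (ORDER BY value ASC) AS rank_num         -- ties share a rank, gaps follow
FROM `fantasy.inventory`
ORDER BY value;
```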
Now let's start with ROW_NUMBER. ROW_NUMBER uses this ordering to number my rows, and it's as simple as putting 1 in the first row, 2 in the second, then 3, 4, and so on; if I extend the pattern, every row gets a number. That's all ROW_NUMBER does: it assigns a unique integer to every row based on the ordering defined by the window. You might think: big deal, don't I already have row numbers here in the spreadsheet? But in SQL problems you often need to number things based on different values, and ROW_NUMBER lets you do that; you can even have several different numberings coexisting in the same table, based on different conditions, which comes in handy if you do SQL problems.

Now let's move on to ranking, starting with DENSE_RANK. Ranking is another way of counting, but it is slightly different. Sometimes you just want to count things, like we did with ROW_NUMBER — say you're a dog sitter with 20 dogs, you're getting confused between their names, so you assign a unique number to every dog so you can identify them and sort them by age or by how much you're being paid to dog-sit them. Other times you want to rank things, like when choosing which product to buy, or expressing the results of a race. The difference between ranking and counting shows up when you have the same value. When you simply number, as we did here, and two things have the same value, you don't really care: you sort of arbitrarily decide that one of them will be number two and the other number three. You cannot do that with ranking: if two students in a classroom get the best score, you can't randomly choose that one of them is number one and the other number two — they both have to be number one; and if two people finish a race with the same best time, you can't say that one won and the other didn't — they have to share that rank. This is where ranking differs.

So let's apply our dense rank. We are ordering by value ascending, which means the smallest value gets rank number one, so 30 has rank 1. We go to the second row — and remember, with window functions you always have to think row by row, about what each row sees and what each row decides. This row orders by value, so it only sees these values, and it has to decide its rank: it says, I'm not actually number one, because there is a value smaller than me, so I have to be number two. Then we get to the third row, which sees all the values that come before it — equal or smaller — and says: I'm not number one, because there's something smaller; but the value 50 that this other row has is rank two, and I have the same value, 50; we arrived at the same spot, so I must have the same rank. This is the difference between ROW_NUMBER and a rank: identical values get the same rank, but they don't get the same row number. Then we come to the row with 60: it looks back and says, 30 is the smallest, so it has a rank of one; 50 and 50 share a rank of two; I am bigger, so I need a new rank — and with dense rank the new rank is three, simply the next number in the sequence. The next row picks four, the next five, then six, and it proceeds like this — I'll do it quickly now: 7, 8, 9, 10, 11, and careful here, two rows share the same value so they are both 11; then 12; then 13, again a shared value, so they both take the 13th spot; then 14 for the next value, and 14 again because that value repeats; then 15; and finally 16. That is what we expect to see when we compute the dense rank.

And finally we come to RANK. RANK is very similar to dense rank, but there is one important difference. Let's do it again: the smallest value has rank one, as before; then 50 has rank two, and the second 50 once more shares rank two; and now we move from 50 to 60, so we need a new rank — but instead of three we put four. Why four? Because the previous rank covered two rows: it sort of ate the three, it expanded to swallow the three, so by the rules of plain RANK we lose the three and put four here. This is just another way of managing ranking, and you will notice it conveys another piece of information compared to dense rank: not only can I see that this row has a different rank from the previous row, I can also see how many rows were covered by the previous ranks.
I can see that the previous ranks must have covered three rows, because I'm already at four — a piece of information that was not available with dense rank. Continuing: the next value gets rank five, then six, seven, eight, nine, ten, eleven; now I have rank 12, shared again because of two identical values, but since the 12 has eaten up two spots I can't use 13 anymore — the second 12 has, so to speak, eaten the 13 — so I jump straight to 14; then 15, 15 again, and now I have to jump to 17, because 15 took two spots; 17 again, then jump to 19, and finally 20. Notice that the final number for RANK is 20, just as with ROW_NUMBER, because RANK doesn't only differentiate between ranks, it also effectively counts how many elements came before: I can tell there are 19 rows in the previous ranks because of how RANK works, whereas with dense rank we only got up to 16, so we sort of lost the information about how many records we have. This may be one of the reasons why RANK is the default way of ranking rather than dense rank, even though dense rank seems more intuitive when you build the ranking yourself.

We can now take this query — hopefully I've written it correctly — go to BigQuery, and run it. As you can see, we have our items sorted by value, and then our numbering functions. ROW_NUMBER goes from 1 to 20 with no surprises, because it's just numbering the rows. DENSE_RANK gives rank one to the first row, then these two share the same rank because they both have 50, and the next rank is three — just as I showed you in the spreadsheet; similarly, here you have 11, 11 and then 12. RANK starts off just the same — the smallest value has rank one and the next two values have rank two — but after using up two and two you have effectively used up the three, so it jumps straight to four; after 15 and 15 it jumps to 17, after 17 and 17 it jumps to 19, and the highest number here is 20, which tells you how many rows you are dealing with. Of course, what you see here are window functions, and they work just as I've shown you, so you could take RANK and order by value descending; then you would get the inverse of that ranking, in the sense that the highest-value item gets rank one and the lowest-value item gets the biggest rank number. RANK is often used like this: the thing that has the most of what we care about — the biggest salary, the biggest value, the most successful product — gets rank one, like the first across the finish line, and everyone else follows from there; so in practice we often order by something descending when we calculate a rank.

And because these numbering functions are window functions, they can also be combined with PARTITION BY if you want to cut the data into subgroups. Here's an example on the fantasy characters table: we partition by class, meaning each row only sees the other rows that share its class — archers only care about archers, warriors only care about warriors, and so forth — and then, within the class, we order by level descending, so the highest levels come first, and we use this to rank the characters within their class.
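A sketch of that per-class ranking, again with assumed names (`fantasy.characters`, `name`, `class`, `level`):

```sql
-- A separate ranking inside every class: the highest-level archer and the
-- highest-level warrior each get rank 1 within their own class.
SELECT
  name,
  class,
  level,
  RANK() OVER (PARTITION BY class ORDER BY level DESC) AS rank_in_class
FROM `fantasy.characters`;
```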
So if I look at the results: within the archers, the highest-level archer has level 26, so they get rank one, and all the others go down from there; then we have our warriors, and the highest-level warrior, at level 25, also gets rank one, because they are being ranked within warriors. It's like a race with categories: several people "arrive first" because each arrives first in their own category — it's not everyone competing against everyone. And so on and so forth: each class of character has its own dedicated ranking. You can check the BigQuery documentation page on numbering functions if you want to learn more; you'll find the ones we've talked about — RANK, ROW_NUMBER, and DENSE_RANK — plus a few more, but these are the ones most commonly used in SQL problems.

Because I know it can be a bit confusing to distinguish ROW_NUMBER, DENSE_RANK, and RANK, here's a visualization you might find useful. Say we have a list of values, which are these ones, ordered in descending order; you can see there is quite some repetition in them. Given this list, how would the different numbering functions treat it? ROW_NUMBER is easy: it just assigns a unique number to each row; it doesn't matter that some values are the same, you sort of arbitrarily pick one to be 1, the other to be 2, then 3, and even though here you have 10, 10, 10, it doesn't matter — you just want to order them, so you do 4, 5, 6, and finally 7. DENSE_RANK does care about the values being the same: 50 and 50 both get 1, 40 gets 2, the 10s all get 3, and 5 gets 4 — easy, the rank just grows through all the integer numbers. RANK also assigns rank 1 to 50 and 50, but it also throws away the 2, because there are two elements in that group; the next value gets rank 3, since the 2 has already been used up; the next batch, the 10s, gets rank 4 but also burns 5 and 6; and the last value can then only get rank 7. Those are the differences between ROW_NUMBER, DENSE_RANK, and RANK, visualized.

We have now reached the end of our journey through the SQL fundamentals. I hope you enjoyed it and that you learned something new. You should now have some understanding of the different components of SQL queries, the order in which they run, and how they come together to let us do what we need with the data. Of course, learning the individual components and understanding how they work is only half the battle; the other half is: how do I put these pieces together, how do I use them to solve real problems? In my opinion the answer to that is not more theory but exercises: go out there and do SQL challenges, do SQL interviews, find exercises — or even better, find some data that you're interested in, upload it to BigQuery, and try to analyze it with SQL. I should let you know that I have another playlist where I solve 42 SQL exercises in PostgreSQL, and I think it can be really useful for getting the other half of the course — doing exercises and learning how to face real problems with SQL. I really like that playlist because I'm using a free website, one that doesn't require any sign-up or login — it just works.
You get the chance to go there and do exercises that cover all the theory we've seen in this course, and after trying each one yourself you get to see me solve it, with my thought process and my explanation. I think it can be really useful if you want to deepen your SQL skills. But in terms of "how do I put it all together, how do I combine all of this stuff", I want to leave you with another resource I've created: this table. It shows the fundamental moves you will need whenever you do any type of data analytics; I believe every sort of analysis you might work on, no matter how simple or complicated, can ultimately be reduced to these few basic moves — and by now they should be quite familiar to you. We have joining, where we combine data from multiple tables based on connections between columns; in SQL you do that with JOIN. Then we have filtering, where we pick certain rows and discard others — say, look only at customers that joined after 2022. There are a few tools for that in SQL: the most important is the WHERE filter, which comes into action right after you've loaded your data and decides which rows to keep and which to discard; HAVING does just the same, except it works on aggregated fields, the fields you obtain after a GROUP BY; QUALIFY, which we haven't actually seen in this course because it's not a universal component of SQL — certain systems have it, others don't — is basically also a filter, and it works on the result of window functions (there is a small sketch of it right after this overview); and finally DISTINCT, which runs near the end of the query and removes duplicate rows. Then we have grouping and aggregation, which we've seen in detail: you subdivide the data along certain dimensions and calculate aggregate values within those dimensions — fundamental for analytics. How do we aggregate in SQL? With GROUP BY and with window functions, and for both we use aggregate functions such as SUM, AVG, and so on. Then we have column transformations, where you apply logic and arithmetic to transform columns, combine column values, and turn the data you have into the data you need: in the SELECT we can write calculations involving our columns, we have CASE WHEN, which gives us a sort of branching logic to decide what to do based on conditions, and we have a lot of functions that make our life easier by doing specific jobs. Next we have UNION, which is pretty simple: take tables that have the same columns and stack them together, putting their rows on top of each other. And finally we have sorting, which changes how your data is ordered when you get the result of your analysis and can also be used inside window functions to number or rank our data. These really are the fundamental elements of every analysis and every SQL problem you will need to solve, so one way to face a problem, even one you are finding difficult, is to come back to these fundamental components and think about how you need to combine them — how you can break your problem down into simpler operations built from these steps.
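Since QUALIFY is the one move in this table we did not cover, here is a hedged sketch of how it looks in BigQuery (a few other systems, such as Snowflake, also support it; many do not). The table and column names are again the assumed fantasy ones:

```sql
-- QUALIFY filters on the result of a window function, much as HAVING filters
-- on the result of GROUP BY. Here: keep only the most powerful item per item_type.
SELECT
  item_name,
  item_type,
  `power`
FROM `fantasy.items`
WHERE TRUE  -- some BigQuery versions expect a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY RANK() OVER (PARTITION BY item_type ORDER BY `power` DESC) = 1;
```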
Now, at the beginning of the course I promised that we would solve a hard SQL challenge together at the end, so here it is: let's try to solve this challenge by applying the concepts from this course. A quick disclaimer first: I'm picking a hard challenge because it's sort of fun and it gives us a playground to showcase several concepts we've seen, and also because I'd like to show you that even big, hard, scary challenges that are marked as hard and even have "Advanced" in their name can be tackled by applying the basic concepts of SQL. However, I do not intend for you to jump into these hard challenges from the very start: it is much better to begin with basic exercises, do them step by step, and be sure you are confident with the basic steps before moving on to more advanced ones. So if you have trouble approaching this problem, or even understanding my solution, don't worry about it — just go back to your exercises, start from the simple ones, and gradually build your way up.

That being said, let's look at the challenge: Marketing Campaign Success [Advanced] on StrataScratch. First of all, we have one table to work with for this challenge, marketing_campaign. It has a few columns and looks like this: user_id, created_at, product_id, quantity, price. Whenever I look at a new table, the one question I must ask in order to understand it is: what does each row represent? Just by looking at this table I can form some hypotheses, but I'm not actually sure, so I'd better go and read the text until I can get a sense of it. Let's scroll up and read: "You have a table of in-app purchases by user." That explains my table: each row represents an event, a purchase. It means that user 10 bought product 101 in a quantity of 3 at a price of 55, and created_at tells me when this happened — the 1st of January 2019. Great, now I understand my table, and I can see what the problem wants from me. Let's read the question: users that make their first in-app purchase are placed in a marketing campaign where they see calls to action for more in-app purchases; find the number of users that made additional purchases due to the success of the marketing campaign. The marketing campaign doesn't start until one day after the initial in-app purchase, so users that made one or multiple purchases on the first day do not count, nor do we count users that over time purchase only the products they purchased on the first day.

All right, that was a mouthful — on a first read this is actually a pretty complicated problem. Our next task is to understand this text and simplify it to the point where we can convert it into code, and a good intermediate step before jumping into code is to write some notes; we can use SQL comments for that. What I understand from the text is that users make purchases, and we are interested in users who make additional purchases thanks to this marketing campaign. How do we define an additional purchase? The fundamental sentence is this one: "users that made one or multiple purchases on the first day do not count" — so an additional purchase happens after the first day.
Then, "nor do we count users that over time purchase only the products they purchased on the first day" — so the other condition we're interested in is that the purchase involves a product that was not bought on the first day. And finally, what we want is the number of such users. That should be a good start for writing the code.

So let's look at the marketing_campaign table again — remember, each row represents a purchase. What do we need to find first? We want to compare purchases that happen on the first day with purchases that happen on the following days, so we need a way to count days. And what do we mean by first day and following days? The first day the shop was open? No — we mean the first day that the user ordered, because the user signs up, makes their first order, and only after that does the marketing campaign start. So we're interested in numbering days for each user, such that we know which purchases happened on day one, which on day two, day three, and so on. What can we use to run a numbering per user? A window function with a numbering function. So I go to the marketing_campaign table and select the user_id, the date on which they bought something, and the product_id for now. I said I need a window function, so let me start defining the window. I want to count days within each user, so I'll PARTITION BY user_id, so that each row only looks at the rows of that same user; and there is an ordering — a sequence from the first day the user bought something to the second, the third, and so on — so my window also needs an ordering, and the column that provides it is created_at. Which numbering function do I need? The way to choose is to ask: what should happen when the same user made two different purchases on the same date — do I want the function to output two different numbers, like a simple count, or the same number? The answer is the same number, because all the purchases that happened on day one need to be marked as day one, all the purchases on day two as day two, and so on. The numbering function that achieves this is RANK: if you remember, ranking works like ranking the winners of a race — everyone who shares the same spot gets the same number — and that is what we want here.

Let's see what this looks like, ordering by user_id and created_at, and look at our purchases. User 10 started buying on this day, bought one product, and the rank is 1 — actually, let's give this column a better name than "rank" and call it user_day. So user 10 had their first user_day on this date and bought one product; then, at a later date, they had their second user_day and bought another product; then a third. User 14 started buying on this date — their first user_day — and bought product 109 and, the same day, product 107, which is also marked as user_day 1; then, at a later day, they bought another product, and that is marked as user_day 3. Remember, with RANK you can go from 1 straight to 3, because the spot marked as 1 has eaten the spot marked as 2 — that's not an issue for this problem, so we are happy with this.
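A sketch of this first step, using the column names from the problem statement (the engine on StrataScratch may differ, so treat this as an approximation of what is shown on screen):

```sql
-- Step 1: number each user's purchase days. RANK gives every purchase made
-- on the same date (per user) the same user_day.
SELECT
  user_id,
  created_at,
  product_id,
  RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_day
FROM marketing_campaign
ORDER BY user_id, created_at;
```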
Now, if we go back to our notes, we see that we are interested in users who made additional purchases, and "additional" means it happened after the first day. How can we identify purchases that happened after the first day? There's a simple solution: filter out the rows where user_day is 1. All the rows with user_day 1 represent purchases the user made on their first day, so we can discard them and keep only the purchases that happened on the following days. But I don't really have a way to filter on this window function directly, because, as you recall from the order of SQL operations, the window function happens here and the WHERE filter happens before it — so the WHERE filter cannot be aware of what happens in the window function — and HAVING also happens before it. I need a different solution to filter on this field: a common table expression, so I can break this query into two steps. I'm going to wrap this logic into a table called t1 — or better, purchases, which is more meaningful. If I do SELECT * FROM purchases you'll see the result does not change, but now I can use the WHERE filter and require that user_day is bigger than 1, and if I look here I have all the purchases that happened after each user's first day.

There is still one last requirement to deal with: the purchase must happen after the first day and it must involve a product that the user didn't buy on the first day. So, among all the rows that represent a purchase, I need to drop the rows whose product_id the user bought on day one: if I find that user 10 bought product 119 on day one, a later purchase of 119 does not count and I'm not interested in it. How do I achieve this in code? I'm already getting all the purchases that didn't happen on day one, so I add another condition: AND product_id NOT IN (the products this user bought on day one). That is all the filtering I need to complete the problem: show me the purchases that did not happen on day one, and also make sure the user didn't buy this product on day one. So I need to add a subquery, and before I do, let me give an alias to this table so I don't get confused when I refer to it again inside the subquery. This first version of purchases we can call next_days, because we're only looking at purchases that happen after the first day; in the subquery we also look at purchases, but we're interested in the ones that actually happened on day one, so we can call it first_day, and we can use a WHERE filter to say that first_day.user_day must be equal to 1. That gives us a way to look at the purchases that happened on the first day. When we build this list we also need to make sure we're looking at the same user, and to do that we add: AND first_day.user_id must equal next_days.user_id, which ensures we don't get confused between users. And finally, what do we need from the list of first-day purchases? The list of products, so we select the product_id. Let me first check that the query runs — it runs, no mistakes.
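Putting the pieces together, the full query looks roughly like this — a sketch of the solution described here (not StrataScratch's official one), with column names taken from the problem statement:

```sql
-- Count users who, after their first purchase day, bought a product
-- they had not bought on day one.
WITH purchases AS (
  SELECT
    user_id,
    created_at,
    product_id,
    RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_day
  FROM marketing_campaign
)
SELECT COUNT(DISTINCT user_id)
FROM purchases AS next_days
WHERE next_days.user_day > 1
  AND next_days.product_id NOT IN (
    -- correlated subquery: the products THIS user bought on their first day
    SELECT first_day.product_id
    FROM purchases AS first_day
    WHERE first_day.user_day = 1
      AND first_day.user_id = next_days.user_id
  );
```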
Now let's review the logic of this query. purchases is basically a list of purchases with the added information of whether each one happened on day one, day two, day three, and so on. We take from it all the purchases that happened after day one, we also get the list of products that the same user bought on day one, and we make sure to exclude those products from our final list. This is a correlated subquery, because it is a subquery that provides different results for every row and must run for every row: in the first row we need the list of products that user 10 bought on day one and make sure this product is not in it, and when we go to another row, such as this one, we need the list of all products that user 13 bought on day one and make sure that 118 is not among them. That is why it's a correlated subquery. The final step of our problem is to get the number of these users, so instead of selecting star and getting all the columns, I write COUNT(DISTINCT user_id). If I run this I get 23; checking, this is indeed the right solution. So that is one way to solve the problem — hopefully it's not too confusing, but if it is, don't worry: it is, after all, an advanced problem. If you go to the provided solution, I do think mine is a bit clearer than what StrataScratch offers — theirs is actually a bit of a weird solution — but that is ultimately up to you to decide, and I am grateful to StrataScratch for providing problems like this one that I can solve for free.

Welcome to PostgreSQL Exercises, the website we will use to exercise our SQL skills. I am not the author of this website or of these exercises: the author is Alisdair Owens, who has generously created it for anyone to use, and it's free — you don't even need to sign up; you can go there right away and start working on it. I believe it is a truly awesome website, in fact the best at what it does, and I'm truly grateful to Alisdair for making it available to all. The way the website works is pretty simple: there are a few categories of exercises; you select a category and get a list of exercises; you click on one, and in the exercise view you have the question you need to solve, a representation of your three tables (we'll go into those shortly), and the expected results. In the text box over here you can write your answer and hit Run to see if it's correct — the results appear in the lower panel — and if you get stuck you can ask for a hint. There are also a few keyboard shortcuts you can use, and after you submit your answer, or if you are completely stuck, you can go here and see the answers and discussion. That's basically all there is to it.

Now let's have a brief look at the data, which is the same for all exercises: it describes a newly opened country club, and we have three tables. members represents the members of the country club, with their surname and first name, their address, their telephone, the date on which they joined, and so on. bookings stores an event whenever a member makes a booking at a facility. And finally facilities holds information about each facility — there are some tennis courts, badminton courts, massage rooms, and so on.
As you may know, this is a standard way of representing how data is stored in a SQL system: you have the tables, for each table you see the columns, and for each column you see its name and its data type — the type of data that is allowed in that column. As you know, each column has a single data type, and you are not allowed to mix multiple data types within a column. We have a few different data types here, written with their Postgres names: an integer is a whole number like 1, 2, 3; a numeric is a floating-point number such as 2.5 or 3.2; character varying is the same as a string, a piece of text, and the number in round brackets, such as (200), is the maximum number of characters allowed, so you cannot have a surname longer than 200 characters; and a timestamp represents a specific point in time. Those are all the data types we have here. Finally, you can see that the tables are connected: every row of the bookings table represents an event in which a certain facility ID was booked by a certain member ID at a certain time for a certain number of slots; the facility ID matches the facid field in facilities, and the memid field matches the memid (member ID) field in members, so the bookings table connects to both of these tables. These logical connections will allow us to use joins to build queries that work on all three tables together, and we shall see in detail how that works. There is also an interesting arrow over here that represents a self-relation, meaning the members table has a relation to itself. If you look here, this is actually very similar to the example I showed in my mental models course: each member can have a recommendedby field, which is the ID of another member — the member who recommended them into the club. This basically means you can join the members table to itself to get, at the same time, information about a specific member and about the member who recommended them, and we shall see that in the exercises.
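As a preview of how that self-relation gets used, here is a hedged sketch of a self-join on the members table — the pgexercises schema uses cd.members with memid, firstname, surname, and recommendedby, but double-check the column names on the site:

```sql
-- Each member joined to the member who recommended them.
-- A LEFT JOIN keeps members who have no recommender (recommendedby is NULL).
SELECT
  mem.firstname AS member_firstname,
  mem.surname   AS member_surname,
  rec.firstname AS recommender_firstname,
  rec.surname   AS recommender_surname
FROM cd.members AS mem
LEFT JOIN cd.members AS rec
  ON rec.memid = mem.recommendedby;
```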
Now, if you want to rock these exercises, I recommend keeping in mind the logical order of SQL operations. This is a chart that I introduced and explained extensively in my mental models course, where we start with the chart mostly empty and add one element at a time, making sure we understand it in detail, so I won't go in depth on it now. In short, the chart represents the logical order of SQL operations: these are all the components we can assemble to build our SQL queries, like our Lego building blocks for SQL, and when they are assembled they run in a specific order. The chart shows this order from top to bottom: first you have FROM, then WHERE, and then all the others. There are two very important rules: each operation can only use data produced above it, and an operation doesn't know anything about data produced below it. If you keep this in mind and keep this chart as a reference, it will greatly help you with the exercises, and as I solve them you will see that I put a lot of emphasis on coming back to this order, and actually thinking in this order, in order to write effective queries.

Let us now jump in and get started with the basic exercises. The first exercise is "Retrieve everything from a table": how can I get all the information I need from the facilities table? All my data is represented here, so I can check where to find what I need. As I write my query, I aim to always start with the FROM part. Why start with FROM? First of all, it is the first component that runs in the logical order, and that makes sense: before I do any work I need to get my data, so I need to tell SQL where my data is. In this case the data is in the facilities table. Next, I need to retrieve all the information from this table, which means I'm not going to drop any rows and I'm going to select all the columns, so I can simply write SELECT *. If I hit run, I get the result I need; it appears in the lower quadrant and matches the expected results. The star is a shortcut for "give me all of the columns of this table"; I could have listed each column in turn, but instead I took a shortcut and used the star.

"Retrieve specific columns from a table": I want to print a list of all the facilities and their cost to members. As always, let's start with the FROM part; the data is again in the facilities table. The question is actually not super clear, but luckily I can check the expected results: what I need are two columns, name and member cost. To get them I write SELECT name, membercost and hit run. So SELECT * gives all the columns of the table, but if I write the names of specific columns separated by commas, I get only those columns.

"Control which rows are retrieved": we need a list of facilities that charge a fee to members. We know we're working with the facilities table, and now we need to keep certain rows and drop others, keeping only the facilities that charge a fee to members. Which component can we use to do this? If I go back to my components chart, right after FROM we have the WHERE component, and WHERE is used to drop rows we don't need. So after getting the facilities table I can say WHERE member cost is bigger than zero, meaning the facility charges a fee to members, and finally I can get all of the columns.
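As a quick recap, here are the three queries just described, written out against the facilities table (table and column names follow the pgexercises schema):

```sql
-- Everything from the facilities table
SELECT * FROM cd.facilities;

-- Only the name and the cost to members
SELECT name, membercost FROM cd.facilities;

-- Only the facilities that charge a fee to members
SELECT * FROM cd.facilities WHERE membercost > 0;
```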
"Control which rows are retrieved, part two": like before, we want the list of facilities that charge a fee to members, but our filtering condition is now a bit more complex, because we need that fee to be less than 1/50th of the monthly maintenance cost. I copied over the code from the last exercise: we're getting the data from the facilities table and filtering for rows where the member cost is bigger than zero. Now we need to add a new condition, which is that the fee, the member cost, must be less than 1/50th of the monthly maintenance, so I can take the monthly maintenance and divide it by 50, and I have my condition. When I have multiple logical conditions in the WHERE, I need to link them with a logical operator so SQL can figure out how to combine them, because the final result of all my conditions needs to be a single value which is either true or false. In my mental models course I introduced the Boolean operators and how they work, so you can go there for more detail, but can you figure out which logical operator we need here to chain these two conditions, as suggested in the question? The operator we need is AND: both of these conditions need to be true for the whole expression to evaluate to true and for the row to be kept; only the rows where both conditions are true will be kept, and all other rows will be discarded. To complete the exercise I just need to select a few specific columns, because we don't want to return all the columns here. I'll cheat a bit by copying them from the expected results, but normally you would look at the table schema and figure out which columns you need. That completes our exercise.

"Basic string searches": produce a list of all facilities with the word Tennis in their name. Where is the data we need? It's in the cd.facilities table. Next question: do I need all the rows from this table, or do I need to filter some out? I only want facilities with the word Tennis in their name, so clearly I need a filter, therefore I need the WHERE statement. How can I write it? I need to check the name and keep only facilities that have Tennis in it, so I can use LIKE. The wildcards signify that we don't care what precedes Tennis and what follows it; there could be zero or more characters before and after. We just check that the name contains Tennis. Finally we select all the columns for these facilities, and that's our result. Beware, like I said before, of your use of quotes. What you have here is a string, a piece of text that allows you to do the match, therefore you need single quotes. If, as is likely to happen, you used double quotes, you would get an error telling you that the column "Tennis" does not exist, because double quotes are used to represent column names, not pieces of text. So be careful with that.
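Here is how those two queries might look in full; the column list in the first one mirrors the expected results shown on the site.

```sql
-- Facilities that charge members a fee less than 1/50th of the monthly maintenance cost
SELECT facid, name, membercost, monthlymaintenance
  FROM cd.facilities
 WHERE membercost > 0
   AND membercost < monthlymaintenance / 50;

-- Facilities with the word Tennis in their name (single quotes around the text pattern)
SELECT * FROM cd.facilities WHERE name LIKE '%Tennis%';
```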
"Matching against multiple possible values": can we get the details of the facilities with ID 1 and ID 5? Where is my data? In the facilities table. Do I need all the rows from this table, or only certain ones? Only certain rows, those with ID 1 and ID 5, so I need a WHERE statement. My conditions here are facility ID equals 1 and facility ID equals 5, and now which operator do I use to chain them? I need to use the OR operator, because only one of these needs to be true for the whole expression to evaluate to true; in fact only one of them can be true, since it's impossible for the ID of a facility to equal 1 and 5 at the same time, so AND would not work, and what we need is OR. Finally, we get all the data, meaning all the columns about the facility, with SELECT *.

The problem is now solved, but imagine that tomorrow we need this query again and have to include another ID, say 10. We could add OR facility ID equals 10, but this becomes a bit unwieldy: imagine having a list of 10 IDs and writing OR every time. It's not a very scalable approach. As an alternative we can say facility ID IN and then list the values, 1 and 5. If I make this my condition I get the same result, but this is a more elegant approach and it's also more scalable, because it's much easier to come back and insert other IDs in this list, so it's the preferred solution in this case. Logically, IN looks at the facility ID of each row and checks whether that ID is included in the list: if it is, it returns true and the row is kept; if it's not, it returns false and the row is dropped. We shall see a bit later that the IN notation is also powerful because, instead of a static list of IDs like 1 and 5, in more advanced use cases we can provide another query, a subquery, that dynamically retrieves a certain list, and we shall see that in later exercises.

"Classify results into buckets": produce a list of facilities and label them cheap or expensive based on their monthly maintenance. Do we need a filter, do we need to drop certain rows? No, we actually don't: we want all the facilities, and then we want to label them. We select the name of the facility, and then we need to provide the label. Which SQL construct can we use to produce a text label according to the value of a certain column? What we need here is a CASE statement, which implements conditional logic, a branching, similar to the if/else statements in other programming languages: if the monthly maintenance cost is more than 100 then it's expensive, otherwise it's cheap, and that calls for a CASE statement. I always start with CASE and end with END, and I write both at the beginning so I don't forget them. Then for each condition I write WHEN: the condition I'm interested in is the monthly maintenance being above 100, and in that case I output a piece of text that says 'expensive'; remember, single quotes for text. Next, I could write the other condition explicitly, but if the value is not above 100 then it's 100 or less, so all I need here is an ELSE, and in that case I output the piece of text 'cheap'. Finally, I have a new column and I can give it a label; I'll call it cost, and I get my result. Whenever you need to put values into buckets, or label values according to certain rules, that's usually when you need a CASE statement.
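Written out, the two patterns just discussed look like this:

```sql
-- Details of facilities 1 and 5: IN scales better than chaining ORs
SELECT * FROM cd.facilities WHERE facid IN (1, 5);

-- Label each facility cheap or expensive based on its monthly maintenance
SELECT name,
       CASE WHEN monthlymaintenance > 100 THEN 'expensive'
            ELSE 'cheap'
       END AS cost
  FROM cd.facilities;
```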
"Working with dates": let's get a list of members who joined after the start of September 2012. Looking at the tables, our data is in the members table, so I will start writing from that. Do I need to filter this table? Yes, I only want to keep members who joined after a certain time. How can I write the condition? I can say WHERE join date is greater than or equal to 2012-09-01. Luckily, filtering on dates in SQL, and in Postgres, is quite intuitive: even though the join date is a timestamp that represents a specific moment in time down to the second, we can compare it against just a date and SQL will fill in the remaining values, so the filter works. We use greater than or equal because we also want to include those who joined on the first day. Next we want a few columns for these members, so I copy them into the SELECT, and this solves our query.

"Removing duplicates and ordering results": we want an ordered list of the first 10 surnames in the members table, and the list must not contain duplicates. Let's start by getting our table, which is the members table. If I select the surnames, I will see that some surnames are shared by several members, so there are duplicates here. What can we do in SQL to remove duplicates? We have seen in the mental models course that we have the DISTINCT keyword, and DISTINCT removes all duplicate rows based on the columns we have selected, so if I run this again I don't see any duplicates anymore. Now the list needs to be ordered alphabetically, as I see in the expected results, and we can do that with ORDER BY: when you ORDER BY a piece of text, the default behavior is alphabetical order, and if I used DESC it would be reverse alphabetical order, which is not what I need. Finally I want the first 10 surnames: how can I return only the first 10 rows of my result? With the LIMIT statement, so LIMIT 10 gives me the first 10 surnames, and since I have ordered alphabetically I get the first 10 surnames in alphabetical order, and this is my result.

Going back to our map: FROM gets a table, WHERE drops rows we don't need from that table, and all the way down we have SELECT, which gets the columns we need, and then DISTINCT. DISTINCT needs to know which columns we selected because it drops duplicates based on those columns; in this example we're only taking a single column, surname, so DISTINCT drops duplicate surnames. At the end, when all the processing is done, we can order our results, and once they are ordered we can use LIMIT to limit the number of rows we return. I hope this makes sense.
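Here are the two resulting queries; the column list in the first mirrors the expected results on the site.

```sql
-- Members who joined from the start of September 2012 onwards
SELECT memid, surname, firstname, joindate
  FROM cd.members
 WHERE joindate >= '2012-09-01';

-- First ten distinct surnames, in alphabetical order
SELECT DISTINCT surname
  FROM cd.members
 ORDER BY surname
 LIMIT 10;
```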
"Combining results from multiple queries": let's get a combined list of all surnames and all facility names. Where are the surnames? In cd.members, and from cd.members I can select surname, which gives me the list of all surnames. Where are the facility names? In cd.facilities, and I could write SELECT name FROM cd.facilities to get a list of all the facilities. Now we have two distinct queries that both produce a list, a column of text values, and we want to combine them, meaning we want to stack them on top of each other. How does that work? If I just hit run like this, I get an error, because I have two distinct queries here and they're not connected in any way; but when I have two or more queries defining tables and I want to stack them on top of each other, I can use the UNION statement. With UNION, all the surnames are stacked vertically with all the names and I get a single list containing both columns.

As I mentioned in the mental models course, plain UNION typically means UNION DISTINCT; other systems like BigQuery don't even allow you to write just UNION, they want you to specify UNION DISTINCT. What it actually does is that, after stacking these two tables together, it removes all duplicate rows. The alternative is UNION ALL, which does not do this and keeps all the rows; since we have some duplicate surnames, with UNION ALL we would get them in the output, and it wouldn't match our expected result. If you write just UNION, it behaves as UNION DISTINCT and you won't have any duplicates.

If you look at our map for the logical order of SQL operations: we get the data from a certain table, filter it, do all sorts of operations on this data, select the columns we need, and then we can remove the duplicates from this one table. What comes next is that we can combine this table with other tables: we can tell SQL to stack this table on top of another table, and this is where UNION comes into play. Only after we have combined all the tables, only after we have stacked them all up on top of each other, can we order the results and limit the results. Also remember, and I showed this in detail in the mental models course, that when I combine two or more tables with a UNION, they must have exactly the same number of columns, and the corresponding columns must have the same data type. In this case both tables have one column and it is text, so the union works. If I added another column, an integer column, to only one of them, it would fail, because the queries in a union must have the same number of columns; but if I added an integer column in the second position of both tables, it would work again, because the column counts match and the data types match.
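The combined query is short:

```sql
-- Stack all surnames and all facility names into one column;
-- plain UNION also removes duplicate rows (use UNION ALL to keep them)
SELECT surname FROM cd.members
UNION
SELECT name FROM cd.facilities;
```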
"Simple aggregation": I need the signup date of my last member, so I need to work with the members table, which has a field called join date, and I need the latest value of this date, the time when a member last joined. How can I do that? I can take my join date field and run an aggregation on top of it. The correct aggregation in this case is MAX, because when it comes to dates MAX takes the latest date, whereas MIN takes the earliest date. I can label this as latest and get the result I need. How do aggregations work? They are functions: you write the name of the function and then, in round brackets, you provide the arguments; the first argument is always the column on which to run the aggregation. What the aggregation does is take a list of values, whether 10, 100, a million or 10 million, it doesn't matter, and compress that list into a single value; in this case it takes all of the dates and returns the latest date. To place this in our map: we get the data from the table, we filter it, and then sometimes we do a grouping, which we shall see later in the exercises, but whether we group or not, aggregations happen here. If we haven't done any grouping, the aggregation works at the level of all the rows: in the absence of grouping, as in this case, the aggregation looks at all the rows in my table, except for the rows that I filtered away, and compresses them into a single value.

"More aggregation": we need the first and last name of the last member who signed up, not just the date. In the previous exercise we saw that SELECT max(joindate) FROM members gives us the last join date, the date when the last member signed up. Given that I want the first and last name, you might think you can just add first name and surname in here, but this actually doesn't work; it gives an error, that the column first name must appear in the GROUP BY clause or be used in an aggregate function. The meaning behind this error, and how to avoid it, is described in detail in the mental models course in the GROUP BY section, but the short version is this: with this aggregation you are compressing the join date to a single value, but you are doing no such compression or aggregation for first name and surname, so SQL is left with the instruction to return a single value for one column but multiple values for the others, and that does not work, because all columns need to have the same number of values, so it throws an error.

What we really need to do is take this maximum join date and use it in a WHERE filter, because we only want to keep the row that corresponds to the latest join date: take the members table, keep the row where the join date equals the max join date, and from that select the first name and the surname. Unfortunately, writing max() directly in the WHERE also doesn't work: as we saw in the course, you are not allowed to use aggregations inside WHERE. The reason is actually pretty clear: aggregations happen at a later stage in the process, and they need to know whether a GROUP BY has occurred, that is, whether they have to run over all the rows in the table or only within the groups defined by the GROUP BY. At the WHERE stage the grouping hasn't happened yet, so we don't know at which level to execute the aggregations, and because of this we are not allowed to do aggregations inside the WHERE statement.

So how can we solve the problem? A sort of cheating solution would be: if we knew the exact value of the latest join date, we could place it in the filter and it would work, since we're not using an aggregation, and we could add the join date to the SELECT to display it as well. But this is a bit of a cheat, because the maximum join date is a dynamic value that changes with time; we don't want to hardcode it, we want to compute it. Since that is not allowed directly in the WHERE, what we actually need is a subquery: a SQL query that runs within another query to return a certain result. We can have a subquery by opening round brackets in the filter and writing a query that goes to the members table and selects the maximum join date, and this is our actual solution. In this execution, you can imagine SQL going inside the subquery, running it, getting the maximum join date, placing it in the filter, keeping only the row for the latest member who joined, and then retrieving what we need about this member.
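Putting both steps together, the queries might look like this:

```sql
-- The latest signup date on its own
SELECT max(joindate) AS latest FROM cd.members;

-- First and last name of the last member to sign up:
-- the scalar subquery computes the latest join date, and the outer WHERE uses it
SELECT firstname, surname, joindate
  FROM cd.members
 WHERE joindate = (SELECT max(joindate) FROM cd.members);
```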
Let us now move to the joins and subqueries exercises. The first exercise is "Retrieve the start times of members' bookings". We can see that the information we need is spread across tables: we want the start time of bookings, and that information is in the bookings table, but we want to filter to the member named David Farrell, and the name of the member is contained in the members table, so we will need a join. If we briefly look at the map of the order of SQL operations, we can see that FROM and JOIN are really the same step. Sometimes all my data is in one table and I just provide the name of that table in the FROM; sometimes I need to combine two or more different tables to get my data, and in that case I use a JOIN. But everything in SQL works with tables: when I take two or more tables and combine them together, what I get at the end is just another table, and this is why FROM and JOIN are actually the same component and the same step.

As usual, let us start with the FROM part. We take the bookings table and join it to the members table, and I can give an alias to each table to make my life easier: I will call the bookings table book and the members table mem. Then I need to specify the logical condition for joining these tables, which is that the member ID column in the bookings table is really the same thing as the member ID column in the members table. Concretely, you can imagine SQL going row by row in the bookings table, looking at the member ID, checking whether this ID is present in the members table, and, if it is present, combining the current row from bookings with the matching row from members; it does this for all the matching rows and drops the rows that don't have a match. We saw that in detail in the mental models course, so I'm not going to go in depth into it.

Now that we have the table that comes from the join of members and bookings, we can properly filter it. What we want is for the first name column coming from the members table to be David, so mem.firstname, the parent table and then the column name, equals 'David', and for the surname to equal 'Farrell'; remember single quotes when using pieces of text. This is a WHERE filter with two logical conditions, and we use the operator AND because both of them need to be true. Now that we have filtered our data, we finally need to select the start time, and that's our query.

Remember that when we write just JOIN in a query, what's implied is an INNER JOIN. There are several types of join, but the inner join is the most common, so it's the default one. What an inner join means is that it returns, from the two tables we're joining, only the rows that have a match; all the rows that don't have a match are dropped. So if there's a row in bookings with a member ID that doesn't exist in the members table, that row will be dropped, and conversely, if there's a row in the members table with a member ID that is not referenced in the bookings table, that row will also be dropped. That's an inner join.
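Assembled, the query might look like this (aliases as in the walkthrough):

```sql
-- Start times of David Farrell's bookings: an inner join plus a WHERE filter
SELECT book.starttime
  FROM cd.bookings book
  JOIN cd.members  mem ON mem.memid = book.memid
 WHERE mem.firstname = 'David'
   AND mem.surname   = 'Farrell';
```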
"Work out the start times of bookings for tennis courts": we need to get the facilities that are actually tennis courts and then, for each of those facilities, get the start times of their bookings on a specific date. We know we need data from two tables, because the name of the facility is in facilities but the data about the bookings is in bookings, so I will write FROM cd.facilities JOIN cd.bookings. What fields can we join on logically? Let me first give an alias to each table: I will call facilities facs and bookings book. What I need is for the facility ID to match on both sides.

Now we can work on our filters. First of all, I only want to look at tennis courts, and if you look at the result, that means we want to see Tennis in the name of the facility. We can filter on string patterns, on text patterns, using LIKE: I take the facility name and match it against a pattern containing Tennis, where the percent signs are wildcards meaning that Tennis can be preceded and followed by zero or more characters; we don't care, we just want the strings that have Tennis in them. But that's not enough as a condition: we also need the booking to have happened on a specific date, so I put an AND here, because we're providing two logical conditions and both need to be true, and then I take the start time from the bookings table and say it should be equal to the date provided in the instructions. However, this will not work; I can actually complete the query and show you that we get zero results. Can you figure out why this didn't work?

I'm going to write a few comments here, and this is how you write them: they are just pieces of text, not executed as code, and I'll use them to show you what's going on. A value of start time looks like a timestamp, showing a specific point in time down to the hour, minute and second. The date we are providing for the comparison is less granular: it has no hour, minute or second. In order to compare these two different things, SQL automatically fills in the missing parts of our date with zeros, so the reference becomes the very first moment of that day, and only then does it compare them. If you look at the comparison between these two elements, it is false, because the hour is different. So with this filter, SQL is looking at every single start time and comparing it with the very first moment of that date, but there is no start time that is exactly equal to that moment, so the condition is always false, and thus we get zero rows in our result.

What is the solution? Before comparing a start time from the data, we can pass it through the date function: it drops the extra information about hour, minute and second and keeps only the date. If I take my example and put it into the date function, it becomes the plain date, and comparing it with my reference date is now true. All this to say that before we compare the start time with our reference date, we need to reduce its granularity to a date. If I run the query now, I actually get my start times. After this I just need to add the name, and finally I need to order by time, so I order by book.starttime.

There is still a small error: sometimes you just have to compare what you get with what's expected, and if you notice, we are also returning data about the Table Tennis facility, while we are only interested in tennis courts. What are we missing? The string filter is not precise enough, so we change the pattern to 'Tennis Court', and now we get our results.
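Here is the finished query. I've written the granularity reduction as a cast to date, which in Postgres is equivalent to wrapping the value in the date function used in the walkthrough; the date literal stands in for the date named in the exercise (on the site it is 2012-09-21).

```sql
-- Start times of tennis court bookings on the exercise date, ordered by time.
-- starttime::date reduces the timestamp to a plain date before comparing.
SELECT book.starttime AS start, facs.name
  FROM cd.facilities facs
  JOIN cd.bookings   book ON book.facid = facs.facid
 WHERE facs.name LIKE '%Tennis Court%'
   AND book.starttime::date = '2012-09-21'
 ORDER BY book.starttime;
```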
"Produce a list of all members who have recommended another member": if we look at the members table, we have all this data about each member, and then we know whether they were recommended by another member; recommendedby is the ID of the member who recommended them. Because of this, the members table, as we said, has a relation to itself, since one of its columns references its own ID column. Let's see how to put this into practice. To be clear, I simply want a list of members who appear to have recommended another member. If I wanted just the IDs of these people, my task would be much simpler: I would go to the members table, select recommendedby, and put a DISTINCT in there to avoid repetitions, and what I would get is the IDs of all members who have recommended another member. However, the problem does not want this; it wants the first name and surname of these people. In order to get the first name and surname, I need to plug this ID back into the members table and get the data there. For example, if I went to the members table and selected everything where the member ID is 11, I would get the data for that first recommender, but I need to do this for all of them.

So what I will have to do is take the members table and join it to itself. The first time I take the table, I'm looking at the members, quite simply; the second time I take the members table, I'm looking at data about the recommenders of those members, so I will call this second instance recs. Both of these instances come from the same table, but they are now two separate instances. What is the logic to join these two tables? The members table has the recommendedby field; we take the ID from recommendedby and plug it back into the table, into the member ID, to get the data about the recommender. Then we can go into the recommenders instance, which we got by plugging in that ID, and select their first name and surname. I want to avoid repetition, because a member may have recommended multiple members, so I put a DISTINCT to make sure I don't get any repeated rows at the end, and finally I order by surname and first name, and I get my result. I encourage you to play with this and experiment a bit until it is clear; in my mental models course I go into depth on the self join and do a visualization in Google Sheets that makes it much clearer.
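A sketch of that self join (aliases mems and recs for the two instances):

```sql
-- Members who have recommended at least one other member:
-- join members to itself, plugging recommendedby back into memid
SELECT DISTINCT recs.firstname, recs.surname
  FROM cd.members mems
  JOIN cd.members recs ON recs.memid = mems.recommendedby
 ORDER BY recs.surname, recs.firstname;
```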
"Produce a list of all members, along with their recommender": in the members table we have a few columns and then the recommendedby column. Sometimes it holds the ID of another member who recommended this member, and it can repeat, because the same member may have recommended multiple people; sometimes it is empty, and when it is empty we have a null, which is the value SQL uses to represent absence of data.

Let us first count the rows in members. You might know that to count the rows of a table we can do a simple aggregation, count(*), and we get 31; let's make a note that members has 31 rows, because the result must list all members, so we must ensure that we return 31 rows in our results. Now I delete this SELECT and, as before, for each member I want to take the ID in recommendedby and plug it back into the table, into the member ID, so I can get the data about the recommender as well, and I can do that with a self join. I take members and join it on itself; the first instance I will call mems and the second recs, and the logic for joining is that mems.recommendedby is connected to recs.memid: it takes the ID in the recommendedby field and plugs it back into the member ID to get the data about the recommender. What do I want from this? The first name and surname of the member, and the first name and surname of the recommender. It's starting to look like the right result, but how many rows do we think we have here?

In order to count the rows, I can write SELECT count(*) FROM and then enclose this whole query in brackets, so it becomes a subquery. A subquery used like this must have an alias, otherwise it doesn't work (this changes a bit by system, but in Postgres you need it), so I give it an alias, simply t1. SQL first computes the content of the subquery, the table we saw before, and then runs count(*) on it, and we see that the result has 22 rows. This is an issue, because we saw before that members has 31 rows and we want to return all of the members, so our result should also have 31 rows. Can you figure out why we are missing some rows?

The issue is that we are using an inner join; remember, when we don't specify the type of join, it's an inner join, and an inner join keeps only rows that have matches. We saw before that in members the recommendedby field is sometimes empty, it has a null value, maybe because we don't know who recommended them, or maybe the member wasn't recommended by anyone and just applied themselves. What happens when we use that field in an inner join and it has a null value? The row for that member is dropped, because null cannot match any ID, and so we lose it. That's not what we want, so instead of an inner join we need a LEFT JOIN. The left join looks at the left table, the table to the left of the JOIN keyword, and makes sure to keep all the rows in that table, even the rows that don't have a match; for the rows without a match, it doesn't drop them, it just puts nulls in the values that correspond to the right table. If I run the count again I get 31, so now I'm keeping all the members and I have the number of rows I need.

Now I can get rid of the counting, since I know I have the right number of rows, and bring back my selection. It would help to make this a bit more ordered and assign aliases to the columns, so following the expected results I call them memfname, memsname, recfname and recsname. Now we have proper labels, and you can see that we always have the name of the member, but some members weren't recommended by anyone, so for the first and last name of the recommender we simply have null values; this is what the left join does. The last step is to order by the last name and then the first name of each member, and we finally get our result. Typically you use the inner join, which is the default join, because you're only interested in the rows from both tables that actually have a match; but sometimes you want to keep all the data from one table, and then you put that table on the left side and do a left join, as we did in this case.
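The finished left join, with the column aliases taken from the expected results:

```sql
-- All members and, where present, their recommender; LEFT JOIN keeps members
-- whose recommendedby is NULL, filling the recommender columns with NULLs
SELECT mems.firstname AS memfname,
       mems.surname   AS memsname,
       recs.firstname AS recfname,
       recs.surname   AS recsname
  FROM cd.members mems
  LEFT JOIN cd.members recs ON recs.memid = mems.recommendedby
 ORDER BY memsname, memfname;
```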
"Produce a list of all members who have used a tennis court": for this problem we need to combine data from all our tables, because we need to look at the members, look at their bookings, and check the name of the facility for each booking. As always, let us start with the FROM part and join all of these tables: bookings joined to facilities on the facility ID, and then also joined to members. We can always join two or more tables; in this case we're joining three, and the way it works is that the first join creates a new table, and then this new table is joined with the next one, and this is how multiple joins are managed.

Now I have my table, which is the join of all three tables, and we're only interested in members who have used a tennis court: if a member has made no bookings, we're not interested in that member, so it's okay to use a plain join and not a left join; and for each booking we want to see the name of the facility, so if there were a booking without a facility name we wouldn't be interested in that booking anyway, which means this join can also be an inner join and doesn't need to be a left join. This is how you can think about whether to use a join or a left join. We want the booking to be for a tennis court, so we filter this table by looking at the name of the facility and making sure it has Tennis Court in it, with the LIKE operator. Now that we have filtered, we can select the first name and the surname of the member, and the facility name, and we have a starting result.

In the expected result, the first name and the surname are merged into a single string, and in SQL you can do that with a concatenation operator, which basically takes two strings and puts them together into one string. If I just concatenate them, the result looks a bit weird, so I add an empty space in between and concatenate again, and now the names look fine. I also want to label this column as member and the other as facility, to match the expected results. Next I need to ensure there is no duplicate data, so at the end of it all I want a DISTINCT to remove duplicate rows, and then I want to order the final result by member name and facility name, so ORDER BY member and then facility. This works because the ORDER BY, coming at the end of our logical order of SQL operations, is aware of the aliases, of the labels I have put on the columns, and here I get the results I needed. Not a lot new is happening here, to be honest: we're joining three tables instead of two, but it still works just like any other join, and then we concatenate the strings, filter according to the facility name, remove duplicate rows, and finally order.
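Put together, it might read like this (alias names are just my choice for readability):

```sql
-- Members who have used a tennis court; || concatenates strings in Postgres
SELECT DISTINCT mems.firstname || ' ' || mems.surname AS member,
       facs.name AS facility
  FROM cd.bookings   bks
  JOIN cd.facilities facs ON facs.facid = bks.facid
  JOIN cd.members    mems ON mems.memid = bks.memid
 WHERE facs.name LIKE '%Tennis Court%'
 ORDER BY member, facility;
```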
"Produce a list of costly bookings": we want to see all bookings that occurred on a particular day, see how much each cost the member, and keep the bookings that cost more than $30. Clearly in this case we also need information from all the tables, because if you look at the expected results we want the name of the member, which is in the members table, the name of the facility, which is in the facilities table, and the cost, for which we will need the bookings table. So we need to start with a join of these three tables, and since we did it already in the last exercise, I have copied the code for that join (if you want more detail on it, go and check the last exercise), as well as the code that gets the name of the member by concatenating strings and the name of the facility.

Now we need to calculate the cost of each booking. How does that work, looking at our data? We have a list of bookings, and a booking is defined by a number of slots, where a slot is a 30-minute usage of the facility. We also have the member ID, which tells us whether the person is a guest or a member: if the member ID is zero then that person is a guest, otherwise that person is a member. We also know which facility this person booked, and if we look at the facility it has two different prices, one for members and one for guests, and the price applies per slot. So we have all the ingredients we need for the cost in our join, and to convince ourselves of that, let us actually select them: in bookings I can see the facility ID, the member ID and the slots, and in facilities I can see the member cost and the guest cost, and that's really all I need to calculate the cost. As you can see, after the join I'm in a really good position, because each row has all of these values placed side by side.
Now I just have to figure out how to combine these values to get the cost. The way to get it is to look at the number of slots and multiply it by the right price, which is either the member cost or the guest cost, and the choice depends on the member ID: if the member ID is zero I will use the guest cost, otherwise I will use the member cost. So let me go back to my code: I want to take the slots and multiply them by either the member cost or the guest cost, and I need some logic that chooses between the two based on the ID of this person. What can I use to make this choice? Whenever I have such a choice to make, I need to use a CASE statement. I can start with the CASE statement here, and I will already write the END of it so that I don't forget it; then in the CASE statement I need to check whether the member ID is zero: in that case I will use the guest cost, and in all other cases I will use the member cost. So I'm taking the slots and using this CASE WHEN to decide by which column I'm going to multiply them, and this is actually my cost.

Now let's take a look at this, and I get an error: the column reference memid is ambiguous. Can you figure out why? What's happening is that I have joined multiple tables and the member ID column now appears twice in my join, so I cannot refer to it just by name, because SQL doesn't know which column I want; I have to reference the parent of the column every time I use it, so here I will say that it comes from the bookings table, and now I get my result. I can see that I have successfully calculated my cost. Let's look at the first row: the member ID is not zero, so it's a member, and here the member cost is zero, meaning this facility is free for members, so regardless of the slots the cost will be zero. Now let's look at one who is a guest: this one is clearly a guest, they have taken one slot, the facility is free for members but costs five per slot for guests, so the total cost is five. Based on this sanity check, the cost looks good.

Now I need to actually filter my table, because we should consider only bookings that occurred on a certain day, so after creating my new table with the join I can write a WHERE filter to drop the rows I don't need: the time column I'm interested in is the start time, and it needs to be equal to this date. We have seen before that this will not work as-is, because the start time is a timestamp that also shows hour, minutes and seconds, whereas this is just a date, so the comparison would fail; before I do the comparison I need to take the start time and reduce it to a date, so that I'm comparing apples to apples on the time. I check that this didn't break anything, and now we should have significantly fewer rows.

Next we need to keep only rows that have a cost higher than 30. Can I go here and say AND cost bigger than 30? No, I cannot: column cost does not exist, a typical mistake. If you look at the logical order of SQL operations, first you have the sourcing of the data, then you have the WHERE filter, and all of the logic by which we calculate the cost happens later, where the label cost is applied as well. So we cannot filter on the cost column in the WHERE,
because the WHERE component has no idea about the column cost, so this will not work. What we can do instead is take all of the logic we've written so far, wrap it in round brackets, and introduce a common table expression: I write WITH t1 AS, then FROM t1, and now I can use my filter, cost bigger than 30, and SELECT * from this table, and I'm starting to get somewhere, because the cost has been successfully filtered. I still have a lot of columns in my result that I only used to help me reason about the cost, so I want to keep member and facility and drop the rest.

As a final step I need to order by cost, descending. There is actually an issue caused by my copy-pasting code from the previous exercise: I kept a DISTINCT, and you have to be very careful with this, especially when you copy-paste code (for learning it would be best to always write it from scratch). The DISTINCT removes duplicate rows and can actually cause a problem here. If I remove the DISTINCT, I get the solution I want: if you look at the last two rows of the result, they are absolutely identical, and the DISTINCT would have removed one of them, but they are genuinely two separate bookings that happen to look the same in our data, and we want to keep both, not delete them. Having the DISTINCT was a mistake in this case.

To summarize what we did here: first we joined all the tables so we could have all the columns we needed side by side, then we filtered on the date, which was pretty straightforward, then we took the first name and surname and concatenated them together, as well as the facility name, and then we computed the cost; to compute the cost we took the number of slots and used a CASE WHEN to multiply it by either the guest cost or the member cost, according to the member's ID. At the end we wrapped everything in a common table expression so that we could filter on this newly computed value of cost and keep only the bookings that cost more than 30.

Now, I am aware that the question said not to use any subqueries; technically I didn't, because this is a common table expression. If you look at the author's solution, it is slightly different from ours. They did basically the same thing we did to compute the cost, except that inside the CASE WHEN they inserted the whole expression, which is fine and works just the same. The difference is that they added a lot of logic to the WHERE filter so they could do everything in a single query: clearly they couldn't use any columns added at the SELECT stage, such as cost, because, like we said, that wouldn't be possible. So they added the date filter, and then a logical expression in which either one of two conditions must be true to keep the row: either the member ID is zero, meaning it's a guest, and the calculation based on the guest cost ends up bigger than 30, or the member ID is not zero, meaning it's a member, and the calculation based on the member cost ends up bigger than 30. This works, but I personally think there is quite some repetition of the cost calculation, once in the WHERE filter and once inside the CASE WHEN, so I think the solution we have here is a bit cleaner.
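A sketch of the CTE version described above (the date literal stands in for the day named in the exercise; on the site it is 2012-09-14):

```sql
-- Bookings on the given day that cost the booker more than $30, most expensive first
WITH t1 AS (
    SELECT mems.firstname || ' ' || mems.surname AS member,
           facs.name AS facility,
           bks.slots * CASE WHEN bks.memid = 0 THEN facs.guestcost
                            ELSE facs.membercost
                       END AS cost
      FROM cd.bookings   bks
      JOIN cd.facilities facs ON facs.facid = bks.facid
      JOIN cd.members    mems ON mems.memid = bks.memid
     WHERE bks.starttime::date = '2012-09-14'
)
SELECT member, facility, cost
  FROM t1
 WHERE cost > 30
 ORDER BY cost DESC;
```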
We're only calculating the cost once, and then we're simply referencing it thanks to the common table expression. If you look at the mental models course, you will see that I warmly recommend not repeating logic in your code and using common table expressions as often as possible, because I think they make the code clearer and simpler to understand.

"Produce a list of all members and their recommender, without any joins": we have already solved this problem, and we solved it with a self join. As you remember, we take the members table and join it on itself so that we can take the recommendedby ID, plug it into the member ID, and see the names of both the member and the recommender side by side. Here, however, we are challenged to do it without a join. Let's go to the members table and select the first name and the surname; we actually want to concatenate these two into a single string and call it member. Now, how can we get data about the recommender without a self join? Typically, when you have to combine data, you always have a choice between a join and a subquery. So what we can do is add a subquery here that looks at the recommendedby ID from this table, goes back to the members table, and gets the data we need. Let's see how that would look: give an alias to the outer table and call it mems, then inside the subquery go back to the members table and call it recs, and again select the first name and surname, like we're doing here. How do we identify the right row inside the subquery? We can use a WHERE filter: we want recs.memid to be equal to mems.recommendedby. Once we get this value we can call it recommender. We also want to avoid duplicates, so after the outer SELECT we say DISTINCT, which removes any duplicates from the result, and then we sort, by member and recommender, and here we get our result.

So this is replacing a join with a subquery: we go row by row in members, we take the recommendedby ID, we query the members table again inside the subquery, and we use the WHERE filter to plug in that recommendedby value and find the row whose member ID is equal to it; then, getting the first name and surname, we get the data about the recommender, and that's how we can do it. In the mental models course we discuss subqueries, and in this particular case we talk about a correlated subquery. Why is this a correlated subquery? Because you can imagine that the query in here runs again for every row: for every row I have a different recommendedby value, and I need to plug that value into the members table to get the data about the recommender. It is a correlated subquery because it runs every time and is different for every row of the members table.
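A sketch of the join-free version, with the correlated subquery in the SELECT list:

```sql
-- Each member and their recommender, without a join: the subquery runs once per member row
SELECT DISTINCT mems.firstname || ' ' || mems.surname AS member,
       (SELECT recs.firstname || ' ' || recs.surname
          FROM cd.members recs
         WHERE recs.memid = mems.recommendedby) AS recommender
  FROM cd.members mems
 ORDER BY member, recommender;
```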
"Produce a list of costly bookings, using a subquery": this is the exact exercise we did before, and as you will remember we actually ignored the instructions a bit and used not a subquery but a common table expression. For reference, this is the code we used; it works for this exercise as well and we get the result, so you can go back to that exercise to see the logic behind this code and why it works. If we look at the author's solution, they are actually using a subquery instead of a common table expression: they have an outer query, SELECT member, facility, cost FROM, and then, instead of the name of a table after the FROM, they have all of the logic in a subquery which they call bookings, and finally they add a filter and an order. This is technically correct and it works, but I'm not a fan of writing queries like this; I prefer writing them as a common table expression, and I explain this in detail in my mental models course. The reason I prefer it is that it doesn't break queries apart: in my case, this is one query and that is another query, and it's pretty easy and simple to read. In the author's case, you start reading the outer query and it is broken in two by another query, and when people do this they sometimes go even further, putting yet another subquery where the FROM expects a table, and it gets really complicated. Because the two approaches are equivalent, I definitely recommend going for a common table expression every time and avoiding subqueries in the FROM unless they are really compact and you can fit them on one line.

Let us now get started with the aggregation exercises. The first problem: "Count the number of facilities". I can go to the facilities table, and when I want to count the number of rows in a table, and here every row is a facility, I can use the count(*) aggregation, and we get the count of facilities. What we see here is a global aggregation: when you run an aggregation without having done any grouping, it runs on the whole table, so it takes all the rows of the table, no matter how many, and compresses them into one number, determined by the aggregation function; in this case we have a count, and it returns a total of nine. In our map, aggregation happens right here: we source the table, we filter it if needed, then we might do a grouping, which we didn't do in this case, but whether we do or not, aggregations happen here, and if no grouping happened, the aggregation is at the level of the whole table.

"Count the number of expensive facilities": this is similar to the previous exercise. We can go to the facilities table, but here we add a filter, because we're only interested in facilities that have a guest cost greater than or equal to 10, and then once again I can use my count(*) aggregation to count the number of rows of the resulting table. Looking again at our map, why does this work? With the FROM we're sourcing the table, immediately after that the WHERE runs and drops the unneeded rows, then we can decide whether to group or not, which we're not doing in this case, and only then do the aggregations run. By the time the aggregations run, I have already dropped the rows in the WHERE, and this is why, after dropping some rows, the aggregation only sees six rows, which is what we want.
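The two counting queries side by side:

```sql
-- All facilities
SELECT count(*) FROM cd.facilities;

-- Only the expensive ones: the WHERE runs before the aggregation, so count(*) sees fewer rows
SELECT count(*) FROM cd.facilities WHERE guestcost >= 10;
```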
"Count the number of recommendations each member makes": in the members table we have the recommendedby field, which holds the ID of the member who recommended the member that the row is about. We want to take all these recommendedby values and count how many times each appears. So I go to my members table, and what I need to do here is GROUP BY recommendedby: this takes all the unique values of the recommendedby column and allows me to run an aggregation over all the rows in which each of those values occurs. In the SELECT I name this column again, and if I run the query I get all the unique values of recommendedby without any repetitions; now I can add an aggregation like count(*). What this does is that, for the recommendedby value 11, it runs the aggregation on all the rows in which recommendedby is 11, and since the aggregation is count(*), it returns the number of rows in which 11 appears, which in the result happens to be one, and so on for all the values. I also want to order by recommendedby to match the expected results.

What we get here is almost correct: we see all the unique values of this column and the number of times each appears in our data, but there is one discrepancy, the last row. In this last row you cannot see anything, which means it's a null value, a value that represents absence of data. Why does this occur? If you look at the original recommendedby column, there is a bunch of null values in it, because a bunch of members have null in recommendedby: maybe we don't know who recommended them, or maybe they weren't recommended and just applied independently. When you GROUP BY, you take all the unique values of the recommendedby column, and that includes the null value: the null value defines a group of its own, and the count works as expected, because we can see that there are nine members for whom we don't have the recommendedby value. But the solution does not want to see this, because we only want the number of recommendations each member has made, so we actually need to drop this row. How can I drop it? It's as simple as putting a simple filter after the FROM and saying recommendedby IS NOT NULL, which drops all the rows in which that value is null, so it won't appear in the grouping, and now our results are correct. Remember: when you're checking whether a value is null or not, you need to use IS NULL or IS NOT NULL; you cannot use equal or not equal, because null is not an actual value, it's just a notation for the absence of a value, so you cannot say that something is equal or not equal to null, you have to say that it is null or is not null.

"List the total slots booked per facility": first question, where is the information I need? The number of slots booked is in cd.bookings, where I also have the facility ID, so I can work with that table. How can I get the total slots for each facility? I can GROUP BY the facility ID and then select that facility ID, and within each unique facility ID, what type of aggregation might I want to do? Every booking has a certain number of slots, and we want to find all the bookings for a certain facility ID and then sum all the slots being booked, so I write sum(slots), and I want to name this column "Total Slots", looking at the expected results. That will not work without quotes, because it is two separate words, so I actually need to quote it, and remember it has to be double quotes, because it's a column name: it's always double quotes for column names and single quotes for pieces of text. Finally I need to order by facility ID, and I get the results. For facility ID zero, we looked at all the rows where the facility ID was zero and squished them to a single value, the unique facility ID, and then we looked at all the slots occurring in those rows and squished them to a single value as well, using the sum aggregation, summing them all up, and we get the sum of the total slots.
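The two grouped queries:

```sql
-- Number of recommendations each member has made (rows with a NULL recommender are dropped first)
SELECT recommendedby, count(*)
  FROM cd.members
 WHERE recommendedby IS NOT NULL
 GROUP BY recommendedby
 ORDER BY recommendedby;

-- Total slots booked per facility; the alias needs double quotes because it contains a space
SELECT facid, sum(slots) AS "Total Slots"
  FROM cd.bookings
 GROUP BY facid
 ORDER BY facid;
```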
List the total slots booked per facility in a given month. This is similar to the previous problem, except that we are now isolating a specific time period, so let us think about how to select bookings that happened in the month of September 2012. Going to the bookings table, I select the starttime column, and to help the exploration I order by starttime descending and limit the results to 20. You can see that starttime is a timestamp column that goes down to the second: year, month, day, hour, minutes, seconds. How can we check whether any of these dates corresponds to September 2012? We could add a logical check: starttime needs to be greater than or equal to 1 September 2012 and strictly smaller than 1 October 2012, and that will work. As an alternative there is a nice function we can use: date_trunc('month', starttime). What does it do? As the name suggests, it truncates the date to the granularity we choose, so every timestamp is reduced to the very first moment of the month in which it occurs; it cuts the date, removing some information and reducing the granularity. I could of course use other values here, such as 'day', in which case every timestamp would be reduced to its day, but I actually want 'month'. Now that I have this, I can set an equality and say that I want the truncated value to equal 1 September 2012; that works, and I think it is also nicer than the range we wrote before. I have copied the code from the previous exercise, because it is pretty similar: after sourcing the bookings I insert a filter to isolate our time range, using this condition directly, and then I change the ordering, because here I need to order by the total slots. To summarize: I take the bookings table, truncate the starttime timestamp because I am only interested in its month, make sure the month is the one I need, group by facid, sum all the slots within each group, and finally order by that column.
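A sketch of this query with both filtering styles described above:

```sql
-- Range filter on the raw timestamps.
SELECT facid, SUM(slots) AS "Total Slots"
FROM cd.bookings
WHERE starttime >= '2012-09-01' AND starttime < '2012-10-01'
GROUP BY facid
ORDER BY SUM(slots);

-- Equivalent filter using date_trunc.
SELECT facid, SUM(slots) AS "Total Slots"
FROM cd.bookings
WHERE date_trunc('month', starttime) = '2012-09-01'
GROUP BY facid
ORDER BY SUM(slots);
```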
List the total slots booked per facility per month in the year 2012. Again our data is in bookings, and now we want to isolate the year 2012. Looking once more at the starttime column, we could reuse date_trunc: date_trunc('year', starttime) would give us the year resolution, and we could check that it equals 2012-01-01, which would work. But there is a better way: we can write extract(year from starttime), which gives us a number that represents the year, and then it is easy to test that it equals 2012. Note that extract is a different function from date_trunc: extract pulls the year out and outputs it as a plain number, whereas date_trunc still outputs a timestamp (or a date), just with lower granularity, so you use one or the other according to your needs. To proceed with the query, we take cd.bookings and put this expression in a filter, requiring the year to be 2012; that takes care of isolating the desired time period. Next we want the total slots within groups defined by facid and month, so that, as in the expected results, we can say that facility 0 booked 170 slots in July 2012. This means grouping by multiple values. Facid is easy, we have it; the month we do not have, so we extract it from starttime with the same extract function, this time asking for the month, which it outputs as a number. The thing is that I can group not only by column names but also by transformations of columns, and it works just as well: SQL computes the expression for each row and groups by the resulting value. When it comes to selecting the columns, what I usually do after grouping is show the columns I grouped by, so I copy the same expression into the SELECT list and label it month. Then, within the groups defined by these two values, I use the aggregation we saw in the previous exercise, summing the slots to get the total slots, and I order by facid and month, which gives the data we needed. What did we learn with this exercise? To use the extract function to get a number out of a date; that grouping by multiple columns simply defines a group as each combination of unique values of two or more columns; and that you can group not only by a column name but also by an expression, which you should then reference again in the SELECT statement to display the value that was computed.

Find the count of members who have made at least one booking. Where is the data we need? In the bookings table: every booking carries the ID of the member who made it, so I can select that column and run a count on it, which returns the number of non-null values. But as you can see, this count is quite inflated: a single member can make any number of bookings, and we are basically counting all the bookings. If I put DISTINCT in there, I only count the unique memid values in the bookings table, which gives me the total number of members who have made at least one booking. So COUNT gets you the count of non-null values, and COUNT(DISTINCT ...) gets you the count of unique non-null values.
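Sketches of the two queries just built:

```sql
-- Total slots per facility per month in 2012; we group by an expression and repeat it in the SELECT.
SELECT facid,
       EXTRACT(MONTH FROM starttime) AS month,
       SUM(slots) AS "Total Slots"
FROM cd.bookings
WHERE EXTRACT(YEAR FROM starttime) = 2012
GROUP BY facid, EXTRACT(MONTH FROM starttime)
ORDER BY facid, month;

-- Members with at least one booking: DISTINCT collapses repeat bookings by the same member.
SELECT COUNT(DISTINCT memid) FROM cd.bookings;
```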
List the facilities with more than 1,000 slots booked. What do we need to do here? We need to look at each facility and how many slots it has booked, and once again the data is in the bookings table. I don't need any filter, so there is no WHERE clause, but I do need to count the total slots within each facility, so I need a GROUP BY. I group by facid, select facid, and get the total slots with SUM(slots), which I call "Total Slots" (double quotes for a column name). Now I need to add the filter: I want to keep the facilities whose sum of slots is bigger than 1,000, and I cannot do it in a WHERE clause; if I tried, I would get the error that aggregate functions are not allowed in WHERE. Looking at the map, we have been through this: the WHERE runs first, right after we source the data, whereas aggregations happen later, so the WHERE cannot be aware of any aggregation I have computed. For this purpose we have the HAVING clause. HAVING works just like WHERE: it is a filter that drops rows based on logical conditions; the difference is that HAVING runs after the aggregations and works on them. So I get the data, do my first filtering, then do the grouping, compute an aggregation, and then I can filter again based on the result of that aggregation. I go to my query, take the condition, write HAVING instead of WHERE, place it after the GROUP BY, add ORDER BY facid, and we get our result.
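A sketch of the HAVING query just described:

```sql
-- HAVING filters on the result of the aggregation, after the GROUP BY has run.
SELECT facid, SUM(slots) AS "Total Slots"
FROM cd.bookings
GROUP BY facid
HAVING SUM(slots) > 1000
ORDER BY facid;
```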
Find the total revenue of each facility. We want a list of facilities by name along with their total revenue. First question, as always: where is my data? The facility name is in the facilities table, but to calculate the revenue I need to know about the bookings, so I actually need to join both tables: FROM cd.bookings JOIN cd.facilities ON the facility ID. Next I want the total revenue of the facilities, but I don't even have a revenue column yet, so my first priority is to compute the revenue. Let us first select the facility name; the revenue of each booking will be something like cost times slots. However, I don't have a single cost value, I have two, membercost and guestcost, and as you remember from previous exercises I need to choose which one applies each time by looking at the member ID: if it is zero, the booking was made by a guest and I use the guest cost, otherwise I use the member cost. What can we use to choose between these two variants for each booking? A CASE expression. I write CASE, immediately close it with END, and say: WHEN the booking's memid equals zero THEN the facility's guestcost ELSE the facility's membercost (after a join I always qualify each column with the table it comes from, to avoid confusion). This lets me pick the cost dynamically, choosing between two columns, and I multiply it by slots to get the revenue. If I run this I get the facility name and the revenue, but I need to ask myself at what level I am working, in other words what each row represents. Having joined bookings and facilities but not having grouped anything, each row still represents a single booking. To find the total revenue for each facility I now need an aggregation: group by the facility name and sum all the revenue, which I can do within the same query by adding GROUP BY on the facility name.

If I run that as is, I get an error. Can you figure out why? I have grouped by the facility name and I am selecting the facility name, and that works, because the column has been squished down to the unique names. But I am also selecting another column, the revenue, which I have not compressed in any way, so it has a different number of rows. The general rule of grouping is that after you group by one or more columns, you can select only the columns that are in the grouping, plus aggregations; nothing else is allowed. The facility name is fine because it is in the grouping; the revenue expression is not, because it is neither in the grouping nor an aggregation. To solve this I simply turn it into an aggregation by wrapping it in SUM, and when I run it, it works; all that is left is to sort with ORDER BY revenue to get the result I need. There are a few things going on here, but I can understand them by looking at the map: first I source the data, in this case joining two tables to create the table where my data lives; then I group by a column, the facility name, which compresses it to the unique names; and next I run the aggregation. The aggregation can be a sum over an existing column, but as we saw in the mental models course it can also be a sum over a calculation; you can run logic inside it, which is very flexible. If I had a revenue column I would just say SUM(revenue) AS revenue and it would be simpler, but here I need to put some logic in there, the logic that chooses between guestcost and membercost, and I am perfectly able to put it inside the SUM: SQL first evaluates that logic for each row, then sums up the results, giving me the revenue. Finally, after computing the aggregation, I select the columns I need and do the ORDER BY at the end.
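For reference, a sketch of this total-revenue query; the bks/facs aliases are my own shorthand, and the column names assume the cd schema:

```sql
-- The CASE picks guest or member cost per booking; SUM totals the per-booking revenue per facility.
SELECT facs.name,
       SUM(bks.slots * CASE WHEN bks.memid = 0
                            THEN facs.guestcost
                            ELSE facs.membercost END) AS revenue
FROM cd.bookings bks
JOIN cd.facilities facs ON bks.facid = facs.facid
GROUP BY facs.name
ORDER BY revenue;
```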
Find facilities with a total revenue of less than 1,000. The question is pretty clear, but wait a second: we calculated the total revenue by facility in the previous exercise, so we can probably just adapt that code. Here is the code from that exercise (check it out if you want to see how it was written), and if I run it I do indeed get the total revenue for each facility. Now I just need to keep those with revenue under 1,000. How can I do that? It is a filter on the revenue column. I cannot use a WHERE filter, because the revenue is an aggregation, computed after the WHERE and after the GROUP BY, so the WHERE wouldn't be aware of that column. But as we have seen, there is a clause called HAVING that does the same job as WHERE, filtering on logical conditions, except that it works on aggregations. So I could say HAVING revenue < 1000. Unfortunately this doesn't work; can you figure out why? In our query we do a grouping, then compute an aggregation, then give it a label, and then try to run a HAVING filter on that label. Looking at the map of the logical order of SQL operations: here is where the GROUP BY happens, here is where we compute our aggregation, and here is where HAVING runs; HAVING is trying to use the alias, but the alias is only assigned at the SELECT step, which hasn't happened yet, so by our rules HAVING does not know about it.

As the discussion for this exercise says, there are in fact database systems that try to make your life easier by allowing you to use labels in HAVING, but that is not the case with PostgreSQL, so we need a slightly different solution. Note that if I repeat all of the revenue logic inside HAVING instead of using the label, it does work: I run it, order by revenue, and get the correct result. Why does it work with the full logic but not with the label? The logic itself happens at the aggregation step, so HAVING is aware of that logic having happened; it is only the alias it cannot see. However, I do not recommend repeating logic like this in your queries: it increases the chance of errors and makes them less elegant and less readable. The simpler solution is to take the original query, put it in round brackets, create a virtual table with a common table expression, call it t1, and then treat t1 like any other table: SELECT everything FROM t1 WHERE revenue is smaller than 1,000, then ORDER BY revenue, and we get the correct answer. To summarize: you can use HAVING to filter on the result of aggregations, but unfortunately in PostgreSQL you cannot use the labels you assign to aggregations inside HAVING. If the aggregation is really small, say a plain SUM(revenue), it is fine to write HAVING SUM(revenue) < 1000; the small repetition is not an issue. But if your aggregation is more complex, as in this case, you don't really want to repeat it, and you are forced to add an extra step to your query, which you can do with a common table expression.
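A sketch of the CTE approach just described:

```sql
-- Wrapping the revenue query in a CTE lets an ordinary WHERE filter on the labelled aggregation.
WITH t1 AS (
    SELECT facs.name,
           SUM(bks.slots * CASE WHEN bks.memid = 0
                                THEN facs.guestcost
                                ELSE facs.membercost END) AS revenue
    FROM cd.bookings bks
    JOIN cd.facilities facs ON bks.facid = facs.facid
    GROUP BY facs.name
)
SELECT *
FROM t1
WHERE revenue < 1000
ORDER BY revenue;
```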
Output the facility ID that has the highest number of slots booked. First of all we need to get the number of slots booked by facility, which we have actually done before, but let's do it again. The data is in the bookings table; we don't need to filter it, but we do need to group by facid, and once we do that we can select facid, which isolates all the unique values of the column, and within each one sum the number of slots, calling it total slots. That gives the total slots for each facility. To get the top one, the quickest solution is really to order by total slots and limit the result to one; but ordering is ascending by default, which would give me the facility with the smallest number of slots, so I turn it into descending, and there is my solution. Given that this simple solution solves the exercise, can you imagine a situation in which it would not achieve what we wanted? Let us say there were multiple facilities with the top number of total slots. The top value in our data set is 1404; say two facilities both had it, and for our business purposes we wanted to see both. Everything else would work correctly, the ordering would work correctly, but inevitably one of them would get the first spot and the other the second, and LIMIT 1 always cuts the output to a single row, so this query would only ever show one facility ID, even if more of them shared the top number of slots.

How can we solve this? Instead of combining ORDER BY and LIMIT, we need to figure out a filter: we need to filter our table so that only the facilities with the top number of slots are returned. But we cannot compute the maximum of the sums in the same query: if I tried HAVING SUM(slots) = MAX(SUM(slots)), I would be told that aggregate function calls cannot be nested. Going back to the map, HAVING can only run after all the aggregations have completed, and what we are trying to do is add a new aggregation inside HAVING, which simply doesn't work. The simplest solution is to wrap all of this in a common table expression, then take the table we have just defined and SELECT * WHERE the total slots equal the maximum number of slots, which we know to be 1404. However, we cannot hardcode the maximum: for one, we might not know what it is, and second, it will change with time, so the query would break when the data changes. What is the alternative to hardcoding? We need some logic to get the maximum value, and we can put that logic inside a subquery that goes back to our table t1 and finds the maximum of total slots. First that subquery runs and gets the maximum, then the filter checks against it, and I get the required result; and it won't break if many facilities share the top spot, because we are using a filter, so all of them will be returned. This is a perfectly good solution.

For your information, you can also solve this with a window function, which is a sort of row-level aggregation that doesn't change the structure of the data (we have seen it in detail in the mental models course). I use a window function to get the maximum over the sum of slots, writing OVER to make it clear that this is a window function, but putting nothing in the window definition, because I just want to look at the whole data set, and I label it max_slots. Looking at the data, every row now carries the maximum, and to get the correct result I add a simple filter, in an outer query wrapped around this one, saying that total slots should equal max_slots, returning only the facility ID and total slots; this also solves the problem. What is interesting to note here, for the sake of understanding window functions more deeply, is that the aggregation function of this window works over an aggregation as well: we sum the slots within each facility, and then the window function takes the maximum of all of those values, which is quite a powerful feature. Looking at the map it makes perfect sense: here is where we group by facid, here is where we compute the aggregation, and the window comes later, so the window is aware of the aggregation and can work on it. A few different solutions here, and overall a really interesting exercise.
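Before moving on, here are sketches of the two tie-safe solutions for this exercise; t1 and max_slots are illustrative names:

```sql
-- CTE plus a scalar subquery that computes the maximum total.
WITH t1 AS (
    SELECT facid, SUM(slots) AS total_slots
    FROM cd.bookings
    GROUP BY facid
)
SELECT *
FROM t1
WHERE total_slots = (SELECT MAX(total_slots) FROM t1);

-- Window-function variant: the empty OVER () window spans the whole grouped result.
SELECT facid, total_slots
FROM (
    SELECT facid,
           SUM(slots) AS total_slots,
           MAX(SUM(slots)) OVER () AS max_slots
    FROM cd.bookings
    GROUP BY facid
) t
WHERE total_slots = max_slots;
```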
List the total slots booked per facility per month, part two. This is a bit of a complex query, and the easiest way to get it is to look at the expected results. We see a facility ID, and within each month of the year 2012 we get the total number of slots; at the end of each facility there is a row with a null month, and for facility zero that row shows the sum of all slots booked in 2012. The same pattern repeats for every facility: the total within each month, and then the total for that facility over the year. And if I go to the very end, there is a third level of aggregation, the total for all facilities within that year. So there are three levels of aggregation here, by increasing granularity: the total over the year, then the total by facility over the year, and finally the total by facility by month within that year. This rather breaks the mould of what SQL usually does, in the sense that SQL is not designed to return a single result with multiple levels of aggregation, so we will need to be a bit creative.

Let us start with the lowest level of granularity, facility ID and month, and build on top of that. The table I need is bookings. First question: do I need to filter it? Yes, because I am only interested in the year 2012. We have seen that the extract function can get the year out of a timestamp, here starttime, and we can use it in a WHERE filter: it goes to the timestamp, gets a number out of it, and we check that it is the year we are interested in. A quick sanity check shows some bookings, all in the year 2012. Next I need to define my grouping: I need to group by facid, but I also need to group by month, and I don't have a column named month in this table, so I calculate it, once again with extract: extract(month from starttime), which for the first row spits out 7. As you know, in the GROUP BY I can put a column, but I can also put an expression over a column, which works just as well. After grouping I cannot do SELECT * any more, but I want to see the columns I grouped by, so another quick sanity check shows the facid and the month, which I label month. Next I simply take the sum over the slots within each facility and each month, and I have my first level of granularity; the first row corresponds to the expected result.

Now I need to add the next level of granularity, the total within each facility. Can you think of how to add it? The key insight is to look at the expected results table and see it as multiple tables stacked on top of each other: one table is the one we have here, the total by facility and month; a second table is the total by facility; and a third is the overall total, which you can see at the bottom. How can we stack multiple tables on top of each other? With a UNION, which stacks all the rows of the tables. So let us now compute the table with the total by facility: I copy-paste what I have and just remove one level of grouping, no longer grouping by month.
Once I do this and union the two queries, I get an error: a UNION query must have the same number of columns. Do you understand the error? Let me write a little schema to show what is happening when we union two tables. The first table, in our case, has facid, month and slots; the second has facid and slots. When you union two tables, SQL assumes you have the same number of columns and that the ordering is identical. Here we fail because the first table has three columns and the second only two; and besides the mismatch in numbers, we would also be mixing the values of month and slots. That might not raise a complaint, because both are integers, so SQL won't necessarily object, but it is logically wrong. So we need to make sure that when we union these two tables we have the same number of columns, in the same order. How, given that the second table does indeed have one column less, one piece of information less? I can put NULL there: SELECT NULL creates a constant column of all NULLs, and the structure now lines up. When I union, first of all I have the same number of columns, so the error goes away; and second, facid gets combined with facid, slots with slots, and month with NULL, which is what we want, because in some rows we have an actual month and in some we don't. With the NULL added and the tables unioned, the query runs with no error, and this is what I want: I can tell that a row came from the second table because it has NULL in the month and shows the total slots for a facility across every month, whereas a row from the upper table shows the sum of slots for a facility within a certain month. This achieves the desired result.

Next we compute the last level of granularity, the overall total. Once again I copy my query, and I don't even need to group by any more, because it is the total number of slots over the whole year, so I simply say SUM(slots) AS slots and remove the grouping; then I add another UNION so I can keep stacking the tables. Running this gives the same error as before: going back to our little schema, the third table only has slots, so there is a mismatch in the number of columns. The solution is again to add NULL columns, two of them this time, so that the column counts match and slots gets combined with slots while everything else is filled with NULLs; I just need to make sure the ordering is correct, so I select NULL, NULL, and then the sum of slots. Running the query, the result works. The final step is the ordering, sorted by ID and month, so at the end of all these unions I say ORDER BY the facility ID and the month, and I finally get my result: the combination of three different tables stacked on top of each other, each showing a different level of granularity.
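A sketch of the full stacked query as described; the NULL placeholders keep the column counts aligned across the three SELECTs:

```sql
SELECT facid, EXTRACT(MONTH FROM starttime) AS month, SUM(slots) AS slots
FROM cd.bookings
WHERE EXTRACT(YEAR FROM starttime) = 2012
GROUP BY facid, EXTRACT(MONTH FROM starttime)
UNION
SELECT facid, NULL, SUM(slots)          -- per-facility total for the year
FROM cd.bookings
WHERE EXTRACT(YEAR FROM starttime) = 2012
GROUP BY facid
UNION
SELECT NULL, NULL, SUM(slots)           -- grand total for the year
FROM cd.bookings
WHERE EXTRACT(YEAR FROM starttime) = 2012
ORDER BY facid, month;
```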
As you can see in the little schema, we added NULL columns to two of these tables just to make sure they have the same number of columns and can stack up correctly. Looking again at the whole query, there are actually three SELECT statements in it, meaning three tables that are calculated and then finally stacked with UNION, and each of them does a pretty straightforward aggregation: the first aggregates by facid and month, after extracting the month; the second simply aggregates by facid; and the third gets the sum of slots over the whole data, without any grouping. The constant NULL columns are added to make the column counts match. It is also worth seeing this on our map of SQL operations: the order repeats for every table. For each of our three tables we get the data, run a filter to keep the year 2012, do a grouping, compute an aggregation, and select the columns we need, adding NULL columns when necessary; then the same process repeats for the second table and for the third, except that the third doesn't group by. When all three tables are done, the UNION runs and stacks them all up, so instead of three tables I have only one, and only after the UNION has run can I finally order my table and return the result.

List the total hours booked per named facility. We want the facility ID, the facility name and the total hours each facility has been booked, keeping in mind that what we have is the number of slots for each booking, and a slot represents 30 minutes. To get my data I need both the bookings table and the facilities table, because I need the information on the bookings and the facility name, so I get both tables and join them. Next, I don't really need to filter on anything, but I need to group by facility, so I group by facid and also by the facility name, otherwise I won't be able to use the name in the SELECT part; now I can select those two columns. To get the total hours I need the sum of the slots within each facility, divided by two. Superficially this looks correct, but there is actually a pitfall here, and to see it I also select the raw SUM(slots) before dividing: already in the first row you can see that the division has silently lost its fractional part. What is happening is that in PostgreSQL, when you take an integer number, such as the sum of the slots, and divide it by another integer, it assumes you are doing integer division, and since you are dividing two integers it returns an integer as well, so the result is not exact if you are thinking in decimal numbers. The solution is that at least one of the two numbers needs to be a decimal, so we turn 2 into 2.0, and if I run this I now get the correct result. It is important to be careful with integer division in PostgreSQL; it is a potential pitfall. Now I need to trim the string of zeros after the decimal point, which means I need some sort of rounding, and for that I can use the round function.
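A sketch of the hours-per-facility query; rounding to two decimal places here is my illustrative choice, since the walkthrough doesn't pin down the exact precision:

```sql
-- Dividing by 2.0 (not 2) forces decimal division instead of integer division.
SELECT facs.facid, facs.name,
       ROUND(SUM(bks.slots) / 2.0, 2) AS "Total Hours"
FROM cd.bookings bks
JOIN cd.facilities facs ON bks.facid = facs.facid
GROUP BY facs.facid, facs.name
ORDER BY facs.facid;
```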
round() is a typical SQL function, and it works like this: it takes two arguments, the first being a column (here, the whole division expression) and the second being how many digits you want to see after the decimal point. Now I can clean this up a bit, label it total hours, and order by facid, and I get my result. Nothing crazy here, really: we source our data from a join, group by two columns, select those columns, sum over the slots and divide, making sure not to fall into integer division by turning one of the numbers into a decimal, and round the resulting column.

List each member's first booking after September 1st 2012. Where does our data live? We need data about the members and data about their bookings, so it is in the members and bookings tables, and I quickly join them. Do we need a filter on our data? Yes, because we only want to look after September 1st 2012, so we can say WHERE starttime is greater than that date, and it should be enough to just provide the date like this. In the result we need the member's surname, first name and member ID, and then we need to see the first booking in our data, meaning the earliest time, so again we have an aggregation here. To implement it, I need to group by all of the columns I want to output, so surname, first name and member ID, and now that I have grouped by these columns I can select them. Within each member I now have all the dates of all their bookings after September 1st 2012; how can I look at all these dates and get the earliest one, and what type of aggregation do I need? I can use the MIN aggregation, which looks at all of the dates and compresses them to a single one, the smallest date, and I call it starttime. Finally I order by member ID and I get the result I needed. This is actually quite straightforward: I get my data by joining two tables, make sure I only have the data I need by filtering on the time period, group by all the information I want to see for each member, and then within each member use MIN to get the smallest, meaning the earliest, date.

Now I wanted to give you a bit of insight into the subtleties of how SQL compares timestamps and dates, because the results can be a bit surprising. I wrote three logical expressions for you, and your job is to guess whether each of them is true or false. On one side we have a timestamp that indicates the 1st of September at 08:00, whereas on the other side we have simply the date, the 1st of September; the values are the same in all three, but my question is: are they equal, is the first greater, or is it smaller? I think the intuitive answer is to say that in the first case we have September 1st on one side and September 1st on the other, the same day, so this ought to be true; in the second we again have the same day on both sides, so the timestamp is not strictly bigger than the date, and this should be false; and it is also not strictly smaller, so the third should be false as well. Now let's run the query and see what actually happens.
What we see is that the one we thought would be true is actually false, the one we thought would be false is actually true, and the last one is indeed false. Are you surprised by this result, or is it what you expected? If you are surprised, can you figure out what is going on? What is happening is that we are running a comparison between two expressions with different levels of granularity: the one on the left shows day, hour, minutes and seconds, and the one on the right shows only the date. In other words, the value on the left is a timestamp, the value on the right is a date: different levels of precision. To make the comparison work, SQL needs to convert one into the other; it needs to do something technically known as implicit type coercion. What does that mean? The type is the data type, here either timestamp or date; type coercion is when you take a value and convert it to a different type; and it is implicit because we haven't asked for it, SQL has to do it on its own, behind the scenes. How does SQL choose which one to convert into the other? The choice is: keep the one with the highest precision and convert the other. We have the timestamp, with higher precision, on the left, so the date on the right is converted into a timestamp. To convert a date to a timestamp, SQL pads the time with zeros, so the value basically represents the very first second of the day of September 1st 2012.

We can verify this. I comment out that line and add another logical expression: a timestamp written with all zeros in the time part, set equal to the date. What do we expect? We have two different types, so there will be a type coercion, and SQL will take the value on the right and turn it into exactly the value on the left, so checking whether they are equal should give true. It does turn out to be true, but I need one extra step, which is to convert the left-hand value explicitly to a timestamp, and after I do that I get what I expected, namely that this comparison is true. This ::timestamp notation in PostgreSQL performs the type coercion, taking the value and forcing it into a timestamp. I'll be honest with you: I don't understand exactly why I need to do this here; I thought it would work with just the literal, but I also need to tell SQL explicitly that I want it to be a timestamp. Nonetheless, this is the insight we needed, and it lets us understand why the original comparison is actually false: we are comparing a timestamp for the very first second of September 1st with a timestamp for the first second of the eighth hour of September 1st, so the equality fails. We can also see why, on the second line, the left side is bigger than the right-hand side; and the third one did not actually fool us, so we are good with that. Long story short: if you are just getting started, you might not know that SQL does this implicit type coercion in the background, and this kind of date comparison might leave you quite confused.

Now I have cleaned the code up a bit, and the question is: what do we need to change so that the comparisons match our initial intuition, so that the first line is true, the second line is false, and the third one, which already behaves correctly, stays false?
Well, since the implicit coercion turns the date into a timestamp, we actually want to do the opposite: we want to turn the timestamp into a date. It is enough to do the type coercion ourselves and cast the timestamp down to a date, and when I run the new query I get exactly what I expected, because I am now comparing at the level of precision, or granularity, that I wanted: I am only looking at the date. I hope this wasn't too confusing, that it was a bit insightful, and that you have a new appreciation for the complexities that can arise when you work with dates and timestamps in SQL.

Produce a list of member names, with each row containing the total member count. Let's look at the expected results: we have the first name and surname of each member, and then every single row shows the total count of members; there are 31 members in our table. If I want the total count of members, I can take the members table and select COUNT(*), which gives me 31; but I cannot just add firstname and surname to this, I would get an error, because COUNT(*) is an aggregation: it takes all 31 rows and produces a single number, 31, while I am not aggregating firstname and surname. The standard aggregation doesn't work here; I need an aggregation that doesn't change the structure of my table and works at the level of the row, and for that I can use a window function. A window function looks like an aggregation followed by the keyword OVER and then the definition of the window. If I do this I get the count at the level of the row, and after changing the column order a bit to match the expected results, I get what I wanted. A window function has these two main components, an aggregation and a window definition; in this case the aggregation counts the number of rows, and the window definition is empty, meaning our window is the entire table, so the aggregation is computed over the entire table and then added to each row. There are far more details about window functions and how they work in my mental models course.

Produce a numbered list of members ordered by their date of joining. I take the members table and select the first name and surname, and to produce a numbered list I can use a window function with the row_number aggregation: row_number() OVER (...). row_number is a special aggregation that works only in window functions; it numbers the rows monotonically, giving each one a number starting from one and counting up, and it never assigns the same number to two rows. In the window you need to define the ordering used for the numbering, which in this case is given by the join date, ascending by default, which is good. We can call this column row number, and we get the results we wanted; again, you can find a longer explanation of this, with much more detail about window functions and row_number, in the mental models course.
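Sketches of the date cast and of the two window-function queries just described:

```sql
-- Casting the timestamp down to a date makes the comparison behave as intuition suggests.
SELECT timestamp '2012-09-01 08:00:00' :: date = date '2012-09-01';   -- true

-- Total member count repeated on every row, via a window over the entire table.
SELECT COUNT(*) OVER (), firstname, surname
FROM cd.members;

-- Numbered list of members, ordered by their join date.
SELECT ROW_NUMBER() OVER (ORDER BY joindate) AS row_number,
       firstname, surname
FROM cd.members;
```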
Output the facility ID that has the highest number of slots booked, again. We have already solved this problem in a few different ways; let's see a new one. We go to our bookings table and group by facid, then get facid in our SELECT and sum the slots to get the total slots booked for each facility. Since we are dealing with window functions, we can also rank the facilities based on the total slots they have booked, which looks like rank() OVER (ORDER BY SUM(slots) DESC), and we can call it rk for rank. If I order the output by the sum of slots descending, I can see that my rank works as intended. We have seen this in the mental models course: you can think of rank as deciding the outcome of a race, so the one who did the most gets rank one and everyone else gets ranks two, three, four; but if two candidates got the same highest score, they would both get rank one, because they would both have won the race, so to speak. The rank here is defined over a window ordered by the sum of slots descending, which is what we need. Next, to get all the facilities with the highest score, we can wrap this into a common table expression, then take that table, select the facility ID, label the sum column total, get the total, and filter for the rows where the ranking equals one, and we get our result. Aside from how rank works, the other thing to note in this exercise is that we can define the window based on an aggregation: in this case we order the elements of our window based on the sum of slots. Looking at our map, we get the data, we have our GROUP BY, we have the aggregation, and then we have the window; the window follows the aggregation, so based on our rules the window has access to the aggregation and is able to use it.
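A sketch of this rank-based solution; t1, total and rk are illustrative names:

```sql
WITH t1 AS (
    SELECT facid,
           SUM(slots) AS total,
           RANK() OVER (ORDER BY SUM(slots) DESC) AS rk   -- the window orders by an aggregation
    FROM cd.bookings
    GROUP BY facid
)
SELECT facid, total
FROM t1
WHERE rk = 1;
```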
Rank members by rounded hours used. The expected results are quite straightforward: we have the first name and surname of each member, the total hours they have used, and then we rank them based on that. Where is the information for this result? In the members and bookings tables, so we will need to join the two on the member ID. Now we need the total hours, so we group by the first name and also by the surname, because we will want to display it, and we select those two columns. To compute the total hours: for each member we know the slots of every booking, so we sum up all those slots, and since every slot represents a 30-minute interval, we divide the value by two to get hours. And remember, if I take an integer like the sum of slots and divide it by two, which is also an integer, I get integer division, losing the part after the decimal point, which is not what I want, so instead of dividing by 2 I divide by 2.0. Checking how the data looks, it is looking good, but if we read the question again, we want to round to the nearest ten hours: 19 should become 20, 115 should become 120 (I believe we round up at 5), and so on, as you can see in the expected result. How can we do this rounding? We have a nifty round function: the first argument is the column with all the values, and in the second argument we specify how we want the rounding; to round to the nearest ten you put -1 there. Let's keep displaying the raw total hours as well as the rounded value to make sure we are doing it correctly, and as you can see we are indeed rounding to the nearest ten, so this is looking good. To explain why I used minus one and how the rounding function works in general, I will add a small section when we are done with this exercise.

Meanwhile, let's finish the exercise. I now want to rank all of my rows based on this value I have computed, and since it is an aggregation, it will already be available to a window function: in the logical order of operations, aggregations happen first and windows happen afterwards, with access to the aggregated data, so it should be possible to turn this into a window function. Think for a moment about how we could do that. The window function has its own aggregation, which in this case is a simple rank, and then we have the OVER part, which defines the window. What do we want to put in our window? We want to order by, let's say, our rounded hours, and we want to order descending, because we want the member with the most hours to have the best rank. But clearly we don't have a column called rounded hours; what we have is the logic, so I substitute that name with my actual logic and I get my actual rank. Now I can delete the extra column I was just looking at and sort by rank, surname and first name. Small error here: I actually do need to show the hours as well, so I take the same logic once more and call it hours, and I finally get my result. To summarize what we are doing in this exercise: we get our data by joining the two tables, group by the first name and surname of the member, sum over the slots for each member, divide by 2.0 to make sure we have exact division, and use the rounding function to round to the nearest ten hours, which gives us the hours; and we use the same logic inside a window function to build a ranking such that the member with the most hours gets rank one, the one with the second most gets rank two, and so on, as you can see in the result. I am perfectly able to use this logic to define the ordering of my window because window functions can use aggregations, as seen in the logical order of SQL operations: window functions occur after aggregations. And that's it; then we just order by the required values and get our results.
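A sketch of the full ranking query as built in this walkthrough (grouping by first name and surname, as the speaker does):

```sql
SELECT mems.firstname, mems.surname,
       ROUND(SUM(bks.slots) / 2.0, -1) AS hours,                            -- nearest ten hours
       RANK() OVER (ORDER BY ROUND(SUM(bks.slots) / 2.0, -1) DESC) AS rank  -- same logic, reused
FROM cd.members mems
JOIN cd.bookings bks ON mems.memid = bks.memid
GROUP BY mems.firstname, mems.surname
ORDER BY rank, surname, firstname;
```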
Now, here's a brief overview of how rounding works in SQL. Rounding is a function that takes a number and returns an approximation of it, which is usually easier to parse and easier to read. The round function works like this: the first argument is a value, which can be a constant, as in this case, where we just have a number, or a column, in which case the rounding is applied to every element of the column; the second argument specifies how we want the rounding to occur. Here you can see the number we start from. The first rounding we apply has an argument of 2, which means we just want to see two digits after the decimal point, and whether we round down or up depends on the dropped digits: five or greater rounds up, smaller than five rounds down. In this first example the last digit is a two, less than five, so we just get rid of it; then we have an eight, greater than five, so we have to round up, and when we round up the 79 becomes an 80, which is how we get the first rounded value. Then we have round with an argument of 1, which leaves one place after the decimal point and gives the next result; and then round without any argument, which is the same as providing an argument of zero, meaning we just want to see the whole number.

What is interesting to note is that the rounding function can be generalized to keep going even after we have got rid of the whole decimal part, by providing negative arguments. round with an argument of -1 means rounding the number to the nearest ten, so a value of roughly 48,292 ends up at 48,290. Rounding with -2 means going to the nearest hundred: we have about 290, and the nearest hundred is 300, so we round up to 48,300. -3 means rounding to the nearest thousand: we have roughly 48,300, and the nearest thousand is 48,000. -4 means the nearest ten thousand, so given that we have about 48,000 the nearest ten thousand is 50,000. And finally round with -5 means rounding to the nearest hundred thousand, which for this number is actually zero; and from there on, as we keep going more negative, we will always get zero. So this is how rounding works, in brief. It is a pretty useful function, and not everyone knows that you can give it negative arguments; actually I didn't know either, and when I did the first version of this course a commenter pointed it out, so shout out to him (I don't know if he wants me to say his name), and hopefully now you understand how rounding works and can use it in your own problems.
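A quick illustration of these rounding arguments; the starting value is a made-up number in the same ballpark as the one used on screen:

```sql
SELECT round(48292.79, 2),    -- 48292.79  (two decimal places)
       round(48292.79, 1),    -- 48292.8
       round(48292.79),       -- 48293     (same as an argument of 0)
       round(48292.79, -1),   -- 48290     (nearest ten)
       round(48292.79, -2),   -- 48300     (nearest hundred)
       round(48292.79, -3),   -- 48000     (nearest thousand)
       round(48292.79, -4),   -- 50000     (nearest ten thousand)
       round(48292.79, -5);   -- 0         (nearest hundred thousand)
```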
Find the top three revenue-generating facilities. We want a list of the facilities that have the top three revenues, including ties, and this is important. Looking at the expected results we simply have the facility name and, as a bit of a giveaway of what we will need to use, the rank of each facility. There is an exercise we did a while back, find the total revenue of each facility, and I have taken the code from it that gets us to the point where we see the name of the facility and its total revenue; you can go back to that exercise to see in detail how the code works, but in brief we join the bookings and facilities tables, group by facility name, and within each booking compute the revenue by taking the slots and using a CASE WHEN to choose between guest cost and member cost; then, since we grouped by facility, we sum all of those revenues to get the total revenue of each facility. Given this partial result, all that is left is to rank the facilities by their revenue, and for that I need a window function. Why is rank the right function, even though the expected results sort of gave it away? Because if you want the facilities with the top revenues including ties, you can think of it as a race: all facilities are racing to reach the top revenue, and if two or three or four facilities share that top revenue, you can't arbitrarily say one is first and another is second; you have to give them all rank one and recognise that they are all first. These types of problems call for a ranking solution.

So our window function uses rank as its aggregation, and then we need to define our window, which is where we define the ordering for the ranking: we can say ORDER BY revenue DESC, so that the highest revenue gets rank one, the next highest rank two, and so on. Except that this will not work, because I don't have a revenue column; I do have something labelled revenue, but the ranking part is not aware of that label. I do, however, have the logic that computes the revenue, so I can take that logic and paste it into the window, adding a comma. This is not the most elegant-looking code, but let's see if it works: ordering by revenue descending to see it in action, you can in fact see that the facility with the highest revenue gets rank one and it goes down from there. Now I just need to clean this up a bit: first I remove the revenue column, then I remove the ordering, and what I need for the result is to keep only the facilities with a rank of three or smaller, ranks one, two and three. There is actually no way to do that in this query, so I have to wrap it in a common table expression, then take that table and say SELECT everything FROM t1 WHERE rank is smaller than or equal to three, ordering by rank ascending, and I get the result I needed.

So what happened here? We built upon the logic for getting the total revenue of each facility, which we saw in a previous exercise, and we added a rank window function whose window orders by that total revenue. This might look a bit complex, but remember that when many operations are nested, you always start with the innermost operation and work your way up. The innermost operation is a CASE WHEN that chooses between guest cost and member cost and multiplies it by slots, calculating the revenue of each single booking; the next operation is an aggregation that takes those per-booking revenues and sums them up into the total revenue of each facility; and the outermost operation takes the total revenue of each facility and orders the facilities in descending order to figure out the ranking. The reason all of this works can be seen on our map of SQL operations: after getting the table, the first thing that happens is the GROUP BY, then the aggregations, which is where we sum up the revenue, and after the aggregation is completed we have the window function, so the window function has access to the aggregations and can use them when defining the window. Finally, after we get the ranking, we have no way of isolating only the first three ranks within this query, so we need to do it with a common table expression; and looking back at the map, this makes sense, because the components we have for filtering rows are the WHERE, which happens very early, and the HAVING, and both happen before the window, so after the window function there is no further filter and you need to use a common table expression.
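A sketch of the finished query for this exercise:

```sql
WITH t1 AS (
    SELECT facs.name,
           RANK() OVER (
               ORDER BY SUM(bks.slots * CASE WHEN bks.memid = 0
                                             THEN facs.guestcost
                                             ELSE facs.membercost END) DESC
           ) AS rank
    FROM cd.bookings bks
    JOIN cd.facilities facs ON bks.facid = facs.facid
    GROUP BY facs.name
)
SELECT *
FROM t1
WHERE rank <= 3
ORDER BY rank;
```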
Classify facilities by value. We want to classify facilities into equally sized groups of high, average and low, based on their revenue, and in the expected result you can see each facility classified as high, average or low. The point is that we decided at the beginning that we want three groups; this is arbitrary, we could have said two groups, or five, or six, or seven, but we have three, and all the facilities are then distributed equally among these groups: because we have nine facilities, we get three facilities in each group. I can already tell you that there is a special function that will do this for us, so we will not go through the trouble of implementing it manually, which could be pretty complex. I have copied here the code that gets me the total revenue for each facility; we have seen this code more than once in past exercises, so if you are still not clear on how we get to this point, check those out. What we did in the previous exercise was rank the facilities based on revenue, taking the rank window function and defining our window as ORDER BY revenue DESC, except that we don't have a revenue column, but we do have the logic that computes the revenue, so we paste that logic in, and running it gives a rank for each facility, where the biggest revenue gets rank one and it goes on from there. The whole trick to solving this exercise is to replace the rank aggregation with an ntile aggregation, giving it the number of groups into which we want to divide our facilities. If I run this, you see that I get what I need: the facilities have been equally distributed into three groups, where group number one has the facilities with the highest revenue, then group number two, and finally group number three with the facilities with the lowest revenue.

To see how this function works, I simply go to Google and search for "postgres ntile"; one of the first links is the PostgreSQL documentation, on the page for window functions. Scrolling down, I can see all the functions that can be used in window functions, and you will recognise some of our old friends, row_number, rank, dense_rank, and here we have ntile. What the documentation says is that ntile returns an integer ranging from 1 to the argument value, which is what we pass in, the number of buckets, dividing the partition as equally as possible. So we call the ntile function, provide how many buckets we want to divide our data into, and the function divides the data as equally as possible into those buckets; how the division takes place depends on the definition of the window, and in this case we are ordering by revenue descending. That is how the ntile function works. Now we just need to clean this up a bit: I remove the revenue part, because that is not required from us, and I call the ntile column simply ntile. Next I need to put a label on top of the ntile value, as you can see in the results, so I wrap this into a common table expression (and once I have a CTE I don't need the ordering any more), and then I select from the table I have just defined. What do I want from it? The name of the facility, and the ntile value with a label on top of it, for which I use a CASE WHEN: when ntile equals 1 then 'high', when ntile equals 2 then 'average', else 'low', end the CASE, and call the column revenue. Finally I order by ntile, so the results show high first, then average, then low, and also by facility name, and I get the result I wanted. To summarize: this is just like the previous exercise, except that we use a different window function, ntile instead of rank, so that we can bucket our data. As we said in the previous exercise, there are a few nested operations in the window, and you can figure them out by going to the deepest one and moving upwards: the first picks guest cost or member cost, multiplies it by slots, and gets the revenue of each single booking; the next aggregates on top of this within each facility, giving the total revenue by facility; and then we order by revenue descending, which defines our window and is what the bucketing uses to distribute the facilities into the buckets based on their revenue. Finally we need to add another layer of logic with a common table expression, so that we can attach the required text labels to our ntile values.
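A sketch of the classification query; the ntile alias and the 'high'/'average'/'low' labels follow the walkthrough:

```sql
WITH t1 AS (
    SELECT facs.name,
           NTILE(3) OVER (
               ORDER BY SUM(bks.slots * CASE WHEN bks.memid = 0
                                             THEN facs.guestcost
                                             ELSE facs.membercost END) DESC
           ) AS ntile
    FROM cd.bookings bks
    JOIN cd.facilities facs ON bks.facid = facs.facid
    GROUP BY facs.name
)
SELECT name,
       CASE WHEN ntile = 1 THEN 'high'
            WHEN ntile = 2 THEN 'average'
            ELSE 'low' END AS revenue
FROM t1
ORDER BY ntile, name;
```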
To summarize, this is just like the previous exercise except that we use a different window function: instead of rank we use ntile, so that we can bucket our data. As we said before, the window contains a few nested operations, and you can figure them out by starting from the deepest one and moving upwards: the innermost expression picks the guest cost or the member cost, multiplies it by slots, and gives the revenue of each single booking; the next level aggregates that within each facility, so we get the total revenue by facility; then we order by that revenue descending, which defines the window, and that ordering is what the bucketing uses to distribute the facilities across the buckets. Finally we add one more layer of logic, a common table expression, so that we can label each bucket with the required text.

The next exercise: calculate the payback time for each facility. This requires some understanding of the business reality the data represents. In the facilities table we have an initial outlay, which is the initial investment that went into building the facility, and a monthly maintenance value, which is what we pay each month to keep the facility running; and of course we can also compute a monthly revenue for each facility. So how do we calculate how long each facility will take to repay its cost of ownership? Let's write it down so we don't lose track of it: we can get the monthly revenue of each facility, but what we are actually interested in is the monthly profit, and to get the profit we subtract the monthly maintenance (revenue minus expenses equals profit). Once we know how much profit a facility makes each month, we take the initial investment and divide it by the monthly profit, and that tells us how many months it will take to repay the initial investment.

Let's do that. Once again I copied the code that calculates the total revenue for each facility; we have seen it in the previous exercises, so check those out if you still have questions. We know we have three complete months of data so far, so getting the monthly revenue is as simple as dividing the total by 3.0; I write 3.0 rather than 3 so we get proper division instead of integer division, and I call the result monthly revenue. The revenue column no longer exists, so the ORDER BY that referenced it has to go. From the monthly revenue I can now subtract the monthly maintenance to get the monthly profit, but that produces an error: monthly maintenance does not appear in the GROUP BY clause. We grouped by facility name, and everything else we selected was an aggregation, which is fine; remember the rule that when you GROUP BY you can only select the grouped columns and aggregations, and monthly maintenance is neither, so we add it to the GROUP BY and now we get the monthly profit.

The last step is to take the initial outlay and divide it by everything we have computed so far, calling the result months, because it is the number of months needed to repay the initial investment. We hit the same issue again, initial outlay is not an aggregation and does not appear in the GROUP BY clause, and the easy solution is to add it to the GROUP BY as well. But now something is clearly wrong: the values look very odd. Looking at the whole calculation, can you see why? The issue is the order of operations. With no round brackets, the initial outlay is divided by the total revenue, that result is divided by 3.0, and only then is the monthly maintenance subtracted. That is not what we want; we want to divide the initial outlay by the whole monthly profit. Adding round brackets around the profit calculation fixes it: everything inside the brackets is evaluated first, giving the monthly profit, and then the initial outlay is divided by that. Finally we order by facility name, and we have the result.

This is quite a representative business problem: calculating revenue, profit, and the time needed to repay an initial investment. Overall it is just a chain of calculations: the per-booking revenue, summed within each facility by the GROUP BY to get the total revenue per facility, divided by three to get the monthly revenue, minus the monthly expenses to get the monthly profit, and finally the initial investment divided by the monthly profit to get the number of months it takes to repay the facility.
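
A hedged sketch of the payback query, again assuming the pgexercises schema and the three complete months of booking data mentioned above:

```sql
-- Months needed for each facility to repay its initial outlay:
-- initial outlay / (monthly revenue - monthly maintenance).
SELECT
    f.name,
    f.initialoutlay /
        ((sum(CASE WHEN b.memid = 0
                   THEN b.slots * f.guestcost
                   ELSE b.slots * f.membercost END) / 3.0)
         - f.monthlymaintenance) AS months
FROM cd.bookings b
JOIN cd.facilities f ON f.facid = b.facid
GROUP BY f.name, f.initialoutlay, f.monthlymaintenance
ORDER BY f.name;
```

The round brackets around the profit expression are exactly the order-of-operations fix described above: without them the division would happen before the maintenance is subtracted.
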
The next exercise: calculate a rolling average of total revenue. For each day in August 2012 we want to see a rolling average of total revenue over the previous 15 days. Rolling averages are quite common in business analytics, and the idea is this: if you look at August 1st, the value on that row is the average daily revenue across all facilities over the past 15 days, including August 1st itself, and then the window rolls, or slides, forward by one day each time, so the next value is the same average shifted by one day to include August 2nd, and so on.

Let's see how to calculate this. I start with the basic code that computes the revenue for each booking, taken from previous exercises (check those out if you have questions). Right now it shows the name of each facility and the revenue of each booking, one row per booking. If you think about it, though, we are not actually interested in the facility name, because we are going to sum over all facilities; what we are interested in is the date on which each booking occurs, because we want to aggregate within each date. To get the date I take the starttime field from bookings, and because it is a timestamp, carrying hours, minutes and seconds, I reduce it to a date. Now each row is still a booking, and for each booking I know the date it occurred on and the revenue it generated.
For the next step I need the total revenue across all facilities within each date, and that is a simple grouping: if I group by the expression that gives me the date, I can select the date, and all the different occurrences of each date are compressed into unique values, one row per date. I then need to compress all the individual revenues for each date into a single value as well, so I put the revenue logic inside a sum() aggregation, as we have done before, and that gives the total revenue across all facilities for each day.

My next question for you: how can I see the global average over all of these daily revenues on every row? That is a row-level aggregation that does not change the structure of the table, which means a window function. So I add an avg() window function and, for now, leave the window definition empty so it looks at the whole table. Note that avg(revenue) will not work, because revenue is just a label I gave to the output column and the window function is not aware of that label; there is no actual revenue column at this point. Instead I copy the revenue logic into the window function, and that works because window functions run after the aggregation has been computed, so they can see it. Now every row shows the global average of the daily revenues.

Next I order the result by date, and then comes the next question: how do we turn this into a cumulative average? With the rows ordered by date, I want the average to grow with the date: on the first day the average equals that day's revenue, on the second day it is the average of the first two values, on the third day the average of the first three, and so on. The way to do that is to add an ordering to the window definition itself. I cannot order by the label date, because, again, the window function is not aware of labels, but it works fine with logic, so I paste the date expression into the window's ORDER BY. Now I get exactly what I wanted: on the first row the average equals the revenue, and as we move down, each average looks only at the current revenue and all the previous revenues, not at the whole column; on the second row we average the first two revenues, on the third row the first three, and so on.
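
As a checkpoint, here is a sketch of the query at this stage, assuming the pgexercises schema; the ORDER BY inside the window is what turns a global average into a running one.

```sql
-- Total revenue per day plus a cumulative (running) average.
SELECT
    b.starttime::date AS date,
    sum(CASE WHEN b.memid = 0
             THEN b.slots * f.guestcost
             ELSE b.slots * f.membercost END) AS revenue,
    avg(sum(CASE WHEN b.memid = 0
                 THEN b.slots * f.guestcost
                 ELSE b.slots * f.membercost END))
        OVER (ORDER BY b.starttime::date) AS running_avg
FROM cd.bookings b
JOIN cd.facilities f ON f.facid = b.facid
GROUP BY b.starttime::date
ORDER BY date;
```

Nesting sum() inside avg() looks odd at first, but it is exactly the point made above: the window function runs after the grouping, so it can average the already aggregated daily totals.
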
You will realize that we are almost done. The only missing piece is that, right now, if I pick a random day in the data set, the average on that row is computed over all the days that lead up to it, however far back they go. To finish the problem I only want to look 15 days back, so I need to limit how far back in time the window can extend: from unlimited down to 15 days.

This is where it gets interesting, because we need to fine-tune the window definition, and window functions give us exactly that option. It turns out there is another element of the window definition that is usually implicit; it is not normally written out, but it is there in the background: the frame, written with ROWS. If I write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, that frame clause defines how far back and how far forward the window can look, and what I have just written is what happens by default when the window has an ordering, which is why we normally do not write it (strictly speaking the default frame uses RANGE rather than ROWS, but with one row per date the result is the same). It says: look as far back in the past as you can, based on the ordering, up to and including the current row. If I run the query again after adding this part, the values do not change at all, because this is what we have been doing all along. Now, instead of UNBOUNDED PRECEDING, I want to look 14 rows back, which together with the current row makes 15, so I write ROWS BETWEEN 14 PRECEDING AND CURRENT ROW, and the averages change: each one is now computed over the current row and the 14 previous rows, the last 15 values.

What is left to match the expected result is to remove the raw revenue column and call the average revenue, and then to keep only the values for the month of August 2012. We cannot simply add a WHERE filter to this query, though; can you see why? If the query could only see revenue starting from the 1st of August, it would not be able to compute the rolling average for the first days of August, because that average needs to look two weeks back, into July. We need all the data to compute the rolling average, so we must filter after getting the result. So we wrap everything in a common table expression (the ordering inside it is no longer needed), select from it, and filter so the date falls in the required period: we truncate the date at the month level and check that the truncated value equals August 2012 (we saw how date_trunc works in previous exercises). Then we select our columns and order by date. I had left a partial WHERE clause behind, which caused a small error, and once that is cleaned up I finally get the result I wanted.
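
Here is a hedged sketch of the finished query, assuming the pgexercises schema. The frame clause limits the average to the current row plus the 14 preceding rows, and the outer query filters to August 2012 only after the averages have been computed.

```sql
-- 15-day rolling average of total revenue, shown for August 2012 only.
WITH daily AS (
    SELECT
        b.starttime::date AS date,
        avg(sum(CASE WHEN b.memid = 0
                     THEN b.slots * f.guestcost
                     ELSE b.slots * f.membercost END))
            OVER (ORDER BY b.starttime::date
                  ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS revenue
    FROM cd.bookings b
    JOIN cd.facilities f ON f.facid = b.facid
    GROUP BY b.starttime::date
)
SELECT date, revenue
FROM daily
WHERE date_trunc('month', date) = '2012-08-01'::date   -- filter after the window
ORDER BY date;
```
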
So, to summarize what was a slightly more complex query, the final boss of our exercises: we get the data we need by joining bookings and facilities, and we compute the revenue of each booking by multiplying slots by either the guest cost or the member cost, depending on whether the member is a guest or not. Then we group by date and sum those per-booking revenues, so that we get the total revenue within each day across all facilities. That daily total goes into a window function, which computes an aggregation at the level of each row: the average of the daily totals within a specific window, and the window defines an ordering based on the date. The default behavior of the window would be to average the current day together with every day that precedes it, all the way back to the earliest date, and what we do is fine-tune that behavior by saying: do not look all the way back into the past, only look at the 14 preceding rows plus the current row, which, given the time ordering, means the average is computed over the last 15 values of total revenue. Finally we wrap this in a common table expression and filter so that we only see the rolling average for the month of August, ordered by date.

And those were all the exercises I wanted to do with you. I hope you enjoyed it and learned something new. As you know, there are more sections on the site that go deeper into date functions, string functions and modifying data; I really think you can tackle those on your own, as the ones here were the essentials I wanted to address. Once again, thank you to the author of this website, Alisdair Owens, who created it and made it available for free; I did not create it, and you can go there and do these exercises without signing up or paying anything.

My final advice: don't be afraid of repetition. We live in the age of endless content, so there is always something new to do, but there is a lot of value in repeating the same exercises over and over again. When I was preparing for interviews, back when I started as a data engineer, I did these exercises maybe three or four times altogether, and I found it really helpful, because often I did not remember the solution and had to think it through all over again, which strengthened those learning patterns for me. So now that you have gone through all the exercises and seen my solutions, let it rest for a bit, then come back and try to do them again; I think it will be really beneficial. In my course I start from the very basics and show in depth how each of the SQL components works; I explore the logical order of SQL operations and spend a lot of time in Google Sheets simulating SQL operations in a spreadsheet, coloring cells, moving them around, and making drawings in Excalidraw, so that I can help you understand in depth what is happening and build mental models for how SQL operations work. This course was actually intended as a complement to that one, so be sure to check it out.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Ultimate Data Analyst Bootcamp SQL, Excel, Tableau, Power BI, Python, Azure

    Ultimate Data Analyst Bootcamp SQL, Excel, Tableau, Power BI, Python, Azure

    The provided text consists of excerpts from a tutorial series focusing on data cleaning and visualization techniques. One segment details importing and cleaning a “layoffs” dataset in MySQL, emphasizing best practices like creating staging tables to preserve raw data. Another section demonstrates data cleaning and pivot table creation in Excel, highlighting data standardization and duplicate removal. A final part showcases data visualization techniques in Tableau, including the use of bins, calculated fields, and various chart types.

    MySQL & Python Study Guide

    Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. In the MySQL setup, what is the purpose of the password configuration step?
    2. What is the function of the “local instance” in MySQL Workbench?
    3. How do you run SQL code in the query editor?
    4. Explain what the DISTINCT keyword does in SQL.
    5. Describe how comparison operators are used in the WHERE clause.
    6. What is the purpose of logical operators like AND and OR in a WHERE clause?
    7. Explain the difference between INNER JOIN, LEFT JOIN, and RIGHT JOIN.
    8. What is a self join and why would you use it?
    9. What does the CASE statement allow you to do in SQL queries?
    10. How does a subquery work in a WHERE clause?

    Quiz Answer Key

    1. The password configuration step is crucial for securing the MySQL server, ensuring that only authorized users can access and modify the database. It involves setting and confirming a password, safeguarding the system from unauthorized entry.
    2. The “local instance” in MySQL Workbench represents a connection to a database server that is installed and running directly on your computer. It allows you to interact with the database without connecting to an external server.
    3. To run SQL code in the query editor, you type your code in the editor window and then click the lightning bolt execute button. This will execute the code against the connected database and display the results in the output window.
    4. The DISTINCT keyword in SQL is used to select only the unique values from a specified column in a database table. It eliminates duplicate rows from the result set, showing only distinct or different values.
    5. Comparison operators in the WHERE clause, like =, >, <, >=, <=, and !=, are used to define conditions that filter rows based on the comparison between a column and a value or another column. These operators specify which rows will be included in the result set.
    6. Logical operators AND and OR combine multiple conditions in a WHERE clause to create more complex filter criteria. AND requires both conditions to be true, while OR requires at least one condition to be true.
    7. INNER JOIN returns only the rows that have matching values in both tables. LEFT JOIN returns all rows from the left table and matching rows from the right table (or null if no match). RIGHT JOIN returns all rows from the right table and matching rows from the left table (or null if no match).
    8. A self join is a join operation where a table is joined with itself. This can be useful when you need to compare rows within the same table, such as finding employees with a different employee ID, as shown in the secret santa example.
    9. The CASE statement in SQL allows for conditional logic in a query, enabling you to perform different actions or calculations based on specific conditions. It is useful for creating custom outputs such as salary raises based on different criteria.
    10. A subquery in a WHERE clause is a query nested inside another query, usually used to filter rows based on the results of the inner query. It allows you to perform complex filtering using a list of values derived from another query.

    Essay Questions

    Instructions: Answer the following questions in essay format.

    1. Describe the process of setting up a local MySQL server using MySQL Workbench. Include in your response the steps and purpose of each.
    2. Explain how to create a database and tables using a SQL script in MySQL Workbench. Detail the purpose of a script, and how it adds data into the tables.
    3. Compare and contrast the different types of SQL joins, illustrating with examples.
    4. Demonstrate your understanding of comparison operators, logical operators and the like statement and how they are used within the WHERE clause in SQL.
    5. Describe the purpose and functionality of both CASE statements and subqueries in SQL. How do these allow for complex data retrieval and transformation?

    Glossary of Key Terms

    • MySQL: A popular open-source relational database management system (RDBMS).
    • MySQL Workbench: A GUI application for administering MySQL databases, running SQL queries, and managing server configurations.
    • Local Instance: A database server running on the user’s local machine.
    • SQL (Structured Query Language): The standard language for managing and querying data in relational databases.
    • Query Editor: The area in MySQL Workbench where SQL code is written and executed.
    • Schema: A logical grouping of database objects like tables, views, and procedures.
    • Table: A structured collection of data organized into rows and columns.
    • View: A virtual table based on the result set of an SQL statement, useful for simplifying complex queries.
    • Procedure: A stored set of SQL statements that can be executed with a single call.
    • Function: A routine that performs a specific task and returns a value.
    • SELECT statement: The SQL command used to retrieve data from one or more tables.
    • WHERE clause: The SQL clause used to filter rows based on specified conditions.
    • Comparison Operator: Operators like =, >, <, >=, <=, and != used to compare values.
    • Logical Operator: Operators like AND, OR, and NOT used to combine or modify conditions.
    • DISTINCT keyword: Used to select only unique values in a result set.
    • LIKE statement: Used to search for patterns in a string.
    • JOIN: Used to combine rows from two or more tables based on a related column.
    • INNER JOIN: Returns only the rows that match in both tables.
    • LEFT JOIN: Returns all rows from the left table and matching rows from the right table.
    • RIGHT JOIN: Returns all rows from the right table and matching rows from the left table.
    • Self Join: A join where a table is joined with itself.
    • CASE statement: Allows for conditional logic within a SQL query.
    • Subquery: A query nested inside another query.
    • PEMDAS: The order of operations for arithmetic within MySQL: Parentheses, Exponents, Multiplication and Division, Addition and Subtraction.
    • Integer: A whole number, positive or negative.
    • Float: A decimal number.
    • Complex Number: A number with a real and imaginary part.
    • Boolean: A data type with two values: True or False.
    • String: A sequence of characters.
    • List: A mutable sequence of items, enclosed in square brackets [].
    • Tuple: An immutable sequence of items, enclosed in parentheses ().
    • Set: An unordered collection of unique items, enclosed in curly braces {}.
    • Dictionary: A collection of key-value pairs, enclosed in curly braces {}.
    • Index (in Strings and Lists): The position of an item in a sequence. Starts at zero.
    • Append: A method to add an item to the end of a list.
    • Mutable: Able to be changed.
    • Immutable: Not able to be changed.
    • Del: Used to delete an item from a list.
    • Key (Dictionary): A unique identifier that maps to a specific value.
    • Value (Dictionary): The data associated with a specific key.
    • In: A membership operator to check if a value is within a string, list, etc.
    • Not In: The opposite of ‘in’, checks if a value is not within a string, list, etc.
    • If statement: A control flow statement that executes a block of code if a condition is true.
    • elif statement: A control flow statement that checks another condition if the preceding if condition is false.
    • else statement: A control flow statement that executes a block of code if all preceding if or elif conditions are false.
    • Nested if statement: An if statement inside another if statement.
    • For loop: A control flow statement that iterates through a sequence of items.
    • Nested for loop: A for loop inside another for loop.
    • while loop: A control flow statement that executes a block of code as long as a condition is true.
    • Break statement: Stops a loop, even if the while condition is true.
    • Function: A block of code that performs a specific task and can be reused.
    • Def: Keyword to define a function.
    • Arbitrary arguments (*args): Used when the number of positional arguments passed into a function is not known in advance.
    • Keyword arguments: Arguments passed to a function by explicitly naming the parameter that each value belongs to.
    • Arbitrary keyword arguments (**kwargs): Like arbitrary arguments, but for named parameters when the number of keyword arguments is not known in advance.
    • Pandas: A powerful Python library used for data manipulation and analysis.
    • DataFrame: A two-dimensional labeled data structure in Pandas, similar to a spreadsheet or SQL table.
    • Series: A one-dimensional labeled data structure in Pandas.
    • Import: A keyword used to bring in outside packages, libraries, and modules into the current code.
    • .read_csv(): The Pandas function that loads a CSV file into a DataFrame.
    • .loc[]: The Pandas indexer that selects rows and columns by label in the index.
    • .iloc[]: The Pandas indexer that selects rows and columns by integer position in the index.
    • .sort_values(): Pandas function used to order data by a specific column or a list of columns.
    • .rename(): Pandas function that can rename column names.
    • .groupby(): Pandas function that can group all values by a specific column.
    • .reset_index(): Pandas function that converts an index back to a column.
    • .set_index(): Pandas function that creates a column to be an index.
    • .filter(): Pandas function that selects a subset of columns (or rows) from a DataFrame whose labels match a given string or pattern.
    • .isin(): Pandas function that will look through a column to see if it contains specific values.
    • .str.contains(): Pandas function that will look through a column to see if it contains a specific string.
    • Axis: refers to the direction of an operation. 0 is for rows and 1 is for columns.
    • Multi-indexing: Setting more than one index to your pandas data frame.
    • .str.split(): Pandas function that splits a column string by a delimiter.
    • .str.replace(): Pandas function that replaces strings within a column with another string.
    • .fillna(): Pandas function that fills in any null values within a data frame.
    • .explode(): Pandas function that will duplicate rows when a specific column contains multiple values.
    • Azure Synapse Analytics: A limitless analytics service that enables data processing and storage within the Azure cloud.
    • SQL Pool: A SQL based service within Azure Synapse.
    • Spark Pool: An Apache Spark-based compute service within Azure Synapse, commonly used from Python notebooks.
    • Delimiter: A character or sequence of characters that separates values in a string.
    • Substring: A string within a string.
    • Seaborn: Python plotting library based on matplotlib that creates graphs with complex visualizations.
    • Matplotlib: Python plotting library that allows you to make basic graphs and charts.
    • Wild card: A symbol that acts like a placeholder and can substitute for a variety of different characters.
    • ETL: Extract, Transform, Load; the process carried out by a data pipeline.
    • Data pipeline: The automated process that moves data from source systems to a destination, often applying transformations along the way.

    SQL, Python, and Pandas Data Wrangling


    Briefing Document: MySQL, SQL Concepts, Python Data Types, and Data Manipulation

    Overview: This document consolidates information from various sources to provide a comprehensive overview of key concepts related to database management (MySQL), SQL query writing, fundamental Python data types and operations, and data manipulation techniques using pandas. It will be organized into the following sections:

    1. MySQL Setup and Basic Usage:
    • Initial configuration of MySQL server and related tools.
    • Creation of databases and tables.
    • Introduction to SQL query writing.
    • Saving and loading SQL code.
    2. SQL Query Writing and Data Filtering:
    • Using the SELECT statement to retrieve and manipulate columns.
    • Applying the WHERE clause to filter rows.
    • Utilizing comparison and logical operators within WHERE clauses.
    • Working with LIKE statements for pattern matching.
    3. SQL Joins and Data Combination:
    • Understanding inner joins, left joins, right joins, and self joins.
    • Combining data from multiple tables based on matching columns.
    4. SQL Functions and Subqueries:
    • Using Case statements for conditional logic.
    • Understanding and applying subqueries in various contexts (WHERE, SELECT, FROM).
    • Using aggregation functions with group by
    • Understanding window functions
    5. Python Data Types and Operations:
    • Overview of numeric, boolean, and sequence data types (strings, lists, tuples, sets, dictionaries).
    • String manipulation techniques.
    • List manipulation techniques.
    • Introduction to sets and dictionaries.
    6. Python Operators, Control Flow, and Functions:
    • Using comparison, logical, and membership operators in python.
    • Understanding and using conditional statements (if, elif, else).
    • Implementing for and while loops.
    • Creating and using functions, with an understanding of different argument types.
    7. Pandas Data Manipulation and Visualization:
    • Data loading into pandas dataframes.
    • Filtering, sorting, and manipulating data in a dataframe
    • Working with indexes and multi-indexes
    • Cleaning data using functions such as replace, fillna, and split.
    • Basic data visualizations.

    Detailed Breakdown:

    1. MySQL Setup and Basic Usage:

    • The source demonstrates the setup process of MySQL, including password creation, and configuration as a Windows service.
    • “I’m just going to go ahead and create a password now for you and I keep getting this error and I can’t explain why right here for you you should be creating a password at the bottom…”
    • The tutorial covers setting up sample databases and launching MySQL Workbench.
    • It showcases connecting to a local instance and opening an SQL script file for database creation.
    • The process of creating a “Parks and Recreation” database using an SQL script is outlined:
    • “Now what I’m going to do is I’m going to go ahead and I’m going to say open a SQL script file in a new query Tab and right here it opened up to a folder that I already created this my SQL beginner series folder within it we have this right here the Parks and Rec creat _ DB…”
    • The script creates tables and inserts data, showcasing fundamental SQL operations.
    • Running code with the lightning bolt button to execute SQL scripts, and refreshing the schema with the refresh button.

    2. SQL Query Writing and Data Filtering:

    • The source introduces the SELECT statement, showing how to select specific columns.
    • “The first thing that we’re going to click on is right over here this is our local instance this is local to just our machine it’s not a connection to you know some other database on the cloud or anything like that it’s just our local instance…”
    • It demonstrates how to format SQL code for readability, including splitting SELECT statements across multiple rows.
    • “typically can be easier to read also if you’re doing any type of functions or calculations in the select statement it’s easier to separate those out on its individual row.”
    • The use of calculations in SELECT statements and how MySQL follows the order of operations (PEMDAS) is shown.
    • “now something really important to know about any type of calculations any math within my SQL is that it follows the rules of pemos now pemos is written like this it’s pmde s now what I just did right here with this pound or this hashtag is actually create a comment…”
    • The DISTINCT keyword is explained and demonstrated, showing how to select unique values within a column or combinations of columns.
    • “what distinct is going to do is it’s going to select only the unique values within a column…”
    • The WHERE clause is explored for filtering data.
    • “hello everybody in this lesson we’re going to be taking a look at the wear Clause the wear Clause is used to help filter our records or our rows of data…”
    • Comparison operators (equal, greater than, less than, not equal) are discussed and exemplified with various data types (integers, strings, dates).
    • Logical operators (AND, OR, NOT) are introduced and how they can be combined to create complex conditional statements in WHERE clauses.
    • The LIKE operator is introduced to search for specific patterns; a combined sketch follows this list.
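
    To make these points concrete, here is a small hedged sketch combining SELECT, DISTINCT, a calculation, WHERE with comparison and logical operators, and LIKE. It assumes the tutorial's employee_demographics table with first_name, age and gender columns, as used in the FAQ examples later in this document.

    ```sql
    SELECT DISTINCT first_name,
           age + 10 AS age_plus_ten          -- calculations follow PEMDAS
    FROM employee_demographics
    WHERE (gender = 'female' AND age >= 30)  -- comparison + logical operators
       OR first_name LIKE 'L%'               -- % matches zero or more characters
    ORDER BY first_name;
    ```
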

    3. SQL Joins and Data Combination:

    • The concepts of inner joins, left joins, right joins, and self-joins are introduced.
    • Inner joins are demonstrated for combining data from two tables with matching columns.
    • “An inner join is basically when you tie two tables together and it only returns records or rows that have matching values in both tables…”
    • Left joins and right joins are compared to include all rows from one table and only matching rows from the other, and that it populates nulls for the mismatched data.
    • “A left join is going to take everything from the left table even if there’s no match in the join and then it will only return the matches from the right table the exact opposite is true for a right join…”
    • Self joins are explained and demonstrated, including how a use case for secret Santa assignments can be done using SQL self-joins.
    • “now what is a self jooin it is a join where you tie the table to itself now why would you want to do this let’s take a look at a very serious use case…”
    • Aliases for tables are used to avoid ambiguity when joining tables that have similar columns; a short join sketch follows this list.
    • “So in our field list which is right up here in the select statement we have this employee ID it does not know which employee ID to pull from whether it’s the demographics or the salary so we have to tell it which one to pull from so let’s pull it from the demographics by saying dm. employee ID…”
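
    A short hedged sketch of the join types described above, assuming employee_demographics and employee_salary tables that share an employee_id column; the Secret Santa pairing rule is purely illustrative.

    ```sql
    -- INNER JOIN: only employees present in both tables.
    SELECT dem.employee_id,
           dem.first_name,
           sal.salary
    FROM employee_demographics AS dem
    INNER JOIN employee_salary AS sal
            ON dem.employee_id = sal.employee_id;

    -- LEFT JOIN keeps every row from employee_demographics and fills salary
    -- with NULL where there is no match; RIGHT JOIN is the mirror image.

    -- Self join: pair a table with itself, e.g. a toy Secret Santa assignment.
    SELECT giver.first_name    AS gifter,
           receiver.first_name AS receiver
    FROM employee_demographics AS giver
    JOIN employee_demographics AS receiver
      ON giver.employee_id + 1 = receiver.employee_id;
    ```
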

    4. SQL Functions and Subqueries:

    • The use of CASE statements for conditional logic in queries is covered to derive new columns and create custom business logic.
    • “these are the guidelines that the pony Council sent out and it is our job to determine and figure out those pay increases as well as the bonuses…”
    • Subqueries are introduced as a means to nest queries and further filter data.
    • “now subquery is basically just a query within another query…”
    • Subqueries in WHERE clauses, SELECT statements, and FROM clauses are demonstrated through various examples.
    • “we want to say where the employee undor ID that’s referencing this column in the demographics table is in what we’re going to do is we’re going to do a parenthesis here and we can even come down and put a parenthesis down here so what we’re going to do now is write our query which is our subquery and this is our outer query…”
    • The use of group by and aggregate functions is shown.
    • “if we’re going to say group by and then we’ll do department ID that’s how we’ll know which one to to group this by…”
    • The use of window functions is shown; a combined sketch follows this list.
    • “Window functions work in a way that when you have an aggregation you’re now creating a new column based off of that aggregation but you’re including the rows that were not in the group by…”
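
    The following hedged sketch ties these pieces together: a subquery in the WHERE clause, GROUP BY with an aggregate, a CASE expression, and a window function on top of the grouped result. It reuses the assumed employee_demographics and employee_salary tables, and the salary thresholds are invented for illustration.

    ```sql
    WITH gender_pay AS (
        SELECT dem.gender,
               AVG(sal.salary) AS avg_salary              -- aggregate with GROUP BY
        FROM employee_demographics AS dem
        JOIN employee_salary AS sal
          ON dem.employee_id = sal.employee_id
        WHERE dem.employee_id IN (                        -- subquery filter
                SELECT employee_id
                FROM employee_salary
                WHERE salary > 30000)
        GROUP BY dem.gender
    )
    SELECT gender,
           avg_salary,
           CASE WHEN avg_salary >= 60000 THEN 'above target'   -- conditional logic
                ELSE 'below target'
           END AS pay_band,
           RANK() OVER (ORDER BY avg_salary DESC) AS pay_rank  -- window function
    FROM gender_pay;
    ```
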

    5. Python Data Types and Operations:

    • Numeric data types (integers, floats, complex numbers) are defined and illustrated.
    • “There are three different types of numeric data types we have integers float and complex numbers let’s take a look at integers…”
    • Boolean data types (True and False) and their use in comparisons are demonstrated.
    • Sequence data types such as strings are introduced.
    • “in Python strings are arrays of bytes representing Unicode characters…”
    • String indexing, slicing, and multiplication are demonstrated.
    • Lists as mutable collections of multiple values are discussed.
    • List indexing and the append method are shown.
    • Nested lists are also shown.
    • Tuples as immutable collections and their differences from lists are explained.
    • “a list and a tupal are actually quite similar but the biggest difference between a list and a tupal is that a tupal is something called immutable…”
    • Sets as unordered collections with no duplicates are shown.
    • “a set is somewhat similar to a list and a tuple but they are a little bit different in fact that they don’t have any duplicate elements…”
    • Dictionaries as key-value pairs for storing data are explained.
    • “A dictionary is basically used to store data values in key value pairs…”

    6. Python Operators, Control Flow, and Functions:

    • Comparison operators, their purpose, and examples are shown.
    • “operators are used to perform operations on variables and values for example you’re often going to want to compare two separate values to see if they are the same or if they’re different within Python…”
    • Logical operators are defined and illustrated with examples.
    • Membership operators (in, not in) and their purpose is shown.
    • Conditional statements (if, elif, else) are introduced and used with various logical and comparison operators.
    • “today we’re going to be taking a look at the if statement within python…”
    • For and while loops are explained along with the break statement to halt loops.
    • “today we’re going to be taking a look at while Loops in Python the while loop in Python is used to iterate over a block of code as long as the test condition is true…”
    • Functions are introduced and how to create functions using parameters is shown.
    • “today we’re going to be taking a look at functions in Python functions are basically a block of code that only runs when it is called…”
    • The concept of an arbitrary argument is introduced for functions, as well as keyword arguments.

    7. Pandas Data Manipulation and Visualization:

    • Data loading into pandas DataFrames and the use of the read_csv function.
    • Filtering based off of columns using loc and iloc is shown.
    • “there’s two different ways that you can do that at least this is a very common way that people who use pandas will do to kind of search through that index the first one is called lock and there’s lock and ick…”
    • Filtering using the isin and str.contains methods.
    • Data sorting and ordering using sort_values.
    • “now we can sort and order these values instead of it just being kind of a jumbled mess in here we can sort these columns however we would like ascending descending ing multiple columns single columns…”
    • Working with indexes and multi-indexes in pandas dataframes.
    • “multi-indexing is creating multiple indexes we’re not just going to create the country as the index now we’re going to add an additional index on top of that…”
    • Cleaning columns using functions such as split, replace, and fillna.
    • “we want to split on this column and then we’ll be able to create three separate columns based off of this one column which is exactly what we want…”
    • Basic data visualizations with seaborn
    • “we’re going to import Seaborn as SNS and if we need to um we’re going to import map plot lib as well I don’t know if we’ll use it right now or at all but um we’re going to we’re going to add it in here either way…”

    Conclusion: These sources provide a foundational understanding of SQL, MySQL, Python data types, and pandas, covering the basics needed to perform common data tasks. They should provide a strong basis for continuing further learning.

    Essential SQL: A Beginner’s Guide

    8 Question FAQ:

    1. How do I set up a local MySQL server and create a database? To set up a local MySQL server, you’ll typically download and install the MySQL server software for your operating system. During the installation process, you’ll be prompted to create a root password, and configure MySQL as a Windows service if you’re on Windows. It is best practice to set MySQL to start at system startup for convenience. Once the server is configured, you can use MySQL Workbench or a similar tool to connect to your local server. To create a database, you can execute SQL code to create the database and its tables. You can either write this code yourself, or import it as an SQL script. This script will contain CREATE DATABASE, CREATE TABLE, and INSERT statements to build your database and populate it with initial data.
    2. What is the purpose of a SQL query editor and how do I use it? A SQL query editor is a tool that allows you to write and execute SQL code against your database. You can use a query editor to create, modify, and retrieve data from your database. In MySQL Workbench, the query editor is typically a text area where you can type your SQL code. You can also open a file containing SQL code. After typing or importing your SQL code, you can execute it by clicking a run button (usually a lightning bolt icon) or pressing a hotkey. The results of your query will typically be displayed in an output window or a separate pane within the query editor.
    3. What is a SELECT statement in SQL, and how can I use it to retrieve data? A SELECT statement is used to retrieve data from one or more tables in a database. You specify which columns to retrieve with the SELECT keyword followed by a list of columns (or an asterisk * for all columns) and then the table from which you are selecting. It has the following structure: SELECT column1, column2 FROM table_name;. You can use commas to separate out multiple column names, and it is best practice to write a comma after each column name and put it on an individual line, especially when making functions or calculations within the select statement. Additionally, you can perform calculations in your SELECT statement such as adding 10 years to an age field age + 10, and also use aliases like AS to name those columns.
    4. What are comments in SQL, and how can they be used? Comments in SQL are used to add notes and explanations to your SQL code. They are ignored by the database engine when executing the code. Comments can be used for documentation, debugging, and explanation purposes. Comments in SQL are denoted in various ways depending on the specific engine; MySQL uses the pound or hashtag symbol # to comment out code on a single line. You can also use -- (two dashes) before the line you wish to comment out. Comments help make your code more readable and easier to understand for yourself and other users of the database.
    5. What is the DISTINCT keyword in SQL, and what is its use? The DISTINCT keyword is used in a SELECT statement to retrieve only unique values from one or more columns. It eliminates duplicate rows from the result set. When you use DISTINCT with a single column, you’ll get a list of each unique value in that column. If you use it with multiple columns, you’ll get a list of rows where the combination of values in those columns is unique. For example SELECT DISTINCT gender FROM employee_demographics; will return the two unique values in the gender column.
    6. How can I use the WHERE clause to filter data in SQL, and what operators can I use? The WHERE clause is used in a SELECT statement to filter the data based on specific conditions. It only returns rows that match the criteria specified in the WHERE clause. You can use various comparison operators within the WHERE clause, such as =, >, <, >=, <=, and != (not equal). You can also use logical operators like AND, OR, and NOT to combine multiple conditions. For example, SELECT * FROM employee_demographics WHERE gender = 'female' will return all female employees, or, with AND or OR operators, you can filter based on multiple conditions, like WHERE birth_date > '1985-01-01' AND gender = 'male', which would return all male employees born after 1985.
    7. How do logical operators like AND, OR, and NOT work in conjunction with the WHERE clause and what is PEMDAS? Logical operators such as AND, OR, and NOT combine multiple conditions within a WHERE clause and can be applied to math operations as well as other types of operators. AND requires both conditions to be true to return a row. OR requires at least one of the conditions to be true. NOT negates a condition, turning a true statement false and a false statement true. Evaluation within the WHERE clause also follows PEMDAS (Parentheses, Exponents, Multiplication, Division, Addition, Subtraction), the mathematical order of operations, which dictates the order in which calculations and grouped conditions are evaluated. For example, a statement like WHERE (first_name = 'Leslie' AND age = 44) OR age > 55 will first evaluate the grouped parentheses and then consider the outside condition based on the OR operator.
    8. What is the LIKE operator in SQL, and how can I use it for pattern matching? The LIKE operator is used in a WHERE clause for pattern matching with wildcards. You don’t have to have an exact match when using the LIKE operator. The percent sign % is used as a wildcard to match zero or more characters, and the underscore _ is used to match a single character. For instance, SELECT * FROM employee_demographics WHERE first_name LIKE 'L%' will return employees with first names starting with “L”. Or, SELECT * FROM employee_demographics WHERE first_name LIKE 'L_s%' returns first names that start with “L”, then one character, and then an “s”. The LIKE operator is very helpful when you don’t know exactly what the values in a field will be and you just want to query values based on patterns.

    Data Import and Transformation Methods

    Data can be imported into various platforms for analysis and visualization, as described in the sources. Here’s a breakdown of the import processes discussed:

    • MySQL: Data can be imported into MySQL using a browse function, and a new table can be created for the imported data [1]. MySQL automatically assigns data types based on the column data [1]. However, data types can be modified, such as changing a text-based date column to a date/time format [1].
    • Power BI: Data can be imported from various sources including Excel, databases, and cloud storage [2].
    • When importing from Excel, users can choose specific sheets to import [2].
    • Power Query is used to transform the data, which includes steps to rename columns, filter data, and apply other transformations [2, 3].
    • After transformation, the data can be loaded into Power BI Desktop [2].
    • Data can also be imported by using the “Get Data” option which will bring up several different options for the user to select from, including databases, blob storages, SQL databases, and Google Analytics [2].
    • Multiple tables or Excel sheets can be joined together in Power BI, using the “Model” tab [2].
    • Azure Data Factory: Data from a SQL database can be copied to Azure Blob storage. This involves selecting the source (SQL database) and destination (Azure Blob storage), configuring the file format (e.g., CSV), and setting up a pipeline to automate the process [4].
    • Azure Synapse Analytics: Data can be imported from various sources, including Azure Blob Storage [5].
    • Data flows in Azure Synapse Analytics allow users to transform and combine data from different sources [5].
    • The copy data tool can be used to copy data from blob storage to another location, such as a different blob storage or an Azure SQL database [6].
    • Amazon Athena: Amazon Athena queries data directly from S3 buckets without needing to load data into a database [7].
    • To import data, a table needs to be created, specifying the S3 bucket location, the data format (e.g., CSV), and the column details [7].
    • Crawlers can be used to automate the process of inferring the data schema from a data source, such as an S3 bucket [7].
    • AWS Glue Data Brew: AWS Glue Data Brew is a visual data preparation tool where data sets can be imported and transformed. Sample projects can also be created and modified for practice [8].

    In several of the tools described, there are options to transform data as part of the import process, which is a crucial step in data analysis workflows.

    Data Cleaning Techniques Across Platforms

    Data cleaning is a crucial step in preparing data for analysis and involves several techniques to ensure data accuracy, consistency, and usability. The sources describe various methods and tools for cleaning data, with specific examples for different platforms.

    General Data Cleaning Steps

    • Removing Duplicates: This involves identifying and removing duplicate records to avoid redundancy in analysis. In SQL, this can be done by creating a temporary column, identifying duplicates, and then deleting them [1, 2]. In Excel, there is a “remove duplicates” function to easily remove duplicates [3].
    • Standardizing Data: This step focuses on ensuring consistency in the data. It includes fixing spelling errors, standardizing formatting (e.g., capitalization, spacing), and unifying different representations of the same data (e.g., “crypto,” “cryptocurrency”) [1, 2, 4]. In SQL, functions like TRIM can be used to remove extra spaces, and UPDATE statements can standardize data [2]. In Excel, find and replace functions can be used to standardize the data [3].
    • Handling Null and Blank Values: This involves identifying and addressing missing data. Depending on the context, null or blank values may be populated using available information, or the rows may be removed, if the data is deemed unreliable [1, 2].
    • Removing Unnecessary Columns/Rows: This step focuses on removing irrelevant data, whether columns or rows, to streamline the data set and improve processing time. However, it’s often best practice to create a staging table to avoid making changes to the raw data [1].
    • Data Type Validation: Ensure that the data types of columns are correct. For example, date columns should be in a date/time format, and numerical columns should not contain text. This ensures that the data is in the correct format for any analysis [1, 4].

    Platform-Specific Data Cleaning Techniques

    • SQL: Creating staging tables: To avoid altering raw data, a copy of the raw data can be inserted into a staging table and the cleaning operations can be performed on that copy [1].
    • Removing duplicate rows: A temporary column can be added to identify duplicates based on multiple columns [2]. Then, a DELETE statement can be used to remove the identified duplicates.
    • Standardizing data: The TRIM function can be used to remove extra spaces, and UPDATE statements with WHERE clauses are used to correct errors [2].
    • Removing columns: The ALTER TABLE command can be used to drop a column [5].
    • Filtering rows: The DELETE command can be used to remove rows that do not meet certain criteria (e.g., those with null values in certain columns) [5]. A combined SQL sketch appears at the end of this list.
    • Excel: Removing duplicates: The “Remove Duplicates” feature removes rows with duplicate values [3].
    • Standardizing formatting: Find and replace can standardize capitalization, and “Text to Columns” can split data into multiple columns [3, 4].
    • Trimming spaces: Extra spaces can be removed with the trim function [2].
    • Data Validation: You can use data validation tools to limit the type of data that can be entered into a cell, which helps in maintaining clean data.
    • Using formulas for cleaning: Logical formulas like IF statements can create new columns based on conditions that you set [3].
    • Power BI: Power Query Editor: Power Query is used to clean and transform data. This includes removing columns, filtering rows, changing data types, and replacing values.
    • Creating Calculated Columns: New columns can be created using formulas (DAX) to perform calculations or derive new data from existing columns.
    • Python (Pandas): Dropping duplicates: The drop_duplicates() function removes duplicate rows [6].
    • Handling missing values: The .isnull() and .fillna() functions are used to identify and handle null values [7].
    • String manipulation: String methods such as .strip() and .replace() are used to standardize text data [8].
    • Data type conversion: The .astype() function can convert data to appropriate types such as integers, floats, or datetime [8].
    • Sorting values: The .sort_values() function can sort data based on one or more columns [7].
    • AWS Glue Data Brew: Data Brew is a visual data preparation tool that offers a user-friendly interface for data cleaning.
    • Visual Transformation: Allows visual application of transformations, such as filters, sorts, and grouping, using a drag-and-drop interface [9].
    • Recipes: Creates and saves a recipe of all data cleaning steps, which can be re-used for other datasets [9].
    • Filtering Data: Data can be filtered using conditions (e.g., gender equals male) [9, 10].
    • Grouping and Aggregation: Data can be grouped on one or more columns to aggregate values (e.g., counts), and the results can be sorted to identify key trends in the data [10].
    • Sample Data: Users can test their cleaning steps on a sample of the data before running it on the full dataset [9, 10].
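
    To make the SQL bullets above concrete, here is a hedged sketch of the staging-table approach together with TRIM, UPDATE, ALTER TABLE and DELETE. The layoffs table and its columns are illustrative, loosely based on the layoffs dataset mentioned earlier in this document.

    ```sql
    -- Work on a staging copy so the raw table stays untouched.
    CREATE TABLE layoffs_staging LIKE layoffs;
    INSERT INTO layoffs_staging SELECT * FROM layoffs;

    -- Standardize data: strip stray spaces and unify spelling variants.
    UPDATE layoffs_staging SET company = TRIM(company);
    UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';

    -- Drop a column that is not needed (illustrative column name).
    ALTER TABLE layoffs_staging DROP COLUMN unused_column;

    -- Remove rows that carry no usable information.
    DELETE FROM layoffs_staging
    WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;
    ```
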

    In summary, the specific methods and tools used for data cleaning depend on the platform, data type, and specific requirements of the analysis. However, the general concepts of removing duplicates, standardizing data, and handling missing values apply across all platforms.

    Data Deduplication in SQL, Excel, and Python

    Duplicate removal is a key step in data cleaning, ensuring that each record is unique and avoiding skewed analysis due to redundant information [1-3]. The sources discuss several methods for identifying and removing duplicates across different platforms, including SQL, Excel, and Python [1-3].

    Here’s an overview of how duplicate removal is handled in the sources:

    SQL

    • Identifying Duplicates: SQL requires a step to first identify duplicate rows [4]. This can be achieved by using functions such as ROW_NUMBER() to assign a unique number to each row based on a specified partition [4]. The partition is defined by the columns that should be considered when determining duplicates [4].
    • Removing Duplicates: Once the duplicates have been identified (e.g., by filtering for rows where ROW_NUMBER() is greater than 1), they can be removed. Because you can’t directly update a CTE (Common Table Expression), this is often done by creating a staging table [4]. Then, the duplicate rows can be filtered and removed from the staging table [4]. A short sketch follows.
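
    A hedged sketch of the ROW_NUMBER() approach described in the two bullets above, written for MySQL; the table name and the columns in the PARTITION BY are illustrative.

    ```sql
    -- Copy the data into a staging table, numbering duplicates within each
    -- partition, then delete every row numbered greater than 1.
    CREATE TABLE layoffs_staging2 AS
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off, `date`
           ) AS row_num
    FROM layoffs_staging AS t;

    DELETE FROM layoffs_staging2
    WHERE row_num > 1;
    ```
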

    Excel

    • Built-in Functionality: Excel offers a built-in “Remove Duplicates” feature located in the “Data” tab [2]. This feature allows users to quickly remove duplicate rows based on selected columns [2].
    • Highlighting Duplicates: Conditional formatting can be used to highlight duplicate values in a data set [5]. You can sort by the highlighted color to bring duplicates to the top of your data set, then remove them [5].

    Python (Pandas)

    • drop_duplicates() Function: Pandas provides a straightforward way to remove duplicate rows using the drop_duplicates() function [3]. This function can remove duplicates based on all columns or on a subset of columns [3], as shown in the brief example below.
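
    A minimal example of drop_duplicates(), using an invented orders DataFrame; the subset and keep arguments control which columns define a duplicate and which occurrence is retained.

    ```python
    import pandas as pd

    orders = pd.DataFrame({
        "order_id": [101, 101, 102],
        "customer": ["Ana", "Ana", "Ben"],
        "amount":   [25.0, 25.0, 40.0],
    })

    # Remove rows that are duplicated across all columns.
    unique_all = orders.drop_duplicates()

    # Remove duplicates based on a subset of columns, keeping the first occurrence.
    unique_by_id = orders.drop_duplicates(subset=["order_id"], keep="first")
    print(unique_by_id)
    ```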

    Key Considerations

    • Unique Identifiers: The presence of a unique identifier column (e.g., a customer ID) can greatly simplify the process of identifying and removing duplicates [4, 5].
    • Multiple Columns: When determining duplicates, it may be necessary to consider multiple columns [4]. This is important if no single column is sufficient for identifying unique records [4].
    • Data Integrity: It’s important to be careful when removing duplicates, as it can alter your dataset if not handled correctly. Creating a backup or working on a copy is generally recommended before removing any duplicates [1].
    • Real-World Data: In real-world datasets with many columns and rows, identifying duplicates can be challenging [2, 3]. Automated tools and techniques, like those described above, are crucial to handling large datasets [2, 3].

    In summary, while the specific tools and syntax differ, the goal of duplicate removal is consistent across SQL, Excel, and Python: to ensure data quality and prevent skewed results due to redundant data [1-3]. Each of these platforms provides effective ways to manage and eliminate duplicate records.

    Data Analysis Techniques and Tools

    Data analysis involves exploring, transforming, and interpreting data to extract meaningful insights, identify patterns, and support decision-making [1-18]. The sources describe various techniques, tools, and platforms used for this process, and include details on how to perform analysis using SQL, Excel, Python, and business intelligence tools.

    Key Concepts and Techniques

    • Exploratory Data Analysis (EDA): EDA is a critical initial step in which data is examined to understand its characteristics, identify patterns, and discover anomalies [2, 10]. This process often involves the following (several of the techniques in this list are illustrated in the Pandas sketch after it):
    • Data Visualization: Using charts, graphs, and other visual aids to identify trends, patterns, and outliers in the data. Tools such as Tableau, Power BI, and QuickSight are commonly used for this [1, 3, 6, 8, 18].
    • Summary Statistics: Computing measures such as mean, median, standard deviation, and percentiles to describe the central tendency and distribution of the data [10].
    • Data Grouping and Aggregation: Combining data based on common attributes and applying aggregation functions (e.g., sum, count, average) to produce summary measures for different groups [2, 13].
    • Identifying Outliers: Locating data points that deviate significantly from the rest of the data, which may indicate errors or require further investigation [10]. Box plots can be used to visually identify outliers [10].
    • Data Transformation: This step involves modifying data to make it suitable for analysis [1, 2, 6, 7, 10, 13, 16, 17]. This can include:
    • Data Cleaning: Addressing missing values, removing duplicates, correcting errors, and standardizing data formats [1-8, 10, 11, 16, 17].
    • Data Normalization: Adjusting values to a common scale to make comparisons easier [8, 16].
    • Feature Engineering: Creating new variables from existing data to improve analysis [10]. This can involve using calculated fields [3].
    • Data Type Conversions: Ensuring that columns have the correct data types (e.g., converting text to numbers or dates) [2, 4, 10].
    • Data Querying: Using query languages (e.g., SQL) to extract relevant data from databases and data warehouses [1, 11-14].
    • Filtering: Selecting rows that meet specified criteria [1, 11].
    • Joining Data: Combining data from multiple tables based on common columns [2, 5, 9].
    • Aggregating Data: Performing calculations on groups of data (e.g., using GROUP BY and aggregate functions) [2, 13, 14].
    • Window Functions: Performing calculations across a set of rows that are related to the current row, which are useful for tasks like comparing consecutive values [11].
    • Statistical Analysis: Applying statistical techniques to test hypotheses and draw inferences from data [10].
    • Regression Analysis: Examining the relationships between variables to make predictions [10].
    • Correlation Analysis: Measuring the degree to which two or more variables tend to vary together [10].
    • Data Modeling: Creating representations of data structures and relationships to support data analysis and reporting [5, 11].
    • Data Interpretation: Drawing conclusions from the analysis and communicating findings effectively using visualizations and reports [3, 6, 8, 18].
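
    As a rough, combined illustration of several of these techniques, the sketch below (with invented data and column names) shows summary statistics, grouping and aggregation, a join, a window-style comparison of consecutive values, and a correlation check using Pandas.

    ```python
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "month":   ["2023-01", "2023-02", "2023-01", "2023-02"],
        "revenue": [120.0, 150.0, 90.0, 110.0],
        "units":   [12, 14, 9, 10],
    })
    regions = pd.DataFrame({"region": ["East", "West"], "manager": ["Ana", "Ben"]})

    # Summary statistics: central tendency and spread of the numeric columns.
    print(sales[["revenue", "units"]].describe())

    # Grouping and aggregation: summary measures per region.
    print(sales.groupby("region").agg(total_revenue=("revenue", "sum"),
                                      avg_units=("units", "mean")))

    # Joining data: combine tables on a common column.
    enriched = sales.merge(regions, on="region", how="left")

    # Window-style calculation: compare each month's revenue with the previous one.
    enriched = enriched.sort_values(["region", "month"])
    enriched["prev_revenue"] = enriched.groupby("region")["revenue"].shift(1)
    enriched["change"] = enriched["revenue"] - enriched["prev_revenue"]
    print(enriched)

    # Correlation analysis: how strongly revenue and units vary together.
    print(sales["revenue"].corr(sales["units"]))
    ```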

    Tools and Platforms

    The sources describe multiple tools and platforms that support different types of data analysis:

    • SQL: Used for data querying, transformation, and analysis within databases. SQL is particularly useful for extracting and aggregating data from relational databases and data warehouses [1, 2, 11-14].
    • Excel: A versatile tool for data manipulation, analysis, and visualization, particularly for smaller datasets [2, 4, 6-8].
    • Python (Pandas): A programming language that offers powerful libraries for data manipulation, transformation, and analysis. Pandas provides data structures and functions for working with structured data [1, 4, 9, 10].
    • Tableau: A business intelligence (BI) tool for creating interactive data visualizations and dashboards [1, 3].
    • Power BI: Another BI tool for visualizing and analyzing data, often used for creating reports and dashboards [1, 5, 6]. Power BI also includes Power Query for data transformation [5].
    • QuickSight: A cloud-based data visualization service provided by AWS [18].
    • Azure Synapse Analytics: A platform that integrates data warehousing and big data analytics. It provides tools for querying, transforming, and analyzing data [1, 12].
    • AWS Glue: A cloud-based ETL service that can be used to prepare and transform data for analysis [15, 17].
    • Amazon Athena: A serverless query service that enables you to analyze data in S3 using SQL [1, 14].

    Specific Analysis Examples

    • Analyzing sales data to identify trends and patterns [3].
    • Analyzing survey data to determine customer satisfaction and preferences [6, 7].
    • Analyzing geographical data by creating maps [3].
    • Analyzing text data to identify keywords and themes [4, 10].
    • Analyzing video game sales by year ranges and percentages [3].
    • Analyzing Airbnb data to understand pricing, location, and review information [4].

    Considerations for Effective Data Analysis

    • Data Quality: Clean and accurate data is essential for reliable analysis [2, 4-7, 10, 11, 16, 17].
    • Data Understanding: A thorough understanding of the data and its limitations is crucial [4].
    • Appropriate Techniques: Selecting the right analytical methods and tools to address the specific questions being asked is important.
    • Clear Communication: Effectively communicating findings through visualizations and reports is a critical component of data analysis.
    • Iterative Process: Data analysis is often an iterative process that may involve going back and forth between different steps to refine the analysis and insights.

    In summary, data analysis is a multifaceted process that involves a variety of techniques, tools, and platforms. The specific methods used depend on the data, the questions being asked, and the goals of the analysis. A combination of technical skills, analytical thinking, and effective communication is needed to produce meaningful insights from data.

    Data Visualization Techniques and Tools

    Data visualization is the graphical representation of information and data, and is a key component of data analysis that helps in understanding trends, patterns, and outliers in data [1]. The sources describe various visualization types and tools used for creating effective data visualizations.

    Key Concepts and Techniques

    • Purpose: The primary goal of data visualization is to communicate complex information clearly and efficiently, making it easier for the user to draw insights and make informed decisions [1].
    • Chart Selection: Choosing the correct type of visualization is crucial, as different charts are suited to different kinds of data and analysis goals [1]. (A minimal matplotlib sketch follows this list.)
    • Bar Charts and Column Charts: These are used for comparing categorical data, with bar charts displaying horizontal bars and column charts displaying vertical columns [1, 2]. Stacked bar and column charts are useful for showing parts of a whole [2].
    • Line Charts: These are ideal for showing trends over time or continuous data [2, 3].
    • Scatter Plots: Scatter plots are used to explore the relationship between two numerical variables by plotting data points on a graph [2-4].
    • Histograms: These charts are useful for displaying the distribution of numerical variables, showing how frequently different values occur within a dataset [4].
    • Pie Charts and Donut Charts: Pie and donut charts are useful for showing parts of a whole, but it can be difficult to compare the sizes of slices when there are many categories [2, 5].
    • Tree Maps: Tree maps display hierarchical data as a set of nested rectangles, where the size of each rectangle corresponds to a value [2].
    • Area Charts: Area charts are similar to line charts but fill the area below the line, which can be useful for emphasizing the magnitude of change [2, 5].
    • Combination Charts: Combining different chart types (e.g., line and bar charts) can be effective for showing multiple aspects of the same data [2].
    • Gauges: Gauge charts are useful for displaying progress toward a goal or a single key performance indicator (KPI) [6].
    • Color Coding: Using color effectively to highlight different data categories or to show the magnitude of data. In line graphs, different colors can represent different data series [3].
    • Data Labels: Adding data labels to charts to make the data values more explicit and easy to read, which can improve the clarity of a visualization [2, 3].
    • Interactive Elements: Including interactive features such as filters, drill-downs, and tooltips can provide more options for exploration and deeper insights [2, 3, 7].
    • Drill-Downs: These allow users to explore data at multiple levels of detail, by clicking on one level of the visualization to see the next level down in the hierarchy [7].
    • Filters: Filters allow users to view specific subsets of data and are especially useful in client-facing reports [3].
    • Titles and Labels: Adding clear titles and axis labels to visualizations is essential for conveying what is being shown [2, 8].
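
    To make the chart-selection guidance concrete, here is a minimal matplotlib/pandas sketch that draws a column chart, a line chart, and a histogram, each with a title and axis labels. The data and column names are invented for illustration.

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    # Invented sample data.
    sales = pd.DataFrame({"category": ["A", "B", "C"], "revenue": [120, 90, 150]})
    monthly = pd.DataFrame({
        "month":   pd.date_range("2023-01-01", periods=6, freq="MS"),
        "revenue": [100, 110, 95, 130, 125, 140],
    })
    ages = pd.Series([23, 27, 31, 31, 35, 40, 44, 52, 58, 61])

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))

    # Column chart: compare categorical data.
    axes[0].bar(sales["category"], sales["revenue"])
    axes[0].set(title="Revenue by Category", xlabel="Category", ylabel="Revenue")

    # Line chart: show a trend over time.
    axes[1].plot(monthly["month"], monthly["revenue"])
    axes[1].set(title="Monthly Revenue", xlabel="Month", ylabel="Revenue")

    # Histogram: show the distribution of a numerical variable.
    axes[2].hist(ages, bins=5)
    axes[2].set(title="Age Distribution", xlabel="Age", ylabel="Frequency")

    fig.tight_layout()
    plt.show()
    ```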

    Tools and Platforms

    The sources describe a range of tools used to create data visualizations:

    • Tableau: A business intelligence (BI) tool designed for creating interactive data visualizations and dashboards [1].
    • Power BI: A business analytics tool from Microsoft that offers similar capabilities to Tableau for creating visualizations and dashboards [1]. Power BI also has a feature called “conditional formatting,” which lets the user display data visually using elements such as color scales and data bars [9].
    • QuickSight: A cloud-based data visualization service offered by AWS, suitable for creating dashboards and visualizations for various data sources [1, 10].
    • Excel: A tool with built-in charting features for creating basic charts and graphs [1].
    • Python (Pandas, Matplotlib): Python libraries like pandas and matplotlib allow for creating visualizations programmatically [4, 5, 11].
    • Azure Synapse Analytics: This platform offers data visualization options that are integrated with its data warehousing and big data analytics capabilities, so data can be visualized alongside other analytics tasks [12].

    Specific Techniques

    • Marks: These refer to visual elements in charts, such as color, size, text, and detail, that can be adjusted to add information to visualizations [3]. For example, color can be used to represent different categories, while size can be used to represent values.
    • Bins: Bins are groupings or ranges of numerical values used to create histograms and other charts, which can show the distribution of values [1, 3]. (See the Pandas sketch after this list.)
    • Calculated Fields: Calculated fields can be used to create new data fields from existing data, enabling more flexible analysis and visualization [3]. These fields can use operators and functions to derive values from existing columns [1].
    • Conditional Formatting: This technique can be used to apply formatting styles (e.g., colors, icons, data bars) based on the values in the cells of a table. This can be useful for highlighting key trends in your data [9].
    • Drill-downs: These are used to provide additional context and granularity to your visualizations and allow users to look into the next layer of the data [7].
    • Lists: Lists can be used to group together various data points for analysis, which can be visualized within a report or table [2].
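
    Bins and calculated fields can both be expressed in Pandas. In the sketch below, which uses an invented listings table, pd.cut() groups a numerical column into ranges (the same idea behind histogram bins), and a new column derived from existing ones stands in for a calculated field.

    ```python
    import pandas as pd

    listings = pd.DataFrame({
        "price":  [45, 80, 120, 200, 310, 95],
        "nights": [2, 3, 1, 4, 2, 5],
    })

    # Bins: group numerical values into labeled ranges.
    listings["price_bin"] = pd.cut(listings["price"],
                                   bins=[0, 100, 200, 400],
                                   labels=["budget", "mid", "premium"])

    # Calculated field: derive a new value from existing columns.
    listings["total_cost"] = listings["price"] * listings["nights"]

    # The binned column can then feed a chart or a grouped summary.
    print(listings.groupby("price_bin", observed=False)["total_cost"].mean())
    ```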

    Best Practices

    • Simplicity: Simple, clear visualizations are more effective than complex ones. It’s best to avoid clutter and make sure that the visualization focuses on a single message [9].
    • Context: Visualizations should provide sufficient context to help users understand the data, including axis labels, titles, and legends [2, 3].
    • Appropriate Chart Type: Select the most suitable chart for the type of data being displayed [1].
    • Interactivity: Include interactive elements such as filters and drill-downs to allow users to explore the data at different levels [7].
    • Accessibility: Ensure that visualizations are accessible, including appropriate color choices and sufficient text labels [3, 9].
    • Audience: The intended audience and purpose of the visualization should also be taken into account [3].

    In summary, data visualization is a critical aspect of data analysis that involves using charts, graphs, and other visual aids to convey information effectively. By selecting appropriate chart types, incorporating interactive elements, and following best practices for design, data professionals can create compelling visualizations that facilitate insights and inform decision-making [1].

    Ultimate Data Analyst Bootcamp [24 Hours!] for FREE | SQL, Excel, Tableau, Power BI, Python, Azure

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog