Pandas: Data Manipulation, Filtering, Indexing, and Grouping Essentials

The source material presents a comprehensive guide to using the Pandas library in Python. It covers fundamental concepts like importing data from various file formats (CSV, text, JSON, Excel) into DataFrames, and provides instruction on cleaning, filtering, sorting, and indexing data. It also highlights the groupby function, merging DataFrames, and creating visualizations, and teaches how to conduct exploratory data analysis, identifying patterns and outliers within a dataset.

Pandas Data Manipulation: A Comprehensive Study Guide

I. Quiz

Answer the following questions in 2-3 sentences each.

  1. What is a Pandas DataFrame, and why is the index important?
  2. Explain how to read a CSV file into a Pandas DataFrame, including handling potential Unicode errors.
  3. Describe how to read a text file into a Pandas DataFrame using read_table and specify a separator.
  4. How can you specify column names when reading a CSV file if the file doesn’t have headers?
  5. Explain how to filter a Pandas DataFrame based on values in a specific column.
  6. Describe the difference between loc and iloc when filtering data in a Pandas DataFrame using the index.
  7. Explain how to sort a Pandas DataFrame by multiple columns, specifying the sorting order for each.
  8. How do you create a MultiIndex in a Pandas DataFrame, and how does it affect data access?
  9. Describe how to group data in a Pandas DataFrame using the groupby function and calculate the mean of each group.
  10. Explain the different types of joins available in Pandas, including inner, outer, left, and right joins.

II. Answer Key

  1. A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. The index is crucial because it provides a way to access, filter, and search data within the DataFrame, acting as a label for each row.
  2. To read a CSV file, use pd.read_csv('file_path'). To handle Unicode errors, prepend the file path with r (e.g., pd.read_csv(r'file_path')) to read the path as a raw string, preventing misinterpretation of backslashes.
  3. Use pd.read_table('file_path', sep='delimiter') to read a text file into a DataFrame. The sep argument specifies the separator between columns in the text file (e.g., sep='\t' for tab-separated).
  4. To specify column names when a CSV lacks headers, use pd.read_csv('file_path', header=None, names=['col1', 'col2', …]). This sets header=None to prevent Pandas from using the first row as headers and then assigns names using the names parameter.
  5. To filter by column values, use boolean indexing: df[df['column_name'] > value]. This selects rows where the condition inside the brackets is True.
  6. loc filters by label, using the actual index value (string, number, etc.) to select rows and columns. iloc filters by integer position, using the row and column number (starting from 0) to select data.
  7. To sort by multiple columns, use df.sort_values(by=['col1', 'col2'], ascending=[True, False]). The by argument takes a list of column names, and ascending takes a list of boolean values specifying the sorting order for each column.
  8. A MultiIndex is created using df.set_index(['col1', 'col2']), creating a hierarchical index. It allows you to select values based on one or more index levels (using .loc).
  9. Use df.groupby('column_name').mean() to group data by a column and calculate the mean of each group. This groups rows with the same value in 'column_name' and computes the mean of the numeric columns for each group.
  10. Pandas supports four main join types, selected with the how parameter of merge():
  • Inner: Returns rows with matching values in both DataFrames.
  • Outer: Returns all rows from both DataFrames, filling in missing values with NaN.
  • Left: Returns all rows from the left DataFrame and matching rows from the right, filling in missing values with NaN.
  • Right: Returns all rows from the right DataFrame and matching rows from the left, filling in missing values with NaN.
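The four join types above can be sketched with pd.merge; the DataFrames and column names here are hypothetical and exist only to show how unmatched keys behave.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [10, 20, 30]})

inner = pd.merge(left, right, on="key", how="inner")    # only keys b, c
outer = pd.merge(left, right, on="key", how="outer")    # all keys; NaN where unmatched
left_j = pd.merge(left, right, on="key", how="left")    # keys a, b, c
right_j = pd.merge(left, right, on="key", how="right")  # keys b, c, d
```

Key "a" appears only in the left frame, so its rval is NaN in the outer and left joins; key "d" behaves the same way on the right side.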

III. Essay Questions

  1. Discuss the importance of data cleaning in the data analysis process, providing specific examples of cleaning techniques relevant to the source material.
  2. Compare and contrast the different methods for filtering and sorting data in Pandas DataFrames, illustrating the use cases for each method.
  3. Explain the concept of indexing in Pandas and how MultiIndexing can be used to organize and access complex datasets.
  4. Describe how you can perform exploratory data analysis using Pandas and relevant libraries, and why it is important.
  5. Explain the concept of joining in Pandas and how different types of joins can be used to combine related data from multiple sources.

IV. Glossary of Key Terms

  • DataFrame: A two-dimensional labeled data structure in Pandas, similar to a table, with columns of potentially different types.
  • Series: A one-dimensional labeled array in Pandas, capable of holding any data type.
  • Index: A label for each row in a Pandas DataFrame or Series, used for data alignment and selection.
  • MultiIndex: A hierarchical index in Pandas, allowing multiple levels of indexing on a DataFrame.
  • NaN (Not a Number): A standard missing data marker used in Pandas.
  • Filtering: Selecting a subset of rows from a DataFrame based on specified conditions.
  • Sorting: Arranging rows in a DataFrame in a specific order based on the values in one or more columns.
  • Grouping: Aggregating data in a DataFrame based on the values in one or more columns.
  • Joining: Combining data from two or more DataFrames based on a common column or index.
  • Inner Join: Returns rows with matching values in both DataFrames.
  • Outer Join: Returns all rows from both DataFrames, filling in missing values with NaN.
  • Left Join: Returns all rows from the left DataFrame and matching rows from the right, filling in missing values with NaN.
  • Right Join: Returns all rows from the right DataFrame and matching rows from the left, filling in missing values with NaN.
  • Concatenation: Appending or merging DataFrames together, either horizontally or vertically.
  • Aggregation: Computing summary statistics (e.g., mean, sum, count) for groups of data.
  • Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often with visual methods.
  • Unicode Error: An error that occurs when reading a file with characters that are not properly encoded.
  • loc: A Pandas method used to access rows and columns by label.
  • iloc: A Pandas method used to access rows and columns by integer position.
  • Lambda Function: A small anonymous function defined using the lambda keyword.
  • Heatmap: Data visualization that uses a color-coded matrix to represent the correlation between variables.
  • Box Plot: A graphical representation of the distribution of data showing the minimum, first quartile, median, third quartile, and maximum values, as well as outliers.

Pandas Python Data Analysis Tutorial Series

The following briefing document summarizes the main themes and ideas from the provided text excerpts, which appear to be transcripts of a series of video tutorials on using the Pandas library in Python for data analysis.

Briefing Document: Pandas Tutorial Series Overview

Main Theme:

This series of tutorials focuses on teaching users how to leverage the Pandas library in Python for various data manipulation, analysis, and visualization tasks. The content covers a range of essential Pandas functionalities, from basic data input and output to more advanced techniques like filtering, grouping, data cleaning, and exploratory data analysis.

Key Ideas and Concepts:

  1. Introduction to Pandas and DataFrames:
  • Pandas is imported using the alias pd: “we are going to say import and we’re going to say pandas now this will import the Panda’s library but it’s pretty common place to give it an alias and as a standard when using pandas people will say as PD”
  • Data is stored and manipulated within Pandas DataFrames.
  • DataFrames have an index, which is important for filtering and searching: “as you can see right here there’s this index and that’s really important in a data frame it’s really what makes a data frame a data frame and we use index a lot in pandas we’re able to filter on the index search on the index and a lot of other things”
  • The distinction between a Series and a DataFrame is mentioned, suggesting that this will be covered in more detail in a later video.
  2. Data Input/Output:
  • Pandas can read data from various file formats, including CSV, text, JSON, and Excel.
  • The pd.read_csv(), pd.read_table(), pd.read_json(), and pd.read_excel() functions are used to import data.
  • Specifying the file path is crucial. The tutorial demonstrates how to copy the file path: “you have this countries of the world CSV you just need to click on it and right click and copy as path and that’s literally going to copy that file path for us so you don’t have to type it out manually”
  • The r prefix is used before a file-path string so it is read as raw text (backslashes are not treated as escape characters).
  • The sep parameter allows specifying delimiters for text files (e.g., sep='\t' for tab-separated data): “we need to use a separator and I’ll show you in just a little bit how we can do this in a different way but with that read CSV this is how we can do it we’ll just say sep is equal to we need to do back SLT now let’s try running this and as you can see it now has it broken out into country and region”
  • Headers can be specified or skipped during import using the header parameter.
  • Column names can be manually assigned using the names parameter when the file doesn’t contain headers or when renaming is desired.
  • Imported DataFrames should typically be assigned to a variable (e.g., df) for later use.
  3. Data Inspection:
  • df.info() provides a summary of the DataFrame, including column names, data types, and non-null counts: “we’re going to bring data Frame 2 right down here and we want to take a look at some of this data we want to know a little bit more about it something that you can do is data frame 2. info and we’ll do an open parenthesis and when we run this it’s going to give us a really quick breakdown of a little bit of our data”
  • df.shape returns the number of rows and columns in a DataFrame.
  • df.head(n) displays the first n rows of the DataFrame.
  • df.tail(n) displays the last n rows of the DataFrame.
  • Specific columns can be accessed using bracket notation (e.g., df['ColumnName']).
  • loc and iloc are used for accessing data by label and by integer position, respectively.
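The inspection methods above can be sketched on a small hypothetical DataFrame standing in for an imported CSV:

```python
import pandas as pd

# Hypothetical data in place of the countries-of-the-world CSV.
df = pd.DataFrame({"Country": ["Algeria", "Brazil", "Canada"],
                   "Rank": [3, 1, 2]})

df.info()                    # column names, dtypes, non-null counts
print(df.shape)              # (rows, columns)
print(df.head(2))            # first 2 rows
print(df.tail(1))            # last row
print(df["Country"])         # one column via bracket notation
print(df.loc[0, "Country"])  # by label: row label 0, column "Country"
print(df.iloc[0, 0])         # by position: first row, first column
```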
  4. Filtering and Ordering:
  • DataFrames can be filtered based on column values using comparison operators (e.g., df['Rank'] < 10).
  • The isin() function allows filtering based on a list of specific values within a column.
  • The str.contains() function allows filtering for rows where a column contains a specific string.
  • The filter() function can be used to select columns based on a list of items or to filter rows based on index values using the like parameter.
  • sort_values() is used to order DataFrames by one or more columns. Ascending or descending order can be specified.
  • Multiple sorting criteria can be specified by passing a list of column names to sort_values().
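The filtering and sorting techniques above can be sketched together; the country names and ranks are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "Brazil", "Bangladesh"],
    "Rank": [4, 12, 7, 9],
})

top10 = df[df["Rank"] < 10]                              # comparison filter
pair = df[df["Country"].isin(["Brazil", "Bangladesh"])]  # membership filter
united = df[df["Country"].str.contains("United")]        # substring filter
ordered = df.sort_values(by="Rank", ascending=True)      # sort by one column
```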
  5. Indexing:
  • The index is an important component of a DataFrame and can be customized.
  • The set_index() function allows setting a column as the index. Passing inplace=True applies the change to the existing DataFrame rather than returning a copy.
  • The reset_index() function reverts the index to the default integer index.
  • Multi-indexing allows for hierarchical indexing using multiple columns.
  • sort_index() sorts the DataFrame based on the index.
  • loc and iloc are used for accessing data based on the index. loc uses the string/label of the index, iloc uses the integer position.
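A minimal sketch of the indexing workflow above, using invented continent/country data:

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Europe"],
    "Country": ["Algeria", "Angola", "Austria"],
    "Rank": [3, 2, 1],
})

df.set_index(["Continent", "Country"], inplace=True)  # MultiIndex, saved in place
df.sort_index(inplace=True)                           # sort by the index

africa = df.loc["Africa"]     # label-based: all rows under "Africa"
first_row = df.iloc[0]        # position-based: first row regardless of labels
df_reset = df.reset_index()   # back to the default integer index
```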
  6. Grouping and Aggregating:
  • groupby() groups rows based on the unique values in one or more columns. This creates a GroupBy object.
  • Aggregate functions (e.g., mean(), count(), min(), max(), sum()) can be applied to GroupBy objects to calculate summary statistics for each group.
  • The agg() function allows applying multiple aggregate functions to one or more columns simultaneously using a dictionary to specify the functions for each column.
  • Grouping can be performed on multiple columns to create more granular groupings.
  • The describe() function is a convenient shortcut that provides a high-level statistical summary (count, mean, std, quartiles) of the numeric columns.
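The grouping and aggregation steps above can be sketched as follows; the flavor/rating data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "flavor": ["vanilla", "vanilla", "chocolate", "chocolate"],
    "rating": [8.0, 9.0, 7.0, 10.0],
})

means = df.groupby("flavor").mean()                           # one mean per group
stats = df.groupby("flavor").agg({"rating": ["mean", "max", "count"]})
summary = df.describe()                                       # quick overall stats
```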
  7. Merging and Joining DataFrames:
  • merge() combines DataFrames based on shared columns or indices. It’s analogous to SQL joins.
  • Different types of joins (inner, outer, left, right, cross) can be performed using the how parameter.
  • Suffixes can be specified to differentiate columns with the same name in the merged DataFrame.
  • join() is another function for combining DataFrames, but it can be more complex to use than merge().
  • Cross joins create a Cartesian product of rows from both DataFrames.
  8. Data Visualization:
  • Pandas integrates with Matplotlib for basic plotting.
  • The plot() function creates various types of plots, including line plots, bar plots, scatter plots, histograms, box plots, area plots, and pie charts, based on the kind parameter.
  • subplots=True creates separate subplots for each column.
  • Titles and labels can be added to plots using the title, xlabel, and ylabel parameters.
  • Bar plots can be stacked using stacked=True.
  • scatter() plots require specifying both x and y column names.
  • Histogram bins can be adjusted using the bins parameter.
  • Figure size can be adjusted to increase the visualization’s scale.
  • Matplotlib styles can be used to modify the appearance of plots.
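A minimal plotting sketch using the parameters named above; the data is invented, and the non-interactive Agg backend is selected so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window is opened
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10, 15, 12]})

# kind selects the plot type; title/xlabel/ylabel label the figure.
ax = df.plot(kind="line", x="year", y="sales",
             title="Sales by year", xlabel="Year", ylabel="Sales")
```

Swapping kind for "bar", "scatter" (with x and y), or "hist" (with bins=...) produces the other chart types mentioned above.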
  9. Data Cleaning:
  • Data cleaning involves handling missing values, inconsistencies, and formatting issues.
  • .str.strip() removes leading and trailing characters (whitespace by default); .str.lstrip() removes only leading characters, and .str.rstrip() only trailing ones.
  • .str.replace() replaces specific substrings within strings.
  • Regular expressions can be used with .str.replace() (via regex=True) for more complex pattern matching. Inside a character class, the caret (^) matches any character except those listed.
  • apply() applies a function to each element of a column (often used with lambda functions).
  • Data types can be changed using astype().
  • fillna() fills missing values with a specified value.
  • pd.to_datetime() converts columns to datetime objects.
  • drop_duplicates() removes duplicate rows.
  • The inplace=True parameter modifies the DataFrame directly.
  • Columns can be split into multiple columns using string.split() with the expand=True parameter.
  • Boolean columns can be replaced with ‘yes’ and ‘no’ values to standardize responses.
  • isna() or isnull() identifies missing values.
  • drop() removes rows or columns based on labels or indices. Separately, reset_index(drop=True) discards the former index instead of re-inserting it as a column.
  • dropna() removes rows with missing values.
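Several of the cleaning steps above can be sketched on a small invented customer list (names, columns, and values are all hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  Alice/Smith...", "Bob/Jones", "Bob/Jones"],
    "paid": [None, "Y", "Y"],
})

df = df.drop_duplicates()                       # remove the repeated row
# Strip whitespace, then use a regex character class with ^ to drop
# everything except letters and the slash separator.
df["name"] = df["name"].str.strip().str.replace(r"[^a-zA-Z/]", "", regex=True)
# Split one column into two, keeping both pieces with expand=True.
df[["first", "last"]] = df["name"].str.split("/", expand=True)
# Fill missing values, then standardize the responses.
df["paid"] = df["paid"].fillna("N").replace({"Y": "Yes", "N": "No"})
```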
  10. Exploratory Data Analysis (EDA):
  • EDA involves exploring the data to identify patterns, relationships, and outliers.
  • Libraries: pandas (pd), Seaborn (sns), Matplotlib (plt).
  • info() and describe() provide high-level summaries of the data.
  • The float display format can be adjusted via pd.set_option (e.g., pd.set_option('display.float_format', '{:.2f}'.format)).
  • isnull().sum() counts missing values in each column.
  • nunique() shows the number of unique values in each column.
  • sort_values() sorts the data based on specific columns.
  • corr() calculates the correlation matrix, showing the relationships between numeric columns.
  • Heatmaps (using Seaborn) visualize the correlation matrix.
  • Grouping (groupby()) and aggregation help understand data distributions and relationships across groups.
  • Transposing DataFrames (transpose()) can be useful for plotting group means.
  • Box plots visualize the distribution of data and identify outliers.
  • select_dtypes() filters columns based on data type.
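The pandas-only portion of the EDA steps above can be sketched like this (the gdp/population numbers are invented; the Seaborn heatmap is omitted to keep the example self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "gdp": [1.0, 2.0, None, 4.0],
    "population": [10, 20, 30, 40],
})

missing = df.isnull().sum()    # missing values per column
unique_counts = df.nunique()   # unique values per column
corr = df.corr()               # pairwise correlation of numeric columns
```

With this data, gdp and population are perfectly linearly related on the non-null rows, so their correlation is 1.0.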

Target Audience:

The tutorial series is designed for individuals who want to learn data analysis and manipulation using Python and the Pandas library, regardless of their prior experience with data science.

Overall Impression:

The series appears to be a comprehensive introduction to Pandas, covering a wide range of essential topics in a practical, hands-on manner. The instructor emphasizes best practices, common pitfalls, and useful techniques for working with real-world datasets. The inclusion of practical examples and visual aids helps make the learning process more engaging and effective.

Pandas DataFrame: Common Operations and FAQs

Frequently Asked Questions About Pandas Based on Provided Sources

Here are some frequently asked questions (FAQs) about using the Python Pandas library, based on the provided text excerpts.

1. How do I import the Pandas library and what is the standard alias?

To import the Pandas library, you use the statement import pandas. It’s common practice to give it the alias pd, like this: import pandas as pd. This allows you to refer to Pandas functions and objects using the shorter pd. prefix, which is a widely accepted convention in the Pandas community.

2. How do I read different file types (CSV, text, JSON, Excel) into Pandas DataFrames?

Pandas provides specific functions for reading various file formats:

  • CSV: pd.read_csv("file_path.csv")
  • Text: pd.read_table("file_path.txt") (often requires specifying a separator, e.g., sep="\t" for tab-separated files)
  • JSON: pd.read_json("file_path.json")
  • Excel: pd.read_excel("file_path.xlsx") (can specify a sheet name using sheet_name="Sheet1")

You typically assign the result of these functions to a variable (e.g., df = pd.read_csv(…)) to create a DataFrame object, making it easier to work with the data later.

3. What is a Pandas DataFrame and why is the index important?

A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a table with rows and columns. The index is a crucial component of a DataFrame; it provides labels for the rows. The index allows you to filter, search, and select data based on these labels. By default, Pandas creates a numerical index (0, 1, 2, …), but you can set a specific column as the index for better data access.

4. How can I handle Unicode errors when reading files?

When reading files with backslashes in the file path, you might encounter Unicode errors. To resolve this, prepend r to the file path string to treat it as a raw string. For example: pd.read_csv(r"C:\path\to\file.csv"). This ensures that backslashes are interpreted literally and not as escape characters.

5. How can I deal with files that don’t have column headers, or if I want to rename headers?

When reading files, Pandas may automatically infer column names from the first row. You can override this behavior using the header argument: header=None tells Pandas there are no existing headers, so the first row is treated as data. You can then specify custom column names using the names argument, passing it a list of strings representing the new column names.
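A quick sketch of header=None plus names, using io.StringIO to inline a headerless CSV so the example needs no file on disk (the data itself is invented):

```python
import io
import pandas as pd

# Two data rows, no header row.
raw = io.StringIO("Algeria,Africa\nAustria,Europe\n")

df = pd.read_csv(raw, header=None, names=["Country", "Region"])
```

Without header=None, pandas would have consumed "Algeria,Africa" as the column names.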

6. How can I filter data within Pandas DataFrames?

You can filter rows in a DataFrame based on column values using comparison operators (>, <, ==, etc.) or functions:

  • Filtering by Column Value: df[df["column_name"] > 10] returns rows where the value in "column_name" is greater than 10.
  • Using isin(): df[df["country"].isin(["Bangladesh", "Brazil"])] returns rows where the "country" column contains either "Bangladesh" or "Brazil".
  • Using str.contains(): df[df["country"].str.contains("United")] returns rows where the "country" column contains the string "United".

7. How can I sort and order data within Pandas DataFrames?

Use the sort_values() method to sort a DataFrame by one or more columns. The by argument specifies the column(s) to sort by. ascending=True (default) sorts in ascending order, while ascending=False sorts in descending order. You can sort by multiple columns by providing a list to the by argument. The order of columns in this list determines the sorting priority. You can also specify different ascending/descending orders for different columns by providing a list of boolean values to the ascending argument.
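The multi-column sort described above can be sketched as follows; the region/rank data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Europe", "Africa", "Africa", "Europe"],
    "Rank": [2, 4, 1, 3],
})

# Sort by Region ascending first, then Rank descending within each region.
ordered = df.sort_values(by=["Region", "Rank"], ascending=[True, False])
```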

8. How can I perform groupby aggregations in Pandas?

The groupby() method groups rows based on unique values in one or more columns. You can then apply aggregate functions (e.g., mean(), count(), min(), max(), sum()) to the grouped data.

df.groupby("base_flavor").mean() # Mean ratings grouped by base flavor

You can use the agg() method to apply multiple aggregations to different columns simultaneously. The argument to agg() is a dictionary where keys are column names and values are lists of aggregation functions:

df.groupby("base_flavor").agg({"flavor_rating": ["mean", "max", "count"], "texture_rating": ["mean", "max", "count"]})

Pandas Library: Data Analysis with Python

The Pandas library in Python is a tool for data analysis, offering data structures like DataFrames and Series.

Key aspects of Pandas:

  • Alias When importing the Pandas library, it is common to use the alias pd.
  • DataFrames Pandas reads data into DataFrames, which differ from standard Python data structures. When importing files with Pandas, the data is loaded as a DataFrame. The index is an important component of a DataFrame, enabling filtering and searching. Assigning a DataFrame to the variable name df is a common practice.
  • Series The next video in this series will explain what series are.
  • File Reading Pandas can read various file types such as CSV, text, JSON, and Excel. The specific function used depends on the file type (e.g., read_csv, read_table, read_json, read_excel).
  • File Paths File paths can be copied and pasted into the read function. To avoid Unicode errors, raw text reading may be necessary.
  • Arguments When reading files, arguments can be specified, such as the file path or separator.
  • Display Options Pandas allows you to adjust the display settings to show more rows and columns.
  • Data Inspection You can use .info() to get a quick breakdown of the data, .shape to see the dimensions (rows, columns), .head() and .tail() to view the first or last few rows, and column names to select specific columns.
  • Filtering and Ordering DataFrames can be filtered based on column values, specific values, or string content. The isin() function is available to check specific values. Data can be filtered by index using .filter(), .loc[], and .iloc[]. Data can be sorted using .sort_values() and .sort_index().
  • Indexing The index is customizable and allows for searching and filtering. The index can be set using set_index(). Multi-level indexing is supported.
  • Group By Pandas provides the groupby function to group together rows sharing a value in a column and display one row per group. You can then perform aggregate functions on those groupings; the agg() (aggregate) method accepts a dictionary mapping columns to aggregation functions.
  • Merging, Joining, and Concatenating Pandas enables combining DataFrames through merging, joining, and concatenating.
  • Visualizations Pandas allows you to build visualizations such as line plots, scatter plots, bar charts, and histograms.
  • Cleaning Data Pandas is equipped with tools for data cleaning, including removing duplicates (drop_duplicates), dropping unnecessary columns (drop), and handling inconsistencies in data. The .fillna() function fills empty values.
  • Exploratory Data Analysis (EDA) Pandas is used for exploratory data analysis, which involves identifying patterns, understanding relationships, and detecting outliers in a dataset. EDA includes using .info() and .describe() to get a high-level overview of the data. Correlations between columns can be identified using .corr() and visualized with heatmaps.

Pandas DataFrames: Features, Functionalities, and Data Analysis

Pandas DataFrames are a central data structure in the Pandas library, crucial for data analysis in Python.

Key features and functionalities of DataFrames:

  • Definition A DataFrame is the structure Pandas reads data into, distinct from built-in Python data structures.
  • Usual variable name Assigning a DataFrame to the variable name df is a common practice.
  • Indexing The index is a customizable and important component, enabling filtering and searching. The index can be set using set_index().
  • Filtering and Ordering DataFrames can be filtered based on column values, specific values using isin(), or string content. Data can be filtered by index using .filter(), .loc[], and .iloc[]. Data can be sorted using .sort_values() and .sort_index().
  • Display Options Pandas allows adjusting display settings to show more rows and columns.
  • Data Inspection Tools like .info() provide a breakdown of the data. The .shape shows dimensions. Methods such as .head() and .tail() allow viewing the first or last few rows.
  • Merging, Joining, and Concatenating Pandas enables combining DataFrames through merging, joining, and concatenating.
  • Cleaning Data Pandas is equipped with tools for data cleaning, including removing duplicates (drop_duplicates), dropping unnecessary columns (drop), and handling inconsistencies in data. The .fillna() function fills empty values.
  • Exploratory Data Analysis Pandas is used for exploratory data analysis, including using .info() and .describe() to get a high-level overview of the data. Correlations between columns can be identified using .corr() and visualized with heatmaps.
  • File Reading When reading files using Pandas, the data is called in as a data frame.

Pandas: Data Import Guide

Pandas can import data from a variety of file types. When the files are imported using Pandas, the data is read in as a data frame. The specific function used depends on the file type.

Types of files that Pandas can read:

  • CSV
  • Text
  • JSON
  • Excel

Functions for reading different file types:

  • read_csv
  • read_table
  • read_json
  • read_excel

Key considerations when importing files:

  • File Paths The file path needs to be specified, and can be copied and pasted into the read function.
  • Raw Text Reading Raw text reading may be necessary to avoid Unicode errors. To specify raw text reading, use r before the file path.
  • Arguments When reading files, arguments can be specified, such as the file path or separator.
  • Alias When importing the Pandas library, it is common to use the alias pd.
  • Headers The header argument can be used to rename headers or specify that there is no header in the CSV. The default behavior is to infer column names from the first row. You can set header=None if there are no column names, in which case numeric column labels (0, 1, 2, …) are used instead.
  • Separator When reading in a file, you can specify the separator. When pulling in a CSV, it will automatically assume that the separator is a comma. When importing text files, you may need to specify the separator.
  • Missing Data When merging data, if a value doesn’t have a match, it will return NaN.
  • Sheet names When importing Excel files, you can specify a sheet name to read in a specific sheet, otherwise it will default to the first sheet in the file.

Filtering Pandas DataFrames

Pandas DataFrames can be filtered in a variety of ways.

Filtering Based on Column Values

  • You can filter DataFrames based on the data within their columns. To do this, specify the column to filter on. Comparison operators, such as greater than or less than, can be used.
  • Specific values can be specified.

Filtering Based on Index

  • You can also filter based on the index.
  • The main ways to filter by index are the .filter() function and the .loc[] and .iloc[] indexers.

The .filter() Function

  • With .filter() you can specify which columns to keep by using items = and then listing the columns.
  • By default, .filter() chooses the axis for you, but you can also specify the axis.
  • You can also use like = to specify a string, and it will filter by the indexed values that contain that string.
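The .filter() behaviors above can be sketched on a small hypothetical DataFrame whose index holds country names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Country": ["Algeria", "Angola", "Brazil"], "Rank": [3, 2, 1]},
    index=["Algeria", "Angola", "Brazil"],
)

by_items = df.filter(items=["Country"])  # keep only the listed columns
by_like = df.filter(like="Al", axis=0)   # rows whose index label contains "Al"
```

Only "Algeria" contains the substring "Al", so by_like has a single row.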

The .loc[] and .iloc[] Indexers

  • .loc[] looks at the actual name or value.
  • .iloc[] looks at the integer location.
  • With multi-indexing, .loc[] can specify values at each index level, whereas .iloc[] works on the underlying integer-based positional index.
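The .loc[]/.iloc[] contrast above can be sketched on an invented multi-indexed DataFrame, where both indexers reach the same row by different routes:

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Europe"],
    "Country": ["Algeria", "Angola", "Austria"],
    "Rank": [3, 2, 1],
}).set_index(["Continent", "Country"])

by_label = df.loc[("Africa", "Angola")]  # navigate by the index labels
by_position = df.iloc[1]                 # second row by integer position
```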

Pandas DataFrame Sorting: Values and Index

Pandas DataFrames can be ordered using the .sort_values() and .sort_index() functions.

Sorting by Values (.sort_values())

  • The .sort_values() function allows you to sort a DataFrame based on the values in one or more columns.
  • Specify the column(s) to sort by using the by parameter.
  • Determine the sorting order using the ascending parameter, which can be set to True (ascending) or False (descending). The default is ascending.
  • Multiple columns can be specified for sorting by passing a list of column names to the by parameter. The order of importance in sorting is determined by the order of columns in the list.
  • You can specify different ascending/descending orders for each column when sorting by multiple columns by passing a list of boolean values to the ascending parameter.
  • Example: To sort a DataFrame by the 'Rank' column in ascending order: df.sort_values(by='Rank', ascending=True).

Sorting by Index (.sort_index())

  • The .sort_index() function sorts the DataFrame based on its index.
  • You can specify the axis to sort on and whether the order is ascending or not.

Learn Pandas in Under 3 Hours | Filtering, Joins, Indexing, Data Cleaning, Visualizations

The Original Text

what’s going on everybody welcome back to another video today we are going to be learning pandas in under 3 [Music] hours so in this lesson we’re going to cover a ton of things as well as some projects at the very end you’re going to learn how you can read data into pandas and actually store it in a data frame we’ll be filtering quering grouping and a ton of other things just on that data and then we’ll be diving into Data visualization data cleaning exploratory data analysis and a ton more so without further Ado letun them on my screen and get started so the first thing that we need to do is import our pandas Library so we’re going to say import and we’re going to say pandas now this will import the Panda’s library but it’s pretty common place to give it an alias and as a standard when using pandas people will say as PD so this is just a quick Alias that you can use uh that’s what I always use and I’ve always used it because that’s how I learned it and I want to teach it to you the right way so that’s how we’re going to do it in this video so let’s hit shift enter now that that is imported we can start reading in our files now right down here I’m going to open up my file explorer and we have several different types of files in here we have CSV files text files Json files and an Excel worksheet which is a little bit different than a CSV so we’re going to import all of those I’m going to show you how to import it as well as some of the different things that you need to be aware of when you’re importing so we’re going to import some of those different file types and I’ll show you how to do that within pandas so the first thing that we need to say is PD Dot and let’s read it in a CSV because that’s a pretty common one we’ll say read CSV and this is literally all you have to write in order to call that in now it’s not going to call it in as a string like it would in one of our previous videos if you’re just using the regular operating system of python when you’re using 
Pandas reads it in as a DataFrame, and I'll talk about some of the nuances of that. Let's go down to our file explorer. We have this countries of the world CSV — you just need to right-click on it and choose "Copy as path," and that's literally going to copy the file path for us so you don't have to type it out manually (you can if you'd like). We'll paste it in between these parentheses. Now, if we run it right now, it will not work — I'll show you. It's saying we have this Unicode error. Basically what's happening is it's reading in these backslashes, this colon, and the period at the end, and misinterpreting the backslashes as escape sequences. What we need to do is read this in as raw text, so we're just going to prefix the string with r, and now it reads the path as a literal string, which does make a big difference. When we run this, it populates our very first DataFrame — so let's go ahead and run it, and now we have this CSV in here with our Country and Region columns.

Now let's pull up this file really quickly — this countries of the world CSV. Pandas automatically used the first row as headers in the DataFrame, but the file doesn't have any column for those 0, 1, 2, 3 values. If we go back, as you can see right here, there's this index, and that's really important — it's really what makes a DataFrame a DataFrame. We use the index a lot in pandas: we can filter on the index, search on the index, and a lot of other things, which I'll show you in future videos. But this is basically how you read in a file.

Now, if we go right up here in between these parentheses and hit Shift+Tab, this comes up for us — let's hit this plus button. These are all the arguments, all the things we can specify when we're reading in a file, and there are a lot of different options, so let's take a look.

Really quickly, I wanted to give a huge shout-out to the sponsor of this entire pandas series, and that is Udemy. Udemy has some of the best courses at the best prices, and it is no exception when it comes to pandas courses. If you want to master pandas, this is the course that I would recommend — it's going to teach you just about everything you need to know about pandas. So huge shout-out to Udemy for sponsoring this pandas series, and let's get back to the video.

The first thing is obviously the file path. We can also specify a separator; when we're reading in a CSV, it assumes a comma, because it's a comma-separated file. You can choose delimiters, headers, names, index columns, and a lot of other things, as you can see right here. Now, I will say I use almost none of these — the few I'm going to show you in just a second are at the very top — but you can do a ton of different things. You can also go down here to the docstring and see exactly how these parameters work; it walks you through how to use them. Again, most of these you'll probably never use, but things like a separator could be useful, and things like a header could be useful, because it's possible you want to rename your headers, or you don't have a header row in your CSV and you don't want pandas to auto-populate one. So, for example, this header parameter: the default behavior is to infer the column names, and if no names are passed, that behavior is identical to header=0 — the first row (index zero, right here) is read in as the header. But we can come right over here and add header=None, and as you can see, there are no headers anymore — the first row of data is no longer treated as a header.
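The raw-string fix above can be sketched as a minimal, runnable example. The Windows path is purely illustrative, and an in-memory CSV (via io.StringIO) stands in for the real file so the snippet runs anywhere:

```python
import io
import pandas as pd

# A Windows path like "C:\Users\..." breaks inside a normal string literal,
# because "\U" starts a unicode escape sequence. A raw string (r"...") keeps
# every backslash literal. (Hypothetical path, for illustration only.)
path = r"C:\Users\alex\Downloads\countries of the world.csv"

# Simulate the CSV contents so the example runs without the real file.
csv_text = "Country,Region\nAfghanistan,ASIA\nAlbania,EASTERN EUROPE\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)           # (2, 2)
print(list(df.columns))   # ['Country', 'Region']
```

With the real file you would simply pass the raw-string path straight to `pd.read_csv(path)`.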
Instead, we get another integer index, so now we have indexes on both the x-axis and the y-axis, and this 0 and 1 index indicates the first column and the second column. If we want to supply those names ourselves, we can keep header=None and then say names= and give it a list. The first one was Country, and what's that second one? Oh, Region. So right here, that first row becomes data, and we rename the columns ourselves to Country and Region. When we run that, we've now populated Country and Region — we're just pretending our CSV doesn't have those headers in it and we have to name them ourselves. That's how you do it. But let's get rid of all that, because we actually do want those headers, so we'll remove those arguments and read it in as normal. There we go.

Now, typically when you're reading in a file, you want to assign it to a variable. Almost always — in any tutorial, anywhere online, or even when you're actually working — people will write df = ..., where df stands for DataFrame. In the next video in the series I'm going to walk through what a Series is as well as what a DataFrame is, because that's pretty important to know when you're working with these. We'll assign it to this variable, then call it by saying df and run it. That's typically how you'll do things, because you want to save the DataFrame so that later on you can do things like df. and call different methods on it — that's not nearly as easy if you're re-importing the entire CSV every time.

So let's copy this, because now we're going to import a different type of file. We've been using read_csv, but we can also import text files. Let's look at this one: it's the same countries of the world data, except now it's a text file.
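The header=None and names= behavior described above can be sketched like this, again with a tiny in-memory file standing in for the real headerless CSV:

```python
import io
import pandas as pd

# A sample "headerless" CSV: the first row is already data.
csv_text = "Afghanistan,ASIA\nAlbania,EASTERN EUROPE\n"

# header=None: nothing is treated as a header; columns default to 0, 1, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(df.columns))   # [0, 1]

# names=[...] supplies our own column labels instead.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=["Country", "Region"])
print(list(df.columns))   # ['Country', 'Region']
print(len(df))            # 2 -- both rows kept as data
```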
I just converted it for this video, so I'll copy that as a path, and now when we do this — oops, let me get those quotes in there — it says world.txt. It will still run, but as you can see, it did not import properly: we have this Country\tRegion header, and all of our values look the same, with this \t in the middle. That's because we need to specify a separator. I'll show you in a little bit how to do this a different way, but with read_csv, we can just say sep="\t" for a tab. Now let's try running this, and as you can see, it's now broken out into Country and Region.

We could also do it the more proper way — and this is the way you should do it; I'll get rid of these in a moment, but I just want to keep them there in case you want to see them — which is read_table. Let's get rid of this separator, so now we have no separator, and just read it in as a table. Let's run this, and it reads in properly the first time. read_table can be used for tons of different data types, but typically I've been using it for text files. We can also read in that CSV with it — let's change this right here to the CSV — but just like when we read the text file with read_csv, with read_table you're going to need to specify the separator, so I'll just copy this and say sep=",", and now it reads in properly. Again, you can use it for a ton of different file types; you just need to specify a few more things if you don't want to use the more specific read_ function for that format.

Now let's copy this again and go right down here, and let's do JSON files. JSON files usually hold semi-structured data, which is definitely different from very structured data like a CSV that has columns and rows. Let's go to our file explorer — we have this JSON sample — copy it as a path, paste it right here, and we'll do read_json.
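The separator handling above can be sketched as follows — read_csv needs sep="\t" for tab-separated text, while read_table defaults to tabs, so both paths land on the same DataFrame (sample data is made up):

```python
import io
import pandas as pd

tab_text = "Country\tRegion\nAfghanistan\tASIA\nAlbania\tEASTERN EUROPE\n"

# read_csv defaults to a comma, so tabs must be spelled out...
df1 = pd.read_csv(io.StringIO(tab_text), sep="\t")

# ...while read_table uses the tab character as its default separator.
df2 = pd.read_table(io.StringIO(tab_text))

print(df1.equals(df2))    # True -- identical DataFrames either way
```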
Again, these different functions were built specifically for these file types — that's why each one has a different name. So now we're reading this in as JSON; let's run it, and it reads in properly.

Now let's copy this and take a look at Excel files, because Excel files are a little different from the other ones we've looked at. We'll do read_excel. Let's go down to our file explorer and actually open up this workbook. As you can see, we have Sheet1 right here, but we also have this World Population sheet, which has a lot more data. Let's say we just wanted to read in Sheet1 — we can do that — but by default it's going to read in World Population, because it's the first sheet in the Excel file. Let's go ahead and take a look. Let's get out of here and — oops, I forgot to copy the file path — copy it as a path, put it right here, and read it in with no arguments or parameters at all. When we read it in, it reads that very first sheet, the one with all of the data. Now let's say we wanted to read in the other sheet instead: we just add sheet_name= and specify the sheet. Was it "Sheet1", like this? Yes, it was. So we just had to specify the sheet name right here, and it brought in that sheet instead of the default, which is the very first sheet in the Excel file.

That definitely covers a lot of how you read in these files. Again, you can come in here, hit Shift+Tab and this plus sign, and look through all the documentation — you can specify a lot of different things that I didn't think were very important for you to know, especially if you're just starting out. The ones we looked at today are the ones I use almost all the time, so I wanted to show you those. But maybe you're interested in one of these other options, or you have very unique data and really need one of them.
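The JSON and Excel readers above can be sketched like this. The JSON part runs against an in-memory sample; the Excel call is shown only as a comment, since it needs a real workbook and an engine such as openpyxl (the path is hypothetical):

```python
import io
import pandas as pd

# JSON is semi-structured; read_json turns a list of records into rows.
json_text = ('[{"Country": "Afghanistan", "Region": "ASIA"},'
             ' {"Country": "Albania", "Region": "EUROPE"}]')
df = pd.read_json(io.StringIO(json_text))
print(df.shape)   # (2, 2)

# For Excel, pandas reads the first sheet unless told otherwise:
# pd.read_excel(r"C:\...\world_population.xlsx")                      # first sheet
# pd.read_excel(r"C:\...\world_population.xlsx", sheet_name="Sheet1") # named sheet
```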
If so, it's worth really getting in here and figuring things out.

A few other things I wanted to show you in this first intro video on reading in files: one thing you may have noticed, especially in this file right here, is that we're only seeing the first five and the last five rows. All the data in between is hidden behind these little three dots, and we want to be able to see it, but right now we can't — that's because of some display settings built into pandas, and all we need to do is change them. This one has 234 rows and 4 columns, so obviously we can see all the columns; let's just change the rows. All we'll say is pd.set_option, and since we're changing the rows, not the columns (at least not on this one), we'll pass "display.max_rows". Now, for whatever data we bring in, it will show up to that many rows. We'll say 235 — it's 234 rows, but I'm just going to be safe. Let's run this; the setting is changed, so let's read in the file again and you'll see the difference. Now we have all the rows, plus this little scrollbar on the right that lets us go all the way down to the bottom and back to the top, so we can actually look through and skim our values. I like that better than the shortened version.

We can do the exact same thing for columns. If we look at this one — our JSON file — it has the same issue: it has, what was it, 38 columns, but we can only see, I think, 20 or so of them; I can't remember exactly. We'll do the same thing and say pd.set_option with "display.max_columns" and set that to 40. When we run this — oops, let me get over here — we can now scroll over and see every single one of our columns. That one, in my opinion, is a lot more useful; I like being able to see every column. It's definitely something you should use, especially with really large files: you want to see a lot of the data and a lot of the columns, so that when you're slicing and dicing and doing all the things we're about to learn in this pandas series, you know what you're looking at.

I also want to show you how to take a quick look at the data in these DataFrames, because that's also pretty important. Let's go right down here: the very last one we imported was this read_excel one, so that's the DataFrame our variable currently holds — let's run it. Since it was the last one to be run, this df variable won't be applied to all the other ones; we can always go back and change those. Typically you'd assign a second DataFrame to something like df2, so let's keep df2 — oops. We'll bring df2 right down here, because we want to take a look at this data and learn a little bit more about it.
info and we’ll do an open parenthesis and when we run this it’s going to give us a really quick breakdown of a little bit of our data so we have our columns right here rank CCA 3 country and capital it’s saying we have 234 values in those columns because there’s 234 scroll up here because there’s 234 uh rows that tells me that there’s no missing data in here at least not you know completely missing like null values there is something in each of those rows the count tells me it’s non null so there’s no null values and it tells me the data type so it’s ringing in as an integer an object an object and an object and it also tells us how much memory it’s using which is also pretty neat because when you get really really large data types memory usage and and knowing how to work around that stuff does become more important than when you’re working at these really small You Know sample sizes that we’re looking at we can also do oops let me get rid of that can also do data frame two and we’ll do shape and for this one we do not need the parentheses and all this is going to tell us is we have 234 rows and four columns we’re also able to look at uh the first few values or rows in each of these data frames so we can just say dataframe 2. 
df2.head() gives us the first five rows, but we can specify how many we want: head(10) gives the first 10 rows. We can do the same thing going the other way — let's go right down here and say tail(10) — and that gives us the last 10 rows of the DataFrame.

Now let's copy this. Let's say we don't want to look at all of these columns; we can pick one out by saying df2 — oops, let's get rid of all of this — and then, in brackets with quotes, "Rank". Now we're looking at just the rank data. We can't do the same thing with the index, at least not like this. If we want to use the index that's right here, there are two very special accessors called loc and iloc — I'm going to have an entire video on these, because they get a little more complex. loc stands for location and iloc for integer location; they work on the indexes, whether that's the x-axis or the y-axis. loc looks things up by the actual label — the actual string of the index. So if we come up here to df2, we can specify 224 and it gives us that row's information in a slightly different format: we write df2.loc[224], and when we run it, we get the Rank, CCA3, Country, and Capital with their values over here, almost like a dictionary.

Now let's copy this and say df2.iloc[224]. Right now these look exactly the same, but we haven't really talked about changing the index yet — you can change the index to a string, or a different column, or something like that, and we'll look at that in future videos. iloc looks at the integer position, so even if this index had been changed to, say, Rank or CCA3 or Country or whatever you make the index, iloc would still look at the integer position — position 224 would still be 224 even if its label were Uzbekistan, and the output would be exactly the same.
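The head/tail and single-column selection above can be sketched like this (20 generated rows stand in for the real data):

```python
import pandas as pd

df2 = pd.DataFrame({
    "Rank": range(1, 21),
    "Country": [f"Country{i}" for i in range(1, 21)],
})

print(df2.head())      # first 5 rows by default
print(df2.head(10))    # first 10 rows
print(df2.tail(10))    # last 10 rows
print(df2["Rank"])     # one column, returned as a Series
```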
But if we had changed that index, loc is the one we could search by label — we could search for "Uzbekistan" (is that how you spell Uzbekistan? Hey, I nailed it). So that is how you use loc and iloc; I just wanted to show you a little bit about how you can look at your DataFrame and search within it.

Hello everybody! Today we're going to be looking at filtering and ordering DataFrames in pandas. There are a lot of different ways you can filter and order your data, and I'm going to try to show you all the main ones. Let's kick it off by importing our dataset. We'll say df = pd.read_csv — and I need to import my pandas first, so import pandas as pd; that's pretty important, I think — then pd.read_csv with the r prefix and the path to the world population CSV. Let's run this: here's our DataFrame, and it's the one we're going to be filtering and ordering.

The first thing we can do is filter based on the data within our columns — Asia, Europe, Africa, or whatever data a column may hold. Let's go right down here. We'll say df, and then inside the brackets we specify the condition for the column we're filtering on: df["Rank"], so we're looking at this Rank column right here, and then we'll ask for values greater than 10. That's actually going to be a lot of them, so let's do less than instead. When we run this, it only returns the rows where Rank is less than 10. We can also do less than or equal to — all of the comparison operators work — so with less than or equal to, we now have ranks 1 through 10.

Now, looking at these countries, we can also filter by specific values, almost exactly like we did here — but instead of using a comparison operator, we can list out specific names.
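The loc-versus-iloc distinction above can be sketched with a small frame whose index is a string label, so the two accessors visibly differ in what they take (sample values are made up):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Rank": [36, 46], "Country": ["Uzbekistan", "Zimbabwe"]},
    index=["Uzbekistan", "Zimbabwe"],  # a string index, to show the difference
)

# .loc looks up the index *label*; .iloc looks up the integer *position*.
print(df2.loc["Uzbekistan"])   # by label
print(df2.iloc[0])             # by position -- the same row here
```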
Let's say we want Bangladesh and Brazil: we can use the isin function, almost like the IN clause in SQL, if you know SQL. Let's go right down here and create specific_countries — we'll make a list of the countries we want, Bangladesh and Brazil. Then we'll say: for these specific countries, from the DataFrame, look in this Country column — so df, then another bracket for df["Country"] — and call .isin(), passing in our specific countries. We're looking at just this column and asking: are these values within it? And... we're getting an error, and this looks very odd. Let me — this doesn't look right... there we go. I just had some syntax errors; I apologize, I made it way more complicated than it needed to be. But here's how you use the isin function: we're looking for Bangladesh and Brazil, and it returns the rows with Bangladesh and Brazil.

We can also do a contains function — kind of similar to isin, except it's more like LIKE in SQL. I'm comparing a lot of this to SQL because when I'm filtering things, my brain always goes to SQL, but in pandas it's called contains. Let's actually copy this, because I don't want to make the same mistake again, and we'll do the bracket — but instead of .isin, we're going to use .str.contains.
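The comparison-operator and isin filters above can be sketched like this (a five-row made-up sample instead of the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Rank": [1, 2, 7, 8, 30],
    "Country": ["China", "India", "Brazil", "Bangladesh", "Peru"],
})

# A comparison builds a boolean mask; indexing with it keeps matching rows.
print(df[df["Rank"] <= 10])

# isin works like SQL's IN: keep rows whose value appears in the list.
specific_countries = ["Bangladesh", "Brazil"]
subset = df[df["Country"].isin(specific_countries)]
print(subset)
```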
With an open parenthesis, we'll check whether each value contains a string — let's do "United", as in United States or any other United. Let's run this, and as you can see, we get United Arab Emirates, United Kingdom, United States, and United States Virgin Islands. So we can search for a specific string, number, or value within that Country column.

So far we've only looked at filtering on columns; we can also filter based on the index, and there are two main ways to do it: there's filter, and then there's loc and iloc. loc stands for location and iloc stands for integer location — if you've seen previous videos, I've mentioned those. Let's take a quick look at all of them. Really quickly, we need to set an index, because the index right now is not the best; we'll set it to Country. Let's say df2 = df.set_index("Country"). I'm using df2 because later on I want to use the original DataFrame again, so I'm assigning this to another variable so we can easily switch back and forth. Now Country is the index, and we can use the filter function.

Let's go down here and say df2.filter() with an open parenthesis. Now we can specify items — these specify which columns we want to keep. We'll say items= and then make a list: "Continent" (hope that's how we spell continent — I'm always messing up my spelling) and "CCA3", because why not; you can specify whichever ones you want. When we run this, it brings in only those two columns. By default it's choosing the axis for us, but we can also specify which axis to search on: if we say axis=0, it searches down this axis — this is the zero axis, and this is the one axis.
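The str.contains filter above can be sketched like this — a substring match on each value, much like SQL's LIKE '%United%' (sample rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "France",
                "United Arab Emirates"],
})

# str.contains returns a boolean mask: True where the substring appears.
matches = df[df["Country"].str.contains("United")]
print(matches)   # three "United ..." rows
```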
Where our columns are is axis 1, so if we go back and do axis=1, we're searching on that axis — the headers. Again, that's the default, but you can specify it, so if you just want to filter on the columns right here, you can do that.

Let's actually copy this and do it right down here so you can see what it looks like — but this time let's search for Zimbabwe. We'll pass "Zimbabwe" and look at the zero axis, which is the up-and-down index on the left-hand side. When we filter on that, we narrow down to Zimbabwe by looking just at the Country index. We can also use like, just as we did before — I'll show you the same demonstration. You can say like= and, instead of putting in a concrete string, just say "United", searching where axis=0, which again is this left-hand axis. Now it gives us all of the index values that have "United" in them, just like before.

We also have loc and iloc, so we can use them on df2 as well.
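The filter calls above — items on either axis, plus like for substring matching on the index — can be sketched like this (the two-row frame is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Continent": ["Africa", "Europe"], "CCA3": ["ZWE", "GBR"],
     "Rank": [46, 21]},
    index=["Zimbabwe", "United Kingdom"],   # country as the index
)

# axis=1 filters column labels; axis=0 filters the row index.
print(df2.filter(items=["Continent", "CCA3"], axis=1))
print(df2.filter(items=["Zimbabwe"], axis=0))
print(df2.filter(like="United", axis=0))    # substring match on index labels
```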
df2.loc takes a specific label, so we'll do "United States". loc just looks at the actual name — the value of the index label, not its position — so if we search for United States, it gives us this right here: all of the columns and all of the values for United States. Or we can use iloc, the integer location, which is not quite the same: with loc we're looking at this string, but underneath it there's still a position, and that's the integer location. Let's do a completely random one, say 3. If we look at position 3, it gives us ASM — I'm not exactly sure what country that is — but it's basically the same kind of output: the columns and the values. So that's another way to search within your index when you're trying to filter down your data.

Now let's look at ordering, and let's start with the very first DataFrame we looked at — df; that's why I kept it, because I wanted to use it later. We can sort and order these values instead of leaving them a jumbled mess: single columns or multiple columns, ascending or descending, however we'd like. Let's look at how to do that. We'll say df, then df["Rank"] again, just like we were doing above, and take the rows where Rank is less than 10 — I should have just copied this from before, I apologize. So now we have that filtered DataFrame. Now we can do .sort_values, and this is the function that lets us sort everything we want to sort. We'll say by= and order it by the same column we filtered on, "Rank". What this does is order our Rank column, and as you can see it did: 1, 2, 3, 4, 5. We can also make it ascending or descending.
If you look in here at what you can specify: ascending=True is the default, so that changes nothing, but if we say False, it sorts descending, from highest to lowest, so now we have it in the opposite direction.

Now, we don't have to sort on just one single column; we can do multiple columns by making a list right here — whoops — just like that, and adding other columns as well. Let's add Country. When we run this, it gives us ranks 9, 8, 7, 6, along with the countries Russia, Bangladesh, Brazil. If you noticed, the country order didn't really change, because the ranks are all unique — there's an order of importance here, and it starts with the very first column in the list. If we swap them around and put Country first, with a comma right here, now the countries are sorted descending and Rank comes second, so Rank isn't really going to have any effect: we get United States, Russia, Pakistan, and the rank didn't get ordered at all.

To see how this can actually work, let's put Continent right here and Country here. If we run this, it's first going to sort the Continent, and then, within each continent, sort the Country. So keep your eye right here on this Asia area, because we're going to sort differently than plain ascending. We have ascending=False, and that applies to both columns — False and False — but we can specify each one individually: a False here and a True here, so [False, True]. What that does is apply False to the Continent, so the continents right here stay in descending order, while the countries within each continent sort ascending. And that is a lot of how you can filter and order your data within pandas.
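The multi-column sort above, with a per-column ascending list, can be sketched like this (four made-up rows; Continent sorts descending, Country ascending within it):

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Africa", "Africa"],
    "Country": ["India", "China", "Egypt", "Nigeria"],
    "Rank": [2, 1, 14, 6],
})

# by= lists the sort keys in order of importance; ascending= can be a
# matching list, one flag per key.
result = df.sort_values(by=["Continent", "Country"], ascending=[False, True])
print(result)
# Asia first (descending continents), countries A-Z inside each continent.
```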
Hello everybody! Today we're going to be looking at indexing in pandas. If you remember from previous videos, the index is an object that stores the axis labels for all pandas objects. The index in a DataFrame is extremely useful because it's customizable, and you can also search and filter based on it. In this video we're going to talk all about indexing — how you can change and customize the index, and how you can search and filter on it — and then we'll also look at something a little more advanced called multi-indexing. You won't always use it, but it's really good to know in case you come across a DataFrame that has it.

Let's get started by importing pandas: import pandas as pd. Now we'll get our first DataFrame: df = pd.read_csv, and I've already copied this, so we'll do r and then this file path. I have this world population CSV — I'll have it in the description, just like in all of my other videos. Let's run df and take a look. We have a lot of information here: Rank, Country, Continent, Population, as well as the default index from 0 all the way up to 233.

If you haven't watched any of my previous pandas videos, the index is pretty important: it's basically a number or a label for each row. It doesn't even have to be a unique number — you can create or add an index yourself if you want to, and it doesn't have to be unique — but it really should be, especially if you want to use it appropriately for what we're doing. The Country column is actually going to make a pretty great index, because every country is unique — each row is a different country along with its population. So let's go ahead and add Country as our index. We can do this in a lot of different ways, but the first way works if you already know, when reading the file, which column you're going to make the index.
We can just go right in here when we're reading in the file and add a comma and index_col — oops, I spelled that completely wrong — and say that's equal to "Country". So we're taking this Country column and assigning it as the index. Let's read this in, and as you can see, Country is now our index. It looks a little different: we don't have the Country header up with the others, but you can tell it's the index from the bold labels on the far left — all the regular data columns are over here, while the Country header sits lower than all the others. That's a quick way to see which column is the index.

Before we move on, I want to show you some other ways to do this, but first let me show you how to reverse it. We had our DataFrame right here, so we'll say df.reset_index(inplace=True) — inplace=True means we don't have to assign the result to another variable and all that; it just takes effect. Now when we run the DataFrame again, the index is reset to the default numbers.

Now let's go down here and I'll show you a different way. You can do df.set_index("Country") — very similar to when we were reading in the file and said index_col="Country". If we do this and run it, it works, but if we call df right down here, the change wasn't saved. If we want to save it, just like we did above, we pass inplace=True, so we don't have to assign it to another variable. So now when we run this, with inplace=True, the change is saved.
Country will now be our index again — let's run this, and there we go.

Now, what's really great about this index is that we can search based on it, filter on it, and basically look through our data with it. There are two common ways to do that — at least, these are what people who use pandas typically reach for when searching through the index. The first is loc, and there's loc and iloc: location and integer location. Let's look at loc first: df.loc and then a bracket. Now we can specify the actual string, the label. Let's go right up here and pick Albania, so we'll say "Albania" — again, this looks things up by label. Let's run it: it brings up all the Albania data, laid out almost like a single column. We can get the exact same data using iloc right here: when we ran loc, we were searching for Albania, which sits in position 1, so if we pull position 1 with iloc — the integer location — we get the exact same data.

Now let's take a look at multi-indexing, and we'll come back to a bit of this in a second. Multi-indexing means creating multiple indexes: we're not just going to make Country the index; we're going to add an additional index on top of it. Let's pull up our DataFrame. Right now we have Country as the index, so let's do .reset_index with inplace=True.
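The set_index / reset_index round-trip above can be sketched like this (a two-row made-up frame):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["China", "India"], "Rank": [1, 2]})

# set_index returns a new frame unless inplace=True saves the change.
df.set_index("Country", inplace=True)
print(df.index.name)    # Country

# reset_index puts Country back as a column and restores the 0..n-1 index.
df.reset_index(inplace=True)
print(list(df.columns))  # ['Country', 'Rank']
```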
equals true oops oops let’s run it so now we have our data frame now let’s set our index but this time when we set our index we’re going to add the country as the index as well as the continent as an index so we’ll say data frame. setor index then we’ll do a parenthesis and instead of just doing country like we did before we’re going to create a list oops and we’ll do it like that and then we’ll say oops continent and separate by a comma so we have continent and country let’s just say in place is equal to true now when we run this we’re going to have two indexes let’s see what this looks like and let’s run this so now we have country as well as continent as our index now you may notice that these indexes are repeating themselves on this continent index we have Europe right here and Europe right here as well as Asia and Asia and it looks a little bit funky but we are able to sort these values and make they look a lot better so let’s go ahead and try this we’ll do DF do sortore index and when we run this it should sort our index alphabetically and we can also look in here and see what kind of things we can you know specify we can specify the axis but it’s automatically going to be looking at the zero this is zero and this is one so we have two axes within our data frame you choose the level whether it’s ascending or not ascending in place kind string sort remaining all of these different things the only one that I really you know think is worth looking at is the ascending we already know some of these other ones but if we look at ascending let’s run it now it’s sorted these and so now it’s kind of grouped together so we have Africa and all the African ones as well as South America and all the South American ones let’s really quickly say pd. setor option and we’ll say display. max. 
Let's run it — I need to specify a value here. Let's see how many rows we have: 235, so let's pass 235. Now when we run this you can see that Africa is all grouped together, with all its countries in alphabetical order under it, and then all the way down to Asia, again all alphabetical. If we wanted to, we could pass ascending=False, and then it's the exact opposite: it starts with South America, the last one, and goes in reverse alphabetical order. We could also make it a list, ascending=[False, True], and it would sort the first index level descending and the second ascending — so you can really customize it, but for what we're doing the default is all we need. Now, when we try to search by our index like we did before with df.loc, something changes. If we say df.loc['Angola'], it won't work properly, because loc searches the first index level for that string. We can search 'Africa', and now we get all the African countries; and if we want to drill down to Angola, we go down another level: df.loc['Africa', 'Angola']. Now we're back to the single-row view we had before, where we pull all the data for that country. We couldn't do it from just 'Angola' because there's an additional index level in front of it; once we call both indexes, we get this view. But let's look at iloc really quick. Up here Angola sits at position 1 under Africa, so you might think df.iloc[1] would pull up Angola — but it still pulls up Albania, just like when we didn't have multiple indexes.
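Searching through a MultiIndex as described above can be sketched like this (illustrative population numbers, not real data):

```python
import pandas as pd

df = pd.DataFrame(
    {"continent": ["Europe", "Africa", "Africa", "Asia"],
     "country": ["Albania", "Angola", "Algeria", "Armenia"],
     "population_m": [2.9, 32.9, 43.9, 3.0]}  # illustrative numbers
).set_index(["continent", "country"]).sort_index()

africa = df.loc["Africa"]              # everything under the first level
angola = df.loc[("Africa", "Angola")]  # drill down through both levels

# iloc ignores the labels entirely and works on raw row positions.
second_row = df.iloc[1]
```

Note that `df.loc["Angola"]` would raise a KeyError here — loc searches the outermost level first — while iloc never looks at the labels at all.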
The difference with multi-indexes is that loc works through the multi-index labels, whereas iloc ignores the multi-index entirely and goes off the original integer positions. So that's a lot about indexing in pandas. We'll cover a few more things in future videos as we get deeper into pandas, but this is most of what indexing looks like, and it's super important to learn because it's a building block for the rest of this pandas series. Hello everybody, today we're going to be taking a look at the groupby function and aggregating within pandas. groupby groups together the values in a column and displays them on the same row, which allows you to perform aggregate functions on those groupings. Let's start by reading in our data: we'll do import pandas as pd, then say our DataFrame equals pd.read_csv with an r before the file path, and we'll be looking at the flavors CSV. Here we have our flavor of ice cream, the base flavor (vanilla or chocolate), whether I liked it or not, the flavor rating, the texture rating, and its overall or total rating. These are all my own personal scores — I've spent years researching this, so they're very accurate — but this should be a low-stress environment to learn groupby and the aggregate functions. The first thing we can do is look at our groupby. Now, you can group by flavor, but as you can see those are all unique values; what we need is a column with duplicate or similar values on different rows that will group together. The base flavor is a perfect one, and we'll do that by saying df.groupby('Base Flavor').
This will group those flavors together. Let's run it, and as you can see it's actually its own object: a DataFrameGroupBy object. Now that we've grouped them, let's give it a variable — group_by_frame — and run it. What we need to do next is run an aggregation to get an output, so we'll say .mean(), and that's all for now, just to get something we can look at and build from. Let's run it: we have our base flavor, which is now the index, with chocolate and vanilla, and it's taking the mean of all the columns that have numbers. Notice it did not take the Liked column or the Flavor column, because those are strings and can't be averaged — we'll come back to that later — but for every numeric column it gave us the average of those ratings. Right off the bat, as averages, the chocolate-based flavors rate much higher overall than the vanilla-based ones. We can actually combine all of this into one line: df.groupby('Base Flavor').mean(). Before, with no aggregating function attached, it didn't produce output, but combined into one expression it runs properly. There are a lot of different aggregate functions.
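The basic groupby-plus-mean pattern above can be sketched with a tiny stand-in for flavors.csv (the rows here are hypothetical, not the actual file):

```python
import pandas as pd

# Tiny stand-in for the flavors.csv used in the video.
df = pd.DataFrame(
    {"Base Flavor": ["Chocolate", "Vanilla", "Chocolate", "Vanilla"],
     "Liked": ["Yes", "No", "Yes", "Yes"],
     "Flavor Rating": [9.0, 4.0, 8.0, 10.0]}
)

# Group rows sharing a base flavor, then average the numeric columns.
# numeric_only=True skips the string columns explicitly; newer pandas
# versions raise an error if you try to average strings.
means = df.groupby("Base Flavor").mean(numeric_only=True)
```

The string columns (Liked) simply drop out of the result, which is exactly the behavior described in the video.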
going to show you some of the most popular ones, the ones you'll see most often. Let's copy this and do .count(). When we run it, this shows the actual count of the rows that were aggregated: chocolate had three, so it's three all the way across, and vanilla had six. That higher vanilla count matters when you compare it to the mean above — with only three chocolates, one or two good ones could really pull the average up, whereas with six vanillas, a couple of good ones and several bad ones pulls that average down. So knowing the count of something is really useful. Next, let's look at min and max. When we run .min(), the first thing to notice is that the result now has a Flavor and a Liked column: min and max will compare strings alphabetically by their first letters and return a value. So "Chocolate", with its C-h, is the minimum Flavor string in the chocolate group, and "Cake Batter" is the minimum in vanilla. With Liked it's interesting, because apparently I liked all the chocolate ones — there is no "No" in that column for chocolate, so "Yes" was the only option. Now .max() does the exact opposite, taking the highest value even for strings: "Rocky Road", because R comes later in the alphabet, and likewise for vanilla, plus "Yes" again — and of course the max of each numeric column. When we looked at min I focused on the strings, but it still does the exact same thing to the numeric columns.
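The count/min/max behavior just described — including strings being compared alphabetically — looks like this (hypothetical rows echoing the video's flavors):

```python
import pandas as pd

df = pd.DataFrame(
    {"Flavor": ["Chocolate Fudge", "Cake Batter",
                "Rocky Road", "Mint Chocolate Chip"],
     "Base Flavor": ["Chocolate", "Vanilla", "Chocolate", "Vanilla"],
     "Flavor Rating": [9, 4, 8, 10]}
)
g = df.groupby("Base Flavor")

counts = g.count()  # number of rows per group, for every column
mins = g.min()      # strings compare alphabetically, so they survive here
maxes = g.max()
```

Unlike mean, min and max have a well-defined answer for strings (alphabetical order), which is why the Flavor column reappears in these results.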
For the max in the vanilla group, the flavor rating was a 10, which came from the Mint Chocolate Chip row. We can also look at .sum(), which gives the totals — again only for numeric columns, since we can't add the strings — and since six rows were grouped into vanilla, it ends up with a much higher total score. That's a really simple way to do aggregations, but there's also an aggregation function, .agg(), and let's take a look, because it's a bit more complex, although once I write it out hopefully it makes a lot of sense. What we pass into agg is a dictionary: open parenthesis, then a curly bracket, then we specify the column we're aggregating — 'Flavor Rating', as a string — then a colon, and then the aggregate functions we want. We've done sum, count, mean, min, and max, and we can put several of those into a list and perform all of them on just this one column: ['mean', 'max', 'count', 'sum']. When we run it, we have our base flavors, chocolate and vanilla, and instead of multiple top-level columns we have one column with multiple sub-columns, one per aggregation. It's also possible to pass in multiple columns like that: we add a comma and then 'Texture Rating' with a colon and the exact same list.
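The agg dictionary described above, with a list of aggregations per column, can be sketched as (hypothetical miniature data):

```python
import pandas as pd

df = pd.DataFrame(
    {"Base Flavor": ["Chocolate", "Vanilla", "Chocolate", "Vanilla"],
     "Flavor Rating": [9, 4, 8, 10],
     "Texture Rating": [7, 5, 9, 8]}
)

# A dict maps each column name to a list of aggregations to run on it;
# the result has one sub-column per aggregation under each rating column.
agg = df.groupby("Base Flavor").agg(
    {"Flavor Rating": ["mean", "max", "count", "sum"],
     "Texture Rating": ["mean", "max", "count", "sum"]}
)
```

The resulting columns form a MultiIndex, so a single cell is addressed with a tuple like `("Flavor Rating", "sum")`.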
run it we’re getting the exact same columns mean Max count and sum for flavor rating then mean Max count and sum for our texture rating now so far we’ve only grouped grouped on one column but we can actually group on multiple columns let’s go back up here to our data and I should have just copy this down here let’s go back down and just look at this so really we only grouped it on this base flavor but you can do multiple groupings or group by multiple columns so let’s do our base flavor which we did already as well as the liked column so we’re going to say DF do group by then we’ll do an open parentheses and then instead of just passing through one string we’re going to do a list and we’ll say base flavor oops comma and then we’ll do liked so now when it groups this it should put two groupings and let’s run this and just see oops I got to say let’s just do mean so now we have our chocolate and a vanilla and remember chocolate only had yes so that’s the only one that it’s going to group on but vanilla had a no and a yes so if we look at the vanilla we have our base flavor vanilla and then within liked we have no and a yes which can show us that within our vanilla when we group on these our NOS were really low but our yeses were really high we actually had a pretty similar rating or very close to the same rating as the ones we really liked in chocolate and just like we did above we can take this doag and I’m going to copy this and it’ll perform it on each of those rows let me close that and what did I do wrong oh I need the squiggly bracket and it’ll show us each of those so the mean Max count and sum for all of the chocolate and vanilla as well as the groupings of light yes and no now after we’ve looked at all that and that’s how I usually do it there is one uh shortcut function that can give you some of these things just really quickly and so let’s go back up here and take this it’s just called describe um and if you’ve ever done it it’s just going to give you some 
Let's run describe: for chocolate and vanilla, within each numeric column it gives us the count, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles, and the max. So it covers a lot of those aggregate functions, but describe is a very generalized function — we can't get as specific as we were with agg — I just wanted to throw it out there in case it's something you'd be interested in, because it does show a lot of those aggregations all at one time. Hello everybody, today we're going to be talking about merging, joining, and concatenating DataFrames in pandas. This whole video is about combining two separate DataFrames into one. The join types are really important to understand when using merge and join. First there's an inner join: the shaded overlap is what gets returned — only the rows that are in both the left and the right DataFrame. Then there's an outer join, or full outer join, which takes all the data from the left DataFrame, the right DataFrame, and everything they share — basically everything. We also have a left join, which takes everything from the left plus anything that matches from the right, and its exact opposite, the right join, which gives us everything from the right DataFrame plus whatever matches, but nothing unique to the left. This is just for reference, because in a little bit when we start merging, these become very important, so I wanted to show you how it works visually. Let's get started by pulling in our files.
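The describe shortcut mentioned above bundles most of those statistics into one call (sketched with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame(
    {"Base Flavor": ["Chocolate", "Vanilla", "Chocolate", "Vanilla"],
     "Flavor Rating": [9.0, 4.0, 8.0, 10.0]}
)

# describe() reports count, mean, std, min, the quartiles, and max
# for every numeric column, per group -- all in one shot.
summary = df.groupby("Base Flavor").describe()
```

Like agg with a list, the result uses MultiIndex columns, so individual statistics are read with tuples like `("Flavor Rating", "mean")`.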
first we’re going to say import and is aspd we’ll run this and then we’ll say data frame one and we’ll also have a data frame two and these are the different data frames the left and the right data frame that we’ll be using to join merge and concatenate so we’ll say data frame one is equal to pd. CSV read and we’ll do R and here is our file path so we have this lr. CSV that’s our Lord of the Rings CSV and let’s call that really quickly so we can see what’s in there and I’m having a dyslexic moment uh because it’s supposed to be read CSV uh I apologize for that but this is our data frame this is our data frame one we have three columns it’s their Fellowship ID 101 2 3 and four their first name froto Sam wiise gandal and Pippen and their skills hide and gardening spells and fireworks so this is our very first data frame that we’re going to be working with let’s go down a little bit let’s pull this down here and we’re just going to say data Frame 2 Data frame two and this is the Lord of the Rings 2 so let’s pull this one in now as you can see it’s very similar we have Fellowship ID 1 2 6 7 8 so we have three different IDs here we don’t have six seven and eight in this upper this First Data frame we also have the first name so froto and Sam or Sam wise are in the very first and the second data frame but now we have three new people barir Eland and legalis and now we have this age column which again is unique to just this second data frame first one that I want to look at is merge and I want to look at merge first because I think this one is the most important I use this one more than any of the ones that we’re going to talk about today the merge is just like the joins that we were just looking at the outer the inner the left and the right and there’s also one called cross and I’ll show you that one although if I’m being honest I don’t really use that one that much but it’s worth showing just in case you come into a scenario where you do want to do that so let’s go 
Let's go right down here, where I can see both DataFrames while we work. We'll say df1 — when we specify df1 first and call df1.merge, that's automatically our left DataFrame — then inside the parentheses we pass df2, which is our right DataFrame. Let's see what happens. By default, without us specifying anything, it does an inner join: it only gives us output where specific values, the keys, are the same. What's happening under the hood is it's taking the FellowshipID — 1001 here, 1002 here — and matching against the same IDs in the other DataFrame. 1003 and 1004 aren't in the right DataFrame, and 1006 through 1008 aren't in the left, so the only rows that match are 1001 and 1002, and that's why only those get pulled in. And because we didn't explicitly say what to merge on, it actually matched on both FellowshipID and FirstName — Frodo and Samwise are the same in both, which is why they came over. Really quickly, let's confirm the inner join, since that was just the default: we'll add how='inner', and running it gives the exact same result. Now, just to show how these two DataFrames are being joined, I'll say on= and put only 'FellowshipID'. Let's run it. The first thing you may notice is FirstName_x and FirstName_y: what merge does by default, when you join only on FellowshipID and not on FirstName, is split the shared unmerged column into _x and _y versions.
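The default inner merge and the _x/_y suffix behavior just described can be sketched like this (hypothetical stand-ins for the two Lord of the Rings CSVs, with illustrative ages):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1003, 1004],
     "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
     "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"]}
)
df2 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1006, 1007, 1008],
     "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
     "Age": [50, 39, 2931, 6520, 51]}
)

# Inner join (the default): only keys present in BOTH frames survive.
inner = df1.merge(df2, how="inner", on=["FellowshipID", "FirstName"])

# Merging on only one of the shared columns splits the other
# shared column into _x (left) and _y (right) copies.
one_key = df1.merge(df2, on="FellowshipID")
```

Only Frodo and Samwise exist in both frames, so the inner result has exactly two rows.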
Even though the two FirstName columns hold the exact same values, since we're not merging on that column it keeps both, so we can see the values from each side. If we make on a list — ['FellowshipID', 'FirstName'] — and run it, it looks exactly like it did before: merge pulled both columns in automatically the first time even though we didn't write anything, and now we're just writing it out explicitly. There are other arguments we can pass into this merge function — hit Shift+Tab and scroll down. We have right, which is the right DataFrame, our df2; then how and on, which we've already shown; there's left_on, right_on, left_index, and right_index, probably not something you'll use that much, but the docstrings show exactly how to use all of them if you want to look into it. One that's really useful is sort, which you can set to True or False. Then we have the suffixes: remember how it automatically put in _x and _y? You can customize that and supply whatever strings you'd like instead. We also have indicator and validate — again, all things you can explore in the docstring.
I'm just going to show you the ones I use most. Now that we've looked at the inner join, let's copy this down and look at the outer join — these get a little trickier; I think the inner join is probably the easiest to understand. We'll spell out how='outer' and run it. This looks quite different: where the inner join only gave us the values that match exactly, this gives us all of the values regardless of whether they match. We have 1001 through 1004 and 1006 through 1008 — and notice there's no 1005. Also notice that wherever a row couldn't be matched — like Legolas, who has no counterpart in the left DataFrame — the unmatched columns are filled with NaN, "not a number." It does that for any value where it couldn't find a match on the ID or first name. The same goes for Age on the left side: only 1001 and 1002 were in the right DataFrame, so we have ages for Frodo and Samwise, but Gandalf and Pippin have no corresponding IDs there, so their Age is just blank, as you can see. So again: outer joins are kind of the opposite of inner joins — they return everything from both sides, and overlapping data isn't duplicated. Now let's go on to the left join: we'll say how='left' and run it. What this does is take everything from the left table, the left DataFrame — everything from df1.
any overlap it’ll also pull the overlapped or the you know whatever we’re able to merge on from data frame two so let’s go back up to our data frame one and two so it’s going to pull everything from this left data frame because we’re specifying we’re doing a left join so everything from the left data frame will be in there we’re also going to try to bring in everything from the right but only if it matches or or is able to merge so just this information right here will come over we weren’t able to join on 1006 1007 or 1008 so really none of that information is going to come over so let’s go down and check on this so again we have 1 2 3 4 all of the data with this first name and skills everything is in here but then we are trying to bring over the age but we only have matches with 1,1 and 10002 so only these two values will come in let’s look at the right join CU it’s basically the exact opposite let’s look at the right and this is basically the exact opposite of the left in the fact that now we’re only looking at the right hand and then if there’s something that matches in data frame one then we will pull that in so this is basically just looking like data Frame 2 except we’re pulling in that skills column and since only 1 And1 and 102 are the same that’s why the skills values are here now those are the main types of merges that I will use when I’m using a data frame or when I’m trying to merge a data frame but there also is one called a cross or a cross join uh and let’s look at this one and this one is quite a bit different here we go let’s run this so this one is different in that it takes each value from the left data frame and Compares it to each value in the right data frame so for froto in this left data frame it looks at the froto in the right data frame Sam wise in the right data frame legalis elron and baromir all in the right data frame then it goes to the next value Sam wise does the exact same thing Roto Sam wise legalis Elon baromir and it does that 
It does that for every single value: going back up, it takes 1001 and pairs it with all five right-hand rows, then Samwise with all five, Gandalf with all five, Pippin with all five — you see the pattern. That's what a cross join is. In my opinion there are very few reasons for one — if you ever interview on Python you'll sometimes be asked about cross joins, but there aren't many instances in actual work where you really need it. Now let's take a look at join. Joins are pretty similar to the merge function and can do a lot of the same things, except in my opinion the join function isn't as easily understood as merge — it's a little more complicated. Let's see how we can join these DataFrames with it. We'll say df1.join(df2), very similar to before, and try running it — and it's not going to work. The merge function had a lot of defaults for us; the error here says "columns overlap but no suffix was specified." It's trying to use FellowshipID and FirstName, just like merge did, except it can't distinguish which is which, so we need to go in and help it out — it's a bit more hands-on than merge. Let's add on=, and really quickly let's open up the signature and see what we have. Join has fewer options than merge: other, which is the other DataFrame; on, which column to join on; and how — left, inner, outer — the same join types as merge.
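The cross join described above pairs every left row with every right row; a minimal sketch (the column names are hypothetical, and how="cross" requires pandas 1.2 or newer):

```python
import pandas as pd

df1 = pd.DataFrame({"FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"]})
df2 = pd.DataFrame(
    {"Companion": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"]}
)

# Every left row paired with every right row; no key columns allowed.
cross = df1.merge(df2, how="cross")
```

Four left rows times five right rows gives twenty result rows — which is why cross joins explode quickly on real data.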
Then we have lsuffix and rsuffix, which address exactly the issue we just hit: when columns exist in both the left and the right DataFrame, lsuffix and rsuffix append whatever string we specify, giving each column a unique name so there's no longer a conflict. And we can also sort, like we did with merge. Anyway, back to our on: we'll say on='FellowshipID' and run it — and we're still getting an error; it's just not as simple as merge. Let's keep going and specify the type: how='outer'. Run it — still the same issue, the left suffix and right suffix. So let's finally resolve it (I wanted to show you how much more frustrating this is): we'll say lsuffix='_left' — merge did _x automatically, but here we choose our own — and rsuffix='_right'. Now when we run it, it works. The output looks quite a bit different: we have FellowshipID_left, FirstName_left, FellowshipID_right, and FirstName_right, and it just doesn't look right. Something I didn't mention when I first started, because I kind of wanted to show you this, is that join is usually better when you're working with indexes. With merge we used the column names, and that worked really well and was pretty easy; but as you can see, using column names with join doesn't work exceptionally well. Let's go ahead and create our index, and then I can show you how join actually works a little better with indexes.
Although you can get join to behave the same as merge, it's just a lot more work, so let's use the index. We'll go right down here and say df4 — we'll create a new DataFrame — equal to df1.set_index('FellowshipID'), so we're setting the index on FellowshipID, and then .join(df2.set_index('FellowshipID')) — the second frame indexed on FellowshipID as well. We also want to specify the suffixes, so I'll copy those in too. Now let's run df4. Really quickly, to recap: we're doing the same join as above — joining DataFrame one with DataFrame two — except in both instances we've set FellowshipID as the index, so we're now joining on that index. When we run this, it looks a lot more like the merge output than the join we did above, except FellowshipID is now actually an index. We can still go in and say how='outer', so we can still specify the different types of joins, the different ways of combining these DataFrames — it's just a little different. And that's why in most instances I use the merge function: it's a little more seamless, a little more intuitive. The join function can still get the job done, but as you can see, it takes a little more work. Now let's look at concatenate. Concatenating DataFrames can be really useful, and the distinction between merge and join versus concatenate is that concatenate stacks one DataFrame on top of the other.
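The index-based join recapped above can be sketched as (same hypothetical frames as in the merge examples):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1003, 1004],
     "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
     "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"]}
)
df2 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1006, 1007, 1008],
     "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
     "Age": [50, 39, 2931, 6520, 51]}
)

# join matches on the index, so we index both frames on FellowshipID;
# the still-overlapping FirstName column needs the suffixes.
df4 = df1.set_index("FellowshipID").join(
    df2.set_index("FellowshipID"),
    how="outer", lsuffix="_left", rsuffix="_right",
)
```

The result carries FellowshipID as its index and one row per ID from either side — essentially the outer merge, just index-keyed.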
Merge and join put one DataFrame next to another; concatenate puts one on top of the other, so it operates a little differently. Let's write it out and see: pd.concat, open parenthesis, and we pass [df1, df2] — that's all we have to write. When we run it, just like I said, it literally takes the first DataFrame, 1001 through 1004, and stacks it on top of the right one, 1001, 1002, 1006 through 1008. And just like a left or right merge, when a column like Skills has no values to populate, it shows NaN. Note that even though 1001 and 1002 appear in both, the rows aren't combined — we're not joining on them; we're just concatenating, putting one on top of the other. If we open up concat with Shift+Tab, there are a lot of different options. Remember that axis 0 is the row index down the left-hand side and axis 1 is the columns across the top; you can specify that, and we can also pass join — that's the one I'll look at, though there are others worth exploring. Let's add join='inner': as you can see, it now keeps only the columns that are the same in both DataFrames; the ones that differ were dropped, because they aren't shared between the two. With join='outer' it takes all of them. And as I said, that's operating on the columns.
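The three concat variants above — stacking, inner-joined columns, and side-by-side on the index — look like this (same hypothetical frames):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1003, 1004],
     "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
     "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"]}
)
df2 = pd.DataFrame(
    {"FellowshipID": [1001, 1002, 1006, 1007, 1008],
     "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
     "Age": [50, 39, 2931, 6520, 51]}
)

stacked = pd.concat([df1, df2])               # rows stacked; gaps become NaN
shared = pd.concat([df1, df2], join="inner")  # keep only columns in both
side_by_side = pd.concat([df1, df2], axis=1)  # align on the row index
```

Stacking never deduplicates: the 1001 and 1002 rows from each frame both survive, which is the key difference from a merge.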
But we can also do it on the other axis: let's say axis=1, and now it joins on the row index, 0 through 4, putting the DataFrames side by side, much like a merge would. So that's how concatenate works. I'm going to show you one more thing — it's not up in the title because it's not one I recommend — called append. The append function appends the rows of one DataFrame to the end of another and returns the new DataFrame. Let's do df1.append(df2), very similar to before, and as you can see the result is almost exactly like our first concatenate. But read the warning: "The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead." It's literally telling us append is on its way out — if you want to do exactly this, use concat, because it does the same thing. So I'm not going to show any other variations of append; there's no reason to, since it's being removed. And that's our video on merge, join, concatenate — and append — in pandas. I hope it was helpful and that you learned something, because often you're not working with just one CSV or one JSON or one text file; you're working with multiple, and you need to combine them all into one DataFrame, so this is a really, really important concept to understand. Hello everybody, today we're going to be building visualizations in pandas. In this video we'll look at how to build line plots, scatter plots, bar charts, histograms, and more.
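The append-to-concat migration mentioned above is a one-liner (hypothetical miniature frames; DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):

```python
import pandas as pd

df1 = pd.DataFrame({"FirstName": ["Frodo", "Samwise"]})
df2 = pd.DataFrame({"FirstName": ["Legolas", "Elrond", "Boromir"]})

# pd.concat is the replacement for the removed df1.append(df2);
# ignore_index=True renumbers rows 0..n-1 instead of keeping both
# frames' original row labels.
combined = pd.concat([df1, df2], ignore_index=True)
```

On pandas 2.x, `df1.append(df2)` raises AttributeError outright, so this is the only stacking path going forward.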
also show you some of the ways you can customize these visualizations to make them a little bit better. With that said, let's start importing our libraries. We'll begin with import pandas as pd, which is really all you need to create visualizations in pandas, but we may get a little crazy, so we'll bring in a few others as well: import numpy as np and import matplotlib.pyplot as plt. I may or may not use these; when I get into visualizations I sometimes want to change a few things, so we'll have them on hand just in case. Let's run that. Now for the data set we'll be using: df = pd.read_csv() with the path to these ice cream ratings. Let's take a quick look. These values are completely randomly generated, not real in any way, but I wanted something generic that wouldn't be confusingly complex, just simple numerical values you can understand. Let's also set the index really quickly: df.set_index('Date'), assigned back to the data frame. Now this Date column is our index, so we have January 1st, 2nd, 3rd, 4th, and then our ratings, all plain integers that are really easy to use for demonstrating how you can visualize data, which is why we're using them today. The way we visualize something in pandas is with something called plot.
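Since the video's CSV isn't included here, a minimal stand-in frame with the same shape — a Date index plus three integer rating columns, where the column names are my assumption — can be built like this:

```python
import pandas as pd

# Hypothetical stand-in for the ice cream ratings file (values are random-looking, not real)
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=4, freq='D'),
    'Flavor Rating': [7, 4, 9, 6],
    'Texture Rating': [8, 5, 7, 6],
    'Overall Rating': [7, 5, 8, 6],
}).set_index('Date')
```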
plot and we’ll do our parentheses now let’s go in here really quickly let’s hit shift Tab and this is going to come up and this is pretty important because this kind of is going to tell us what we can do within this plot and unfortunately there isn’t like a quick overview we just have this dock string but we have our parameters right here these are what we can pass in to kind of customize our visualization so the data is going to be our data frame then we have our X and Y labels we can specify the kind and this one’s important because we can specify what kind of visualization do we want we can do a line plot horizontal a vertical bar plot histogram box plot and then a few others including area Pi density all these other things we can also specify if we want it to be a subplot and a lot of these things that I’m specifying you know I’m going to show you how to do you can use uh different indexes you can add titles add grids Legends Styles all these different things I mean you can go through here because there are a lot but you can specify and you know customize all of these things we won’t be going into all of them but I will show you some of the ones that I probably use the most and that I think are the most useful to know right away so let’s get out of here and we’re just going to do DF do plot and when we run this we’ll get this right here and that was super super easy created a line plot by literally doing just about nothing um but by by default it’s going to give us a line plot so if we come up here we say kind and let me get that out of the way is equal to line and we run this so by default without us actually having to input anything it’s giving us that line plot as a default so uh we can specify it’s a line plot as you can see we already have all of our data right here we didn’t have to specify anything it kind of automatically took it in it is visualizing all three of these columns and it has this little um Legend right here and we can specify where we want 
that legend; there's an argument for it. It also gave us tick marks of 2, 4, 6, 8, 10: it read the data, saw roughly where the peak is, and chose the ticks automatically. That's another thing you can specify; we could make it go up to 2, 5, 10, 1,000, whatever you want. And the x-axis is based on that Date index. Really quickly, I wanted to give a huge shout out to the sponsor of this entire pandas series, and that is Udemy. Udemy has some of the best courses at the best prices, and it's no exception when it comes to pandas courses; if you want to master pandas, this is the course I would recommend, since it teaches just about everything you need to know about pandas. Huge shout out to Udemy for sponsoring this series; now let's get back to the video. If we wanted to break these out by column, we can add subplots=True (note the s: it's subplots), and when we run that, each column is broken out into its own visualization instead of all three sharing one. Now let's remove the subplots; I want to show you some of the arguments you can use to make this look nice, without repeating them on every single visualization. First, we can add a title: notice there's nothing really telling us what this is, so add title='Ice Cream Ratings', and running it gives us a nice title. We can also customize the labels for the x and y axes. It automatically took the Date index for the x-axis, but we can change that if we'd like: all we have to do is add xlabel, and since
our x-axis is that Date index, we can say xlabel='Daily Ratings', and then ylabel='Scores' for the y-axis (I hope you can't hear my dog in the background, because they're being insane). Run it, and now we have Daily Ratings on the x-axis and Scores on the y-axis. Now let's go down here and look at our next kind of visualization, the bar plot: df.plot(kind='bar'). This is what a typical bar plot looks like, and a lot of the arguments we just used on the line plot apply to this bar plot too. Something unique to the bar plot is that you can make it stacked: just add stacked=True, and instead of a regular bar chart you get a stacked one, with each column's values stacked on top of one another. We also don't have to plot every single column; we can pick one, for example the flavor rating, and then it only takes in that Flavor Rating column. Notice there's no legend now: a legend only appears when there are multiple series, and we're looking at a single column, so all the values are right there. By default this is a vertical bar chart, but you can change it to a horizontal one; let's bring back all of the columns and take a look at how.
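The customizations above, assuming a frame with the rating columns and a Date index (and pandas ≥ 1.1 for the xlabel/ylabel keywords), look like this:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window needed
import pandas as pd

df = pd.DataFrame(
    {'Flavor Rating': [7, 4, 9], 'Texture Rating': [8, 5, 7]},
    index=pd.Index(['2023-01-01', '2023-01-02', '2023-01-03'], name='Date'),
)

# Line plot with a title and custom axis labels
ax = df.plot(kind='line', title='Ice Cream Ratings',
             xlabel='Daily Ratings', ylabel='Scores')

# Stacked bar chart: one rectangle per column per row
ax_bar = df.plot(kind='bar', stacked=True)
```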
We use df.plot.barh(); I tried keeping kind='bar' in there, but that has to come out, because barh() is its own function. Running it now, we get the stacked bar chart again, except horizontal: basically the same thing as the vertical version, which may look better depending on your values. The next one we'll look at is the scatter plot: df.plot.scatter(). If we just run that, we get an error, because a scatter plot needs the x and y axes specified in order to work. So let's say x='Texture Rating' (any of our columns will do) and y='Overall Rating', and now it runs properly. If we hit Shift+Tab in here, we can see other things we can specify besides x and y: there's s, which changes the size of the actual dots in the scatter plot, and c, which is the color of each point. Starting with s, let's try s=100, which gives much larger points, and then s=500, which is larger still; you can make them however large you're looking for. For color, let's add c='yellow' and see if it works. It does, though it looks absolutely terrible. Now let's move on to the histogram. A histogram is always a good one; it's very similar to something like a bar chart, but what's great about a histogram is that you can specify the bins.
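Recapping the scatter-plot arguments just shown (the column names are assumed from the ratings frame):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window needed
import pandas as pd

df = pd.DataFrame({'Texture Rating': [8, 5, 7, 6],
                   'Overall Rating': [7, 5, 8, 6]})

# x and y are required for scatter; s sizes the dots, c colors them
ax = df.plot.scatter(x='Texture Rating', y='Overall Rating',
                     s=500, c='yellow')
```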
So let's do df.plot.hist() with an open parenthesis, and hit Shift+Tab to look at this one too. Among the parameters are the columns or data frame to pull in, and bins, which has a default of 10. Let's run it as is to see the default histogram, then specify bins=20 and see what that looks like: the columns get narrower right off the bat. Remember, histograms are really good for showing the distribution of a variable; since these are completely random numbers, this particular histogram won't make much sense, but you can at least see visually how it works. And if I didn't mention it before (I should have), the bins represent how many intervals appear along the bottom: with bins=1 there's just one very large bar; going down from 10 to 5 gives five bars, so the distribution gets more compact; and spreading out to something like 100 stretches it way out. It's showing the distribution across however many bins you want, and the default of 10 is usually pretty good for a lot of things. Now let's go down and look at the box plot, which is a pretty interesting one. Let's visualize it first and then I'll explain how it works: df.boxplot(). Run that, and what we're looking at is a set of markers within our data: the bottom whisker is the minimum value within that column, and the bottom of the box is the 25th percentile of all the values in that column.
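The bins behavior can be sketched on a single column (the data here is an assumption, randomly generated like the video's):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window needed
import pandas as pd
import numpy as np

# 100 random integer ratings between 1 and 10
ratings = pd.Series(np.random.default_rng(0).integers(1, 11, size=100))

# bins controls how many intervals the values are distributed across (default 10)
ax = ratings.plot.hist(bins=20)
```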
The line inside the box is the 50th percentile (the median), above that is the 75th percentile, and the top whisker is the maximum value. So I can glance at one column and see a low minimum, a high maximum, and a definite skew toward the lower range, whereas another column has a lower minimum and a higher maximum, with its median at 6 versus 4 over here, so it skews a lot higher. Now let's go down and take a look at an area plot: df.plot.area(), and run it to see the default. Something I wanted to show you earlier and just hadn't gotten around to is figure size, or figsize: this plot looks a little small and cramped, so let's increase the size with figsize=(10, 5), which makes it a lot larger; just something I wanted to throw in there. I look at area charts as pretty similar to line charts; they're different visually, and you absolutely can use them for different types of visualizations, but I don't use this one a lot, if I'm being honest, which is why it's toward the end of the video. Let's go on to our very last one, the beautiful pie chart: df.plot.pie(). Run it and we get an error, because we need to specify the column we're working with: the y argument is the label or column we're going to plot, and that's really all we need. So y='Flavor Rating', run it, and we get our visualization. Let's make this one bigger too: figsize=(10, 6) makes it a little bigger.
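The box, area, and pie variations above can be sketched as follows (column names assumed):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window needed
import pandas as pd

df = pd.DataFrame({'Flavor Rating': [7, 4, 9],
                   'Texture Rating': [8, 5, 7]})

box = df.plot.box()                                      # min/25%/median/75%/max per column
area = df.plot.area(figsize=(10, 5))                     # figsize is (width, height) in inches
pie = df.plot.pie(y='Flavor Rating', figsize=(10, 6))    # pie requires a y column
```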
It definitely depends: the legend auto-populates, and you can make the chart as big as you want; obviously it looks a little better larger. The colors auto-populate too, and you can customize them, although I've found that when you have a lot of slices it's harder to customize them easily; definitely look into it, though, because almost everything in here can be customized in some way. It does get a little tricky, and you'll have to do some research and Googling around to figure out how to do those things. Now, one last thing I wanted to show, and something I probably could have done at the beginning, is that you can actually change the overall look of these visuals, and we can do that pretty easily within matplotlib, because there are different styles. Let's add a new cell and say print(plt.style.available), plt being that matplotlib.pyplot import from earlier. What this does is show us all the different stylings you can use to change up a visualization. Then, once we find the one we like, we apply it with plt.style.use().
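Listing and applying a style looks like this. I use the built-in 'ggplot' style here because the seaborn style names vary across matplotlib versions (e.g. 'seaborn-v0_8-deep' in matplotlib ≥ 3.6):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no window needed
import matplotlib.pyplot as plt

print(plt.style.available)   # all installed style names
plt.style.use('ggplot')      # every subsequent plot picks up this style
```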
In the parentheses we just specify which one we want. There are all these seaborn ones, and Seaborn is a really great library; let's try seaborn-deep (I haven't tried this one at all), and it changes some of the colors and visuals. We can try something like fivethirtyeight, which looks quite a bit different, and then something like classic; I don't know what that one looks like, so let's just try it. You can try out all these different styles, find one you think looks really nice, and run with it through all your visualizations. Hello everybody! Today we're going to be cleaning data using pandas. There are literally hundreds of ways to clean data within pandas, but I'm going to show you some of the ones I use a lot and that I think are really good to know when you're cleaning your data sets. We'll start by saying import pandas as pd and run that. Now we'll import our file: df = pd.read_excel(), since this data is actually in an Excel file, with an r before the quoted path so it's read as a raw string, and then we'll call the data frame variable and actually look at the data. Scrolling down and taking a look at this Excel file we're reading in: right off the bat we have a customer ID that runs from 101 on down, a first name column where everything looks pretty good, and then a last name column where it looks like we have some errors: forward slashes, dots, null values, things we'll definitely have to clean up, because we don't want that in the data. We have a phone number column in a lot of different formats, with NaNs (not a number) and
just lots of different stuff, so we're going to need to standardize that: clean it up so it all looks the same. We also have an address column: some rows have just a street address, while others have a street address plus another location and, in some, a zip code, so we'll probably want to split those out. We have a paying customer column of yeses and nos that aren't written consistently, so we'll have to standardize that; a do-not-contact column, same situation as the paying customer; and a "not useful" column that we'll probably just get rid of. So the scenario is that we've been handed this list of names and we need to clean it up and hand it off to the people who will actually make calls to this customer list. They want all the data standardized and cleaned so the people making the calls can work as quickly as possible, but they also don't want columns and rows that aren't useful to them: things like the not-useful column we're probably going to get rid of, and rows where do-not-contact says yes mean we should not contact them, so we'll probably want to remove those somehow. That's a lot of what we're going to be doing to clean this data set. Normally, the very first thing I do when working with a data set, except in very rare cases where you're actually supposed to have duplicates, is drop the duplicates from the data set completely. All you have to do is df.drop_duplicates(); they make it super easy for you. Let's run it: in the original data set we have rows 19 and 20, which are obviously duplicates with exactly the same data, and if we look at the result, row 20 is gone and we now have just one row of Anakin Skywalker. Of course we want to save that result.
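That first step, on a tiny stand-in frame (the names are borrowed from the video; the values and column labels are my assumptions):

```python
import pandas as pd

df = pd.DataFrame({'CustomerID': [101, 102, 102],
                   'First_Name': ['Frodo', 'Anakin', 'Anakin'],
                   'Last_Name': ['Baggins', 'Skywalker', 'Skywalker']})

# Drop exact duplicate rows and assign back so the deduped frame is kept
df = df.drop_duplicates()
```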
that so we’re just going to say DF is equal to and DF so now it’s going to save that to the data frame variable again and now when we run this our data frame Now does not have any duplicates that’s definitely one of the easier steps that we’re going to look at uh things are going to get quite a bit more complicated as we go but I’m starting out you know kind of simple so that we can kind of get a feel for it then we’ll start getting into the really tough stuff so the next thing that I want to do is remove any columns that we don’t need I don’t want to clean data that we’re not going to use so if we’re just looking through here you know they may need you know first name last name phone number for sure address might give them some information of where they’re calling to or time zone so we want that this not useful column looks like a pretty good candidate to delete and it’s very easy to do that we’re going to go right down here and we’re going to say DF do drop we’ll do an open parenthesis drop just means we are dropping that column and we can specify that by saying columns is equal to and then we’ll paste in that column that we want to delete so let’s run this and see what it looks like and it literally just drops that column exactly like we were talking about it no longer has that column again we want to save that we can always do in place equals true um if you follow this tutorial series you can always do in place equals true and that’ll save it as well but just for our workflow most of the time I’m going to assign it back to that variable um just for keeping it the same really quickly I wanted to give a huge shout out to the sponsor of this entire Panda series and that is udemy udemy has some of the best courses at the best prices and it is no exception when it comes to pandas courses if you want to master pandas this is the course that I would recommend it’s going to teach you just about everything you need to know about pandas so huge shout out to UD to me for 
sponsoring this Panda series and let’s get back to the video now let’s kind of go column by column and see what we need to fix and we’ll start on this left hand side this customer ID to me looks perfectly fine I’m not going to mess with it at all the first name at a glance also looks perfectly fine I don’t see anything wrong with it visually which is a good thing um although sometimes that can be deceiving and that can cause errors down the line but we’re not going to uh assume that there are errors in here now let’s look at this last name now the last name obviously I’m I’m seeing some obvious things things that we talked about when we were first looking at this data set we have this forward slash which we definitely need to get rid of we have null values so not a number right here we have some periods as well as an underscore right here so all those things I think we should clean up and get rid of it so that when the person is making these calls you know it’s all cleaned up for them so how are we going to do that we can actually do this in several different ways but let’s just copy this last name the first one I’m going to show you is strip and we’ll write it kind of like this we’ll say data frame and then we’ll specify the column that we’re working with because we don’t want to make these changes or strip all of these values from everywhere we only want to do it on just this column if we do this and we don’t specify the column name it will apply to everywhere so if we’re trying to do these yeah let’s say bum these underscores maybe that would mess with something else in another column and we don’t want that so we just want to specify just this last name so let’s go last name. 
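The column drop from a moment ago, sketched out (the label Not_Useful_Column is my guess at the video's column name):

```python
import pandas as pd

df = pd.DataFrame({'First_Name': ['Harry', 'Ron'],
                   'Not_Useful_Column': [True, False]})

# columns= takes a list of labels to remove; assign back (or use inplace=True)
df = df.drop(columns=['Not_Useful_Column'])
```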
So we take df['Last_Name'].str.strip(). Now, what does strip do? (I was hitting Shift+Tab in there to bring up the notes on it, but it wouldn't open.) lstrip takes characters off the left side, rstrip takes them off the right side, and strip takes them off both, and you can specify which characters to strip. For what we're doing in this column, plain strip works, because as you can see the forward slash, the dots, and the underscore all sit on the far sides of the values. If a value were something like "Swan_son", strip wouldn't work at all, because the character isn't on the outside of the word. I'll also show you how to use replace, which is another really good option for things like this, but let's start with strip and see if we can get done what we need done. Run it as is and it looks like nothing changed: without specifying any particular characters, strip only takes out whitespace by default, like spaces that shouldn't be there. We can specify within this exactly which characters we want taken out, so let's do that: let's try lstrip and take out those dots with str.lstrip('...'). Run it and, for Potter, the three leading dots are now gone: they were there before, and when I ran it like this they're removed, because lstrip only takes characters off the left-hand side. We can also do a forward slash the same way, and it removes it, but as you can see we're then no longer taking out the three dots, so they're still there. Now, is it possible to do something like this,
where we put the values inside a list? Let's try it, something like ['1', '2', '3']... and no, it doesn't work: lstrip takes a plain string of characters that it treats as a set, not a list (and it's not a regular expression either; if you've ever worked with regex, you know that gets very complicated very fast, so keep things simple when you're only removing a few characters). So we'll do str.lstrip('...') and take the characters out one by one. Now, in order to save this, we do not want to just write df = ..., because that would be very bad: it would say the whole data frame is now equal to only the values we're seeing here, this one column. We only want to apply it to this column, so we write df['Last_Name'] = df['Last_Name'].str.lstrip('...'). Now when we run it and call the entire data frame, the change applies only to the Last_Name column, and when we go down to Potter it's cleaned up. We'll do the same thing for the other characters: the forward slash with lstrip, and then I'll do lstrip on the underscore just to show you that it won't work, because we're only looking at the left-hand side; the underscore is trailing, so we need rstrip. Use rstrip('_') and now that looks perfect: no underscore. So that's how you use strip, for either the left side, the right side, or both sides at once with strip by itself. I showed you all of that because I'm going to show you a different way to do it, and I apologize, because I somewhat lied to you earlier. Let's rerun from the top: pull the data in again, remove the duplicates (bear with me), drop that column, and now we're sitting with the data frame with those exact same mistakes. I just wanted to
reset it for a second. There is a way to do this all at once, and I just wanted to show you how. We're again looking at just this column, using strip, and let's get rid of the l from lstrip because we want it applied to both ends. You can put all of those characters into one string and it will clean everything up: say we want to get rid of numbers, we include '123'; then the dot, for our '...' on Potter; then the underscore and the forward slash; all in one string: str.strip('123._/'). Run it, and all of them are removed. I showed you the one-at-a-time version first because that's at least how my mind would think about it: I'd think I could put the characters in a list and run it through lstrip or rstrip and it would work, but that's not how strip works; you have to combine it all into one string. So yes, I deceived you; I apologize. Now when we assign what we just did back to the Last_Name column, everything should look perfect, and it does: our customer ID, first name, and last name are all cleaned up. Now we come to a much more difficult one; if I'm being honest, this is probably the hardest one of the whole video: phone numbers. Look at all these different formats. It is not going to be fun, and imagine there were 20,000 of these: you can't just go and manually clean those up, you need something to automate it, and that is what we're going to do. So let's copy the data frame down here and pull it up. We need to clean up this phone number column, and what we want is for every number to look exactly the same.
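The strip variations above, sketched on a few sample last names (the values are modeled on the video's examples):

```python
import pandas as pd

names = pd.Series(['...Potter', '/White', 'Swanson_', 'Mouse'])

# strip/lstrip/rstrip take one string of characters, treated as a set,
# and remove any of them from the ends (not the middle) of each value
cleaned = names.str.strip('123._/')
```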
it’s blank and we’ll keep it blank we don’t want to populate that data but we want all of them to look exactly like this one and what we’re going to do is right off the bat we’re going to take all of the non-numeric values and just completely get rid of them strip it down to just the numbers so this 1 23- 643 or forward slash will just be the numbers same with these bars and these slashes and everything all of these will just be numeric then we’ll go back and reformat it how we want to format it which will look exactly like this one um but we just want to do it for the entire column so let’s go right up here and we’re going to try replace for the first time so let’s do phone number just oops that’s not what I wanted so we’re going to do a bracket say phone number do string. replace just like we did before now we’re going to use some regular expression in here and I’ll kind of do a really high overview although I’m not going to dive super deep into the regular expression then we’re going to do a parenthesis and within there we’re going to do a bracket um I can’t remember what this is called is it called a carrot I think it’s called a carrot uh B I’m just going to call it that it may not be correct but I think it’s a an upper Arrow so it’s an upper Arrow a a d oops A- Z A- Z and then 0-9 now at a super high level what that character that first thing is doing it’s saying we’re going to return any character except and then we specify anything A to Z A to Z upper or lowercase and then actually I think this should be like this A to Z uh and then 0 to 9 so any value like a BC One Two Three those are not going to be matched it’s going to match all of them except these values and then we’re going to replace them by saying comma and we’re going to replace them with nothing so this is just an empty string so literally we’re taking everything that is not an A A B C A 1 two 3 so a letter or a number we’re replacing all of that and then we’re replacing it with nothing so let’s 
run this and see what it looks like, and it looks like that worked properly. We do have this 'Na' left over, because one entry was 'N/a' (I don't remember whose; maybe that was Creed Bratton), but it worked for basically everything else. We're going to go through the entire process and remove those leftovers at the very end: we don't want anyone to see 'NaN' or 'Na' and wonder what it is, we want those truly blank, and we'll do that at the very end. Now that we know it worked, let's assign it: df['Phone_Number'] = that expression. This already looks a lot more standardized than it did before. Now what we want to do is format it, and I've done this many, many times; I always use a lambda (you can definitely use a for loop, I just don't do it that way myself), so I'm going to show you how to do it using a lambda. We say df['Phone_Number'].apply() with an open parenthesis, and this is where we build out our lambda: lambda x:, and then the formatting. What I want is the first three characters, then a dash, then the next three characters, then another dash, and that becomes the returned value. It's not super difficult: x[0:3], which takes characters 0, 1, and 2 (a slice doesn't include its endpoint, so it goes up to but not including 3), then + '-' for our first chunk, and we repeat the pattern for the next chunks.
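The non-numeric cleanup step, on sample numbers shaped like the video's (note that newer pandas wants an explicit regex=True for pattern replacement):

```python
import pandas as pd

phones = pd.Series(['123-545/5421', '123.643.9775', '876|678|3469', 'N/a'])

# A caret inside the brackets negates the class: match anything that is
# NOT a letter or digit, and replace each match with an empty string
digits = phones.str.replace('[^a-zA-Z0-9]', '', regex=True)
```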
six so it should be three four five our next three values then we have a dash and we’ll copy this and we’ll say plus and now we go from six all the way to 10 now let’s try running this and as you can see we get an error now I already know what the error is float object is not subscriptable which means we’re trying to um basically look at it like a string right now it’s not a string it’s actually a number so let me get rid of this for just a second I’ll going show you what it’s talking about so right now we have values that are floats and values that are strings or not even a number so we have values that are strings or not a number so if we want to actually look through it like kind of like indexing if we want to do that they all have to be strings so we need to change this entire column into Strings before we can apply this um formatting now when I was creating this if I’m being honest my first thought when I was doing this was to do it like this string DF phone number um let’s just run that this is what the values look like um and I don’t remember why or why it was doing this I can’t I can’t remember but I looked into it quite a bit and I was like oh I need to apply this string converting it to a string on each value not the entire row or not the entire column so how we can do that is actually fairly easy because we’ve already done a lot of the heavy lifting we’re just going to copy this and we’re going to say x so string of X and again Lambda is like a little anonimous function so you could do this by saying for um X in this uh column we could do a for Loop and then say for every X it equals the string of X and then it changes it to a string but a Lambda just does it a lot quicker um so we’re going to say so let’s do that really quickly and all of our values look exactly the same and that’s how we want it so we’re just going to copy this apply it good and now we’re going to take this and we’re going to run this again just ignore all my commented out stuff 
pretend I don’t have that um so now when we run this it should work there we go now if we look at these numbers 1 2 3- 545 D 5421 and it does that for every every single one where there’s values even when there’s n n or na it’s still adding those values but we expected that so let’s apply it says equal to and then we’ll look at the data frame and this looks almost exactly what we’re hoping for we just need to get rid of these so this n- Dash and this na Dash we need to get rid of those and that is super easy to do um we’re just going to say so now that we’ve done it and we I me it out we’ll say DF and let’s copy this ignore the messiness I do apologize for that it’s very messy um but if you’re following along with me you get what we’re doing so DF phone number so only on the phone number say string. replace parenthesis now we can specify this value so we want to take this exact value and replace it with nothing and let’s just see if that does work it does now we have these Nas and so let’s actually I’ll paste that right down here we’re going to do this is equal to and then we’re just going to take this entire string put it right here and put this value as our what we’re looking for and then replacing and then when we call that data frame it should work properly and it is perfectly cleaned so so we have every single value all the exact same they don’t have different characters or different um you know formatting and we got rid of all the ones that we don’t have or don’t need um all the ones that were just random values so this column is now completely cleaned up again definitely one of the more difficult ones um one that I’ve done a thousand times I’ve had to work with a lot of phone numbers and stuff like that this one does get very tricky especially if you have like a plus one which is like an area code um that can get tricky as well but this is on a kind of a high level this is how you can do that and it’s pretty neat how you can actually you know clean up and 
It's pretty neat how you can standardize those phone numbers. Let's run it and move on to the address. Pretend the people in the call center want the address separated into three different columns so they can read it more easily: see the street, the state, the zip code, whatever they need it for. For this use case it may not make total sense, but splitting columns like this is something I do all the time. Luckily, every part of these addresses is separated by a comma, so we can split on that and create three separate columns from this one column, exactly what we want, and then name them.

We do that with `str.split` on the address column. The first value to specify is what we're splitting on, the comma, and then how many splits it should perform from left to right. Start with one and see what it looks like: it doesn't seem to do anything until we also pass `expand=True`, which actually separates the pieces into columns. With one split we only break on the very first comma, but at least one address contains a second comma, so bump it up to two splits. Now we have three columns.

If we just save it like this, the new columns get the default indexed names 0, 1, and 2, and we don't want that; we want to say what these actually are. We do that by assigning to `df[['Street_Address', 'State', 'Zip_Code']]`: the middle piece isn't always literally a state, but most are, so "State" it is, and the last piece looks like a zip code. The three split columns are applied to those three names and appended to the DataFrame. Note we are not writing `df['Address'] = ...`, so we're not replacing the address; we're creating new columns alongside it. Run it, then call the DataFrame: the new columns are over on the right-hand side (I couldn't see them at first), and it did exactly what we needed.
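The split step above looks like this as code. The addresses and the new column names (`Street_Address`, `State`, `Zip_Code`) are hypothetical stand-ins for the video's data; I've also added an optional whitespace strip, since splitting on a bare comma leaves a leading space on the second and third pieces.

```python
import pandas as pd

# Hypothetical sample addresses in the "street, state, zip" shape discussed above
df = pd.DataFrame({"Address": ["123 Main St, Ohio, 44137", "93 West Main Street, NC, 23232"]})

# Split on the comma at most twice, and expand the pieces into separate columns
df[["Street_Address", "State", "Zip_Code"]] = df["Address"].str.split(",", n=2, expand=True)

# Optional tidy-up: the split leaves leading spaces after each comma
for col in ["Street_Address", "State", "Zip_Code"]:
    df[col] = df[col].str.strip()
```

Without `expand=True`, `str.split` returns one column of Python lists, which is why nothing seemed to change on the first attempt in the video.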
If we wanted to at the very end (we're not going to here), we could now delete the original address column and keep just the street address, the state, and the zip code. This kind of split comes up all the time, for example with names stored as "Alex,Freeberg" or "Alex Freeberg" that you want separated into first-name and last-name columns.

The next columns to look at are Paying Customer and Do Not Contact. They're very similar: a mix of "Yes", "No", "Y", and "N". We want to replace these so everything uses the same formatting, just to keep it consistent. I like to collapse everything down to "Y" and "N"; it saves a little data because the strings are shorter, although admittedly the savings are usually minimal. So on the Paying Customer column we use `str.replace` again.

Be careful with the direction of the replacement, though. My first attempt was to look for "Y" and replace it with "Yes", and it turned the existing "Yes" values into "Yeses": the replace found the "Y" inside "Yes" and swapped just that letter. So do it the other way around: look for "Yes" and change it to "Y". Run that and it looks a lot better. Assign it back with `df['Paying Customer'] = ...`, then do exactly the same thing for "No" and "N". Call the DataFrame and the entire column looks really good, except for one stray not-a-number value, which I'm going to leave for now; we'll clean those up across the entire DataFrame at once at the end instead of going column by column. The Do Not Contact column is literally the exact same fix, so to save us all some time I'll apply the same replacements there. Run it, and aside from a few not-a-number values (which we can get rid of in a second), both columns look exactly like what we're looking for. That is basically the end of cleaning up the individual columns.
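The standardization above is a two-line fix once you replace in the right direction. A minimal sketch with a hypothetical `Paying Customer` column:

```python
import pandas as pd

df = pd.DataFrame({"Paying Customer": ["Yes", "No", "Y", "N", "Yes"]})

# Replace the spelled-out values first; going the other way ("Y" -> "Yes")
# would turn an existing "Yes" into "Yeses", as happened in the video
df["Paying Customer"] = df["Paying Customer"].str.replace("Yes", "Y").str.replace("No", "N")
```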
Now let's go down and clean up those leftover junk values across the whole DataFrame. My first try, `df.str.replace(...)`, throws an error: a DataFrame object has no `str` attribute. That's because `.str` is for individual columns; across the entire DataFrame it's just `df.replace(...)` with the junk value and an empty string. Run that and it works appropriately, so assign it back with `df = ...`, and do the same for the other junk value. But this still won't catch the truly empty cells, because those aren't the string we're searching for; they're literally not a number, technically empty. I'll be honest, I completely forgot this at first: for those we use `df.fillna('')`, which fills every cell that has nothing in it with a blank. Run that and every missing value, throughout the whole DataFrame, now shows up blank, even in the columns that only had a few. Apply it, and the column-level cleaning is completely done. We've removed columns, split columns, formatted and cleaned up the phone numbers, taken stray characters off the last-name column, and standardized Paying Customer and Do Not Contact.
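The frame-wide cleanup above can be sketched like this. The junk value `"N/a"` and the column names are hypothetical examples; the key points are that `replace` works on the whole DataFrame (no `.str`) and that `fillna` is what catches the genuinely empty cells:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with both string junk and real NaN cells
df = pd.DataFrame({
    "Phone_Number": ["123-545-5421", "N/a", np.nan],
    "Do_Not_Contact": ["Y", "N/a", np.nan],
})

# df.replace works across the entire DataFrame (Series.str.replace is per-column)
df = df.replace("N/a", "")

# Real NaN cells are not strings, so replace() above misses them;
# fillna blanks out every empty cell instead
df = df.fillna("")
```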
The call center also asked us for a list of only the numbers they can actually call. If we take a look, some of the Do Not Contact values are "Y", which means we cannot contact those people, and some rows don't even have phone numbers. We don't want to hand the call center either of those, so we want to remove them. There are a few different ways to do this, but let's start with Do Not Contact, since it seems the most obvious. If it's blank we still want to give them a call; we only skip people who have specifically said no, so it's the "Y" rows we need to drop. The plan is to loop through this column, look at each row whose value is "Y", and drop that entire row, and we'll do it based on the index rather than just the column; that may not make sense yet, so let's actually start writing it.

We write `for x in df.index:` to walk the index. To look up a row by index we use `loc`: `df.loc[x, 'Do_Not_Contact']` gives the value of that one column for that row (if we didn't specify the column, we'd be looking at a different value than we want). If that value equals "Y", we drop the row with `df.drop(x, inplace=True)`. I think we have to say `inplace=True` here, otherwise the drop won't take effect; the alternative is reassigning with `df = df.drop(x)`, but I don't want to start messing with that. After fixing an invalid-syntax typo, run it, and looking at the index we can already tell it worked: 1 is missing, 3 is missing, 18 is missing, and there are no more "Y" values in the column, which is really good.

While we're here, we could, and probably should, populate the blank Do Not Contact values with "N". I didn't plan on this, and my quick attempt with another loop produced some odd tripled values, maybe something that needed a strip first, so never mind, let's set that aside. Now we basically do the exact same thing for the phone numbers, because if a row has no number we don't want them calling it: copy the loop, point it at the Phone_Number column, and drop any row where the value is blank. Run it, and our list gets much smaller; a lot of rows were removed from the index, and conveniently this worked itself out because the remaining Do Not Contact values all have "N". Right now we're sitting really well: everything looks standardized and clean. You could drop the leftover address column if you want to, and the Y/N values in Paying Customer don't need anything more.

Before we hand this off to the client, we probably should reset the index, because they might be confused about why there are numbers missing, or they might use the index to count how many people they've called.
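The row-dropping loops above can be sketched as follows, with hypothetical sample data. One subtlety worth noting: each loop only drops the row it is currently inspecting, so dropping with `inplace=True` mid-iteration is safe here, though a vectorized filter or `dropna` (shown later in the video) is generally the more idiomatic route.

```python
import pandas as pd

df = pd.DataFrame({
    "Phone_Number": ["123-545-5421", "736-383-1717", ""],
    "Do_Not_Contact": ["N", "Y", "N"],
})

# Drop anyone who explicitly asked not to be contacted, row by row via the index
for x in df.index:
    if df.loc[x, "Do_Not_Contact"] == "Y":
        df.drop(x, inplace=True)

# Same idea for rows with no phone number at all
for x in df.index:
    if df.loc[x, "Phone_Number"] == "":
        df.drop(x, inplace=True)
```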
So let's go down here and run `df.reset_index()`, and see what it looks like. It does work, but as you can tell it didn't get rid of the old index completely; it actually saved the original one as a new column, and we don't need to keep that. Pass `drop=True` and it completely resets: the original index is dropped and we get a fresh one, which is what we want. Assign it with `df = ...`, and this is our final product.

One thing worth pointing out: you definitely could have done the phone-number removal differently, and I probably made it more complicated than it needed to be; that's just how my brain was working at the time. We could have used `df.dropna()`, which looks directly at null values. We couldn't do that for Do Not Contact, because there we were matching "Y" values, not nulls, but for the missing phone numbers we could have written `df.dropna(subset='Phone_Number', inplace=True)` and then `df = ...`. I can't usefully run it now (there's nothing left to drop, and running it on a different column would mess everything up), but I'll leave it in a cell with the note "another way to drop null values" for future reference.

This final product looks a lot different from when we first started: the typos, the inconsistent phone-number formatting, the combined address column, everything we just talked about is fixed, and it looks a lot better.
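The tidy-up steps above, `dropna` with a subset as the shorter alternative and `reset_index(drop=True)` to rebuild a clean index, look like this on hypothetical data with the kind of gaps the row drops left behind:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"Phone_Number": ["123-545-5421", np.nan, "876-678-3469"]},
    index=[0, 2, 5],  # gaps left behind by earlier row drops
)

# dropna(subset=...) is a shorter way to drop rows with a null phone number
df.dropna(subset=["Phone_Number"], inplace=True)

# Rebuild a clean 0..n-1 index; drop=True discards the old index
# instead of saving it as a new column
df = df.reset_index(drop=True)
```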
I purposely created this data set with these mistakes, because when you're looking at data with tens of thousands, hundreds of thousands, or a million rows, all of these problems appear at a much larger scale and you won't be able to see them as easily. You'll have to do some exploratory data analysis to find the mistakes and then clean the data, or clean as you go while you explore. These are a lot of the ways I clean data: things you can do to make your data more standardized and more presentable, and it really helps later on with visualizations and your actual data analysis.

Hello everybody, today we're going to look at exploratory data analysis using pandas. Exploratory data analysis, or EDA for short, is basically your first look at your data. During this process we'll identify patterns within the data, understand the relationships between the features, and look at outliers that may exist within the data set. While doing all that, you're also looking for mistakes and missing values that you'll need to fix in a future cleaning pass. There are hundreds of ways to perform EDA on a data set, and we can't possibly look at every single one, so I'm going to show you what I think are some of the most popular and most useful things to do when first looking at a data set.

The first thing we do is import our libraries: `import pandas as pd`, and we'll also import Seaborn and Matplotlib. During EDA I like to visualize things as I go, because sometimes you just can't fully comprehend the data until you see it, and a plot gives you a larger, broader glimpse of everything. So `import seaborn as sns` and `import matplotlib.pyplot as plt`. Run it; perfect.

Now we bring in our data set. We've worked with this world population data before, and it's exactly the one we'll use now: `df = pd.read_csv(r'...')` with the path to your CSV (your path may differ, so be sure you have the correct file path), then read it in. The data should look extremely familiar if you've done my previous pandas tutorials, but I did make some alterations to this one, took out a little data and put some in here and there, because the version exactly as I pulled it from Kaggle is too simple for what I want to show you. So be sure to download this exact data set for this video.

If your values display in scientific notation, you can change that with `pd.set_option('display.float_format', lambda x: '%.2f' % x)`. The lambda controls how each float is formatted; `%.2f` means two decimal places. I sometimes use one decimal place, but let's keep it at two. I like reading it this way a lot better than scientific notation.

Pull up the DataFrame, and one of the first things I like to do with a new data set is look at `df.info()`. This gives some really high-level information: how many columns we have, the column names, and how many values each column has. Notice where it gets interesting: we have 234 values in each column until we reach the 2022 population, where we start losing some, and then world population percentage has all 234 again. The count tells us how many values are non-null, and we also get the data types, which come in handy in a few different ways later in this tutorial.

Really quickly, a huge shout-out to the sponsor of this entire pandas series, Udemy. Udemy has some of the best courses at the best prices, and that's no exception for pandas; if you want to master pandas, the course I recommend will teach you just about everything you need to know. Now let's get back to the video.
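The setup steps above (imports, loading the data, the float-format option, and the first `info()` call) can be sketched as follows. Since the Kaggle CSV isn't bundled here, the sketch substitutes a tiny hypothetical frame for `pd.read_csv(...)`; with the real file you'd use the raw-string path exactly as in the video.

```python
import pandas as pd
# the video also runs: import seaborn as sns / import matplotlib.pyplot as plt

# Hypothetical stand-in for pd.read_csv(r"path\to\world_population.csv")
df = pd.DataFrame({
    "Country": ["China", "India", "United States"],
    "2022 Population": [1425887337.0, 1417173173.0, 338289857.0],
})

# Show plain decimals (two places) instead of scientific notation
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# High-level overview: column names, non-null counts, dtypes
df.info()
```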
The next thing I really like to run is `df.describe()`. This gives you a high-level statistical overview of all your numeric columns very quickly: the count, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th, and 75th percentiles. Just at a quick glance you can see there's a country somewhere in here whose 2022 population is 510, and in fact back in 1970 it was higher, at 752, which is interesting. The maximum 2022 population is about 1.42 billion, which I believe is China, and the 1970 maximum is 822 million, again presumably China. You can run each of these statistics individually on specific columns, but this is a nice overview of everything at once.

We just touched on the null values showing up in `info`, and I'd like to see exactly how many values we're missing, because that can be a problem: too many missing values can really obscure or even change the data set. `df.isnull().sum()` gives every column and how many values it's missing. With 234 rows of data, several columns are each missing a handful of values, so we definitely have gaps. What we choose to do with them belongs to the data cleaning process: maybe populate them with a median value, maybe delete those countries entirely (you probably won't), but these are the things you need to think about when you find missing values. This is what the EDA process is all about: finding outliers, missing values, and things that are wrong with the data, while also picking up insights along the way, so it's really important information to carry into the cleaning process.

Next, go down a cell and run `df.nunique()` (note the n: it's nunique, not unique). This shows how many unique values are in each column. It makes the most sense for Continent: there are seven continents, and we see six here (presumably because Antarctica has no countries in the data). Rank, country, and capital should all be fully unique, which makes perfect sense, and the populations are such specific, large numbers that I'd be shocked if any repeated. The world population percentage has far fewer unique values, and that also makes a lot of sense: pull the column up and many small countries share really low values like 0.00, 0.01, or 0.02, so those collapse into a single unique value each.
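The three overview calls above can be sketched on a small hypothetical slice of the world-population data (one deliberately missing value, so `isnull().sum()` has something to report):

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the world-population data with one missing value
df = pd.DataFrame({
    "Country": ["China", "India", "United States", "Indonesia"],
    "Continent": ["Asia", "Asia", "North America", "Asia"],
    "2022 Population": [1425887337.0, np.nan, 338289857.0, 275501339.0],
})

summary = df.describe()       # count/mean/std/min/percentiles/max per numeric column
missing = df.isnull().sum()   # how many values each column is missing
distinct = df.nunique()       # how many unique values each column has
```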
Now suppose we want to take a look at some of the largest countries. We could just grab the max, but I want to be a little more strategic and look at the whole top range, based on the 2022 population. `df.sort_values()` is how we order (not filter) the data. Run `df.sort_values(by='2022 Population').head()` as-is, with `head` because we just want the top of the output, and we actually get the smallest values, because sorting is ascending, lowest to highest, by default; that Vatican City row in Europe with a population of 510 is the value we spotted earlier. Add `ascending=False` (it's `True` by default) and we get the largest: the top five by population are China, India, United States, Indonesia, and Pakistan, and if we pass 10 to `head` we also bring in Nigeria, Brazil, Bangladesh, Russia, and Mexico. You can do this for literally any of these columns, continent, capital, country, growth rate, world percentage, and explore. One quick aside on world percentage before we move on: China alone is 17.88% of the world's population. Again, we're just getting in here and looking around; that's all EDA really is.

Next is something I've always liked doing: correlations, usually between the numeric columns only. `df.corr()` compares every column to every other column and reports how closely correlated they are. Look across the 2022 population row: the diagonal is a perfect one-to-one with itself, and all of the population columns are very, very highly correlated with each other. That makes perfect sense, because most countries' populations increase steadily, so the 1970 figures track the 2022 figures almost exactly. Area, on the other hand, is only somewhat correlated with population: some countries have a very high population in a small area, or vice versa, so there's no one-to-one relationship there.

But it's hard to absorb all of this just by glancing at a table of numbers; it would be a lot easier to visualize, so let's do that with a heatmap. We write `sns.heatmap()` with `df.corr()` as the data and `annot=True` (I'll show what that does in a moment), then `plt.show()` (not "shot", as I first typed). We get a first glimpse, but it looks absolutely terrible, far too cramped, so let's make the figure much larger.
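The sorting and correlation steps above can be sketched like this; the four-country frame is a hypothetical stand-in for the full data set, and `numeric_only=True` is an addition on my part so that `corr` skips the text column on recent pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["China", "India", "United States", "Vatican City"],
    "2022 Population": [1425887337, 1417173173, 338289857, 510],
    "1970 Population": [822534450, 557501301, 200328340, 752],
})

# Largest countries by 2022 population; head() defaults to the top 5
top = df.sort_values(by="2022 Population", ascending=False).head(10)

# Pairwise correlation between the numeric columns
corr = df.corr(numeric_only=True)
```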
We resize with `plt.rcParams['figure.figsize']` (note it takes brackets, it's a dictionary-style key, not a function call) set to the width and height we want. `[10, 7]` still doesn't look good; `[20, 7]` looks a lot better. The heatmap is basically a color-coded system: highly correlated cells are the light tan, all the way down to black for essentially no correlation, or even negative correlation. Looking at the 2022 population row (the populations run down this axis too), you can see very quickly that all of the population columns are extremely highly correlated, whereas rank is negatively correlated and really has nothing to do with them. Population and world population percentage are again quite correlated, but area, density, and growth rate aren't really associated with the population numbers at all. I find that really interesting: I would have assumed they went hand in hand on some level. Area makes some sense (larger area, larger population, that kind of thing), and growth rate is a percentage, so I can see it being uncorrelated, but I thought density would be more correlated than it is. All that to say, this is one way to see how your columns relate to one another, and it can definitely help you decide what to analyze later when you do your actual data analysis.

Let's go down to the next thing, something I do almost every time I'm doing this type of exploratory data analysis: grouping columns together to start looking at the data a little more closely.
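The heatmap steps above can be sketched as follows. The three-column frame is hypothetical sample data; I've also forced Matplotlib's `Agg` backend (an assumption, not from the video) so the sketch runs without a display attached.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; safe on machines without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric columns standing in for the full data set
df = pd.DataFrame({
    "2022 Population": [1425887337, 1417173173, 338289857],
    "1970 Population": [822534450, 557501301, 200328340],
    "Growth Rate": [1.0, 1.0068, 1.0038],
})

# Make the figure large enough for the cell annotations to be readable
plt.rcParams["figure.figsize"] = (20, 7)

# annot=True prints each correlation value inside its cell
ax = sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```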
Sometimes when you're doing EDA you already know the end goal of the data set: you know what you're looking for and what you're going to visualize at the end, and that really comes in handy. Sometimes you don't, and you go in blind, and so far that's really what we've been doing: throwing things at the wind, looking at some overviews, looking at correlation. Now I want to get more specific. I want a use case, something to aim for, without doing a full data analysis or diving into the depths. The question for us is: are there certain continents that have grown faster than others, and in which ways? So we want to focus on the Continent column, since we know it's the most important column for this (very fake) use case, and we can group on it and look at the population columns, because that's where we can actually see growth. We can't see growth from the density per square kilometer, the growth rate, or the world population percentage, since each of those is a single static value, but we have populations over a long span, around 50 years of data, so we can see which continents have really done well. So without talking about it even more, let's do df.groupby('Continent') (let me just copy that; I'm not good at spelling) and then .mean(), and now we have Africa, Asia, Europe, North America, Oceania, and South America. Okay, if I'm being completely honest, I knew most of these. I'm no geography expert, but I knew most of these. I don't know what this Oceania is; I genuinely don't know what that is.
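That groupby-and-mean step can be sketched like this. A minimal sketch with an invented mini data set; the continent names are real but the numbers and column names are made up:

```python
import pandas as pd

# Invented mini version of the data set
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Oceania"],
    "2022 Population": [1425.0, 1417.0, 83.0, 5.0],
    "Growth Rate": [1.00, 1.68, 0.61, 0.74],
})

# One row per continent, holding the mean of every numeric column.
# (On pandas >= 2.0, pass numeric_only=True if text columns remain in the frame.)
means = df.groupby("Continent").mean()
print(means)
```

By default groupby sorts the group labels alphabetically, which is why the continents come out in that order.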
So let's just search for that value; we'll come back up here in a second, but I want to understand what this is. We'll take df and the Continent column (let me sound that out for you guys) and do .str.contains('Oceania'), and let's run this. Oh, I need to wrap it as a filter, df[df['Continent'].str.contains('Oceania')]. Now let's run it: we're looking at the rows of our data frame whose continent is Oceania, and these look like islands, I'm guessing. We have Fiji, Guam, New Zealand, Papua New Guinea. Oceania (guys, this is tough for me to pronounce, and I'm doing my best; this is part of the EDA process, I genuinely didn't know what that name meant). It looks like islands, which would make sense, because on average they have the highest rank, and I'm guessing that's because they're mostly small countries. So let's order this really quickly: we'll add .sort_values(), sort on the average population, and pass ascending=False. Looking at the mean population, Asia has the highest on average, then South America, Africa, Europe, North America, and Oceania at the very bottom, which makes perfect sense: small islands. For world population percentage, each of the countries in Asia makes up about 1% on average, which is really interesting to know. And the density in Asia is far higher, almost double every single other continent, which is really interesting now that I'm looking at it. That's the kind of thing I would actually look into: what is this Oceania, what does it mean, let me explore it more, because I'm trying to really understand this data set. But what I want to do now is visualize this, because just looking at the table it's hard to see, and again our use case is which continent has grown the fastest, whether percentage-wise or on average as a whole. So let's copy this down here and take a look. Let's do df2 = (our grouped result), because I already know it's not going to look good just based on how the data is sitting (whoops, what am I doing? I don't need to do that, but I will), and then df2.plot() and run it just like this. As you can see (Asia, South America, Africa, Europe, North America, Oceania), we can kind of understand what's happening, but the lines being plotted are the actual numeric columns, not the continents, which is what I wanted. Switching it is actually pretty easy, and this is something that's good to know: we can transpose the frame so the continents become the columns and the columns become the index. All we have to do is say df2.transpose() with the parentheses; let's look at it first, and then we'll save it. Now all the old columns are the index and the continents are the columns. We'll say df3 = (the transposed frame); I'm just doing that so I don't write over df or my earlier data frames. So now we have df3.
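Backing up a step, the str.contains filter and the sort_values call from a moment ago can be sketched like this. The country names are real but the numbers are invented, and the column names are assumptions mirroring the video:

```python
import pandas as pd

# Invented rows; column names mirror the ones used in the walkthrough
df = pd.DataFrame({
    "Country": ["Fiji", "Guam", "New Zealand", "Germany"],
    "Continent": ["Oceania", "Oceania", "Oceania", "Europe"],
    "2022 Population": [0.9, 0.17, 5.2, 83.4],  # millions, made up
})

# Keep only the rows whose Continent contains "Oceania"
oceania = df[df["Continent"].str.contains("Oceania")]

# Sort the whole frame by population, largest first
ranked = df.sort_values(by="2022 Population", ascending=False)
```

The boolean Series from str.contains goes inside the square brackets, which is exactly the wrapping step I fumbled above.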
Now let's call df3.plot(), and it should look quite a bit different. Whoops, I didn't run this; let's run this and run this. As you can see, this does not look right at all, and the reason is that we're not only looking at the columns we want: the density, the world population percentage, and the rank are in here, and we don't need any of those. The only ones we want to keep are the population columns. To do that, we can go back up to where we created df2, the frame that we transposed, and specify that we only want specific columns. You can go through and hand-write all of the names (and by all means, go for it), but I'm going to say df.columns and run it, which gives us the list of all of our columns, and you can just copy that and paste it right in; I think it needs to be inside a list, like this. Let me try running it. Okay, that worked. Or, as a little shortcut, since df.columns is an index we can slice into it, something like df.columns[5:13], to grab just the population columns. Let's see if this works; it may not, and I may need to adjust it. There we go. So you can use indexing to save yourself some visual space, and it gives you the exact same output. Now we have our df2; let's go down and transpose it, so we just have the populations with the continents as columns, and then plot it. This looks good, although it's backward: the years are out of order. So what I actually want to do is not the slicing (it's a quick way to do it, although not the best way). I'm going to copy all of the column names and, although I said it would save us time, it did not at all. I'll put a bracket right here, paste them in, and literally reorder them by hand. I might speed this up, or I might just have you sit through it, because this is an interesting part of the process and I want you to get the full experience; you know what, that is what we're going to do, you guys can hang out with me. We have 1970 up through 2010, 2015, 2020, and 2022. Now let's run it. What did I do? Oh, too many brackets. There we go: now it's ordered appropriately, from 1970 all the way up to 2022, which is how we want it. Let's transpose it appropriately and run it, and now we basically have the inverted image of the earlier plot. Just at a glance, without doing anything else to this, we can see that in 1970 Asia (with China) is already in the lead by quite a bit, and it continues to go up drastically; especially in the 2000s it explodes, going almost straight up, and then starts leveling off. Every other continent is much lower; Oceania in particular stays really low and never does much. Look at the green line: it's gone up from roughly 0.1 to about 0.2, so it's almost doubled in the last 50 years. You can get a high-level overview of each of these continents over the span of this time, so this is one way to look at that use case. We're not going to harp on it too long; I just want to give you an example, because sometimes you'll have something in mind of what you're looking for, and you go exploring and find what's out there. The next thing I want to look at is a box plot. I personally love box plots: they're really good for finding outliers, and I already know there are a lot of outliers in this data set.
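Before the box plot, the select-columns, transpose, and plot pipeline above can be sketched end to end. A minimal sketch with invented continent-level averages; the year column names are assumptions matching the video's data set:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import pandas as pd

# Invented continent-level averages (billions); the index is the groupby result
grouped = pd.DataFrame(
    {
        "Rank": [90.0, 25.0, 80.0],
        "1970 Population": [2.1, 0.65, 0.02],
        "2000 Population": [3.7, 0.73, 0.03],
        "2022 Population": [4.7, 0.74, 0.04],
    },
    index=["Asia", "Europe", "Oceania"],
)

# Keep only the population columns, listed in chronological order
year_cols = ["1970 Population", "2000 Population", "2022 Population"]
df2 = grouped[year_cols]

# Transpose so the years become the index and the continents become the
# columns: plotting then draws one line per continent across time
df3 = df2.transpose()
ax = df3.plot()
```

Listing the columns explicitly, rather than slicing `grouped.columns`, is what guarantees the chronological order on the x-axis.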
The 25th and 50th percentiles are very low while a few values are huge. For your data set it may not be that way, and those outliers may be something you really need to look into. Box plots are something I've used a lot: I've found outliers with them, started digging into the data, and come across things where I realized I had to clean the data up or go back to the source. Really, really powerful and useful for finding these. All you have to do is df.boxplot(), and let's take a look at it. This already looks good as is; maybe I'll make it a little bit wider with figsize=(20, 10). Okay, that didn't help at all; I apologize, I thought it would, but let's keep going. What this is showing us: these little boxes down here, which are usually much larger when the values are more evenly distributed, are where the quartiles lie; this line is the upper range; and all of these open circles actually stand for outliers. Looking at the 2022 population, there are a lot of outliers. For our data set, knowing the data is really important: outliers are to be expected, because most countries are small. Each of these dots is an outlier value, and each value corresponds to a country. If this were a different data set, I would be searching on these points and trying to find out what's wrong with them, if anything, or whether they're real numbers. If this were revenue, and everyone's revenue were way down here but one company was making ten trillion dollars, that would be an outlier up here and definitely something you would want to look into. For our data set, since we're looking at population, this is more than acceptable, oddly enough. That's what box plots are really good for: showing you the quartiles, the upper and lower ranges, and denoting the points that fall outside those normal ranges so you can look into them. Really useful. So now let's go down here and pull up our data frame again. We've kind of zoomed through the whole EDA process, and there's one last thing I want to show you; we're honestly ending on a low point, because the last stuff was much more exciting, but here it is: df.dtypes. Let's run it. Just like info(), it gives us the type of each column, but we're actually able to filter on those types now (object, float, and integer), which is really great, because we can pass include= with something like 'number'. None of the dtypes literally say 'number', but when we run it... I'm getting an error, 'Series' object is not callable. Oh, that's because dtypes gives us a Series; the method we want is select_dtypes. Now let's run df.select_dtypes(include='number'), and it only returns the columns of the data frame whose data types are numeric, so you won't see the country or any of the text or string columns. If we want those, we go in here, pass include='object', and run that. This is another really quick way to filter columns by type; we could even pass 'float', and then it won't include the rank, which is an integer. So we can specify the data type and it'll filter all of the columns based on that, which is good to know when you're doing this kind of work, because there might be some type of analysis you want to perform on just
that, whether it's the numeric columns or just the string or integer columns within your data set. So again, ending on a low note, I apologize, but everything else we looked at covers things I typically do in some way or another when I'm first looking at a data set. Exploratory data analysis is really just the first look: you look at the data, then you clean it up in the data cleaning process, then you do your actual data analysis, finding those trends and patterns, and then you visualize it in some way to draw meaning, insight, or value from that data. There are a thousand different ways to go about this, and it does typically depend on the data set, but these are approaches that work on a lot of different data sets, and that's why I went through the things we looked at in this video. I hope you liked it and found something useful in this tutorial. If you enjoyed this video, be sure to like and subscribe, check out all my other videos on pandas and Python, and I will see you in the next video.
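As a postscript, the last two steps of the walkthrough, the box plot and the dtype filtering, can be sketched together. A minimal sketch with an invented mini data set (column names are made up); the Tukey fence shown is the standard rule matplotlib uses to draw the outlier circles:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import pandas as pd

# Invented mini data set: mostly small values plus one huge outlier
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F", "G"],            # object column
    "Rank": [1, 2, 3, 4, 5, 6, 7],                             # integer column
    "2022 Population": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 1400.0], # float column
})

# Box plot: values above Q3 + 1.5*IQR are drawn as open outlier circles
ax = df.boxplot(column="2022 Population", figsize=(20, 10))
q1, q3 = df["2022 Population"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# dtype filtering: df.dtypes is just a Series; select_dtypes does the filtering
numeric_cols = df.select_dtypes(include="number")  # Rank and 2022 Population
float_cols = df.select_dtypes(include="float")     # drops the integer Rank
text_cols = df.select_dtypes(include="object")     # just Country
```

Here the 1400 sits far above the fence, so it shows up as a lone circle, just like the outlier countries in the population box plot.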

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog

