
Data cleaning

Machine Learning

Before working with a dataset, we should clean and prepare it: for example, delete duplicate or incomplete rows, or remove outliers. We should also separate the input data from the output data (labels).

python
X = df.drop(columns=["OUTPUT"])  # inputs: a new DataFrame with every column except OUTPUT
Y = df["OUTPUT"]  # labels: the OUTPUT column as a Series
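
The other cleaning steps mentioned above (duplicate rows, incomplete rows, outliers) could be sketched as follows. This is only an illustration on made-up data: the column names feature_a and feature_b are hypothetical, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Toy data, made up for illustration; "OUTPUT" mirrors the label column used above.
df = pd.DataFrame({
    "feature_a": [1.0, 1.0, 2.0, 3.0, None, 2.5, 100.0],
    "feature_b": [10, 10, 20, 30, 40, 25, 35],
    "OUTPUT":    [0, 0, 1, 1, 0, 1, 0],
})

df = df.drop_duplicates()  # remove exact duplicate rows
df = df.dropna()           # remove rows that contain any missing value

# Remove outliers in feature_a with the 1.5 * IQR rule (an assumed convention).
q1 = df["feature_a"].quantile(0.25)
q3 = df["feature_a"].quantile(0.75)
iqr = q3 - q1
in_range = df["feature_a"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True)

# Separate inputs and labels after cleaning, so both stay aligned row by row.
X = df.drop(columns=["OUTPUT"])
Y = df["OUTPUT"]
print(len(df))
```

Splitting into X and Y after the rows are cleaned keeps inputs and labels aligned; dropping rows from X alone would leave Y with extra entries.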

Another example: cleaning a dataset and filling empty or NaN fields (for more detail, read this link)

python
import pandas as pd
# Treat these placeholder strings as missing values when reading the CSV
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("Video_Games_Sales_as_at_22_Dec_2016.csv", na_values=missing_values)
# Inputs: drop the text columns, including the label column Genre
X = df.drop(columns=["Platform", "Name", "Genre", "Publisher", "Developer", "Rating"])
# Labels: fill missing genres with a fixed default value
Y = df["Genre"].fillna("Sports")
# Fill missing scores and counts with the median of each column
# (assign the result back; calling fillna(inplace=True) on a selected column may not update X)
for col in ["Critic_Score", "Critic_Count", "User_Score", "User_Count"]:
    X[col] = X[col].fillna(df[col].median())
# Fill missing release years with a fixed default value
X["Year_of_Release"] = X["Year_of_Release"].fillna(2006.0)

Count the null fields in each column with the command below:

python
print(df.isnull().sum())
output
Name 2
Platform 0
Year_of_Release 269
Genre 0
Publisher 54
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
Critic_Score 8582
Critic_Count 8582
User_Score 9129
User_Count 9129
Developer 6623
Rating 6769
dtype: int64
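
Raw counts like these are easier to judge as a share of all rows. The following is a minimal sketch on a made-up frame (the values are invented; only the column names mimic the dataset above) that prints the fraction of missing entries per column and then checks that a median fill leaves no nulls behind.

```python
import pandas as pd

# Toy frame, made up for illustration; column names mimic the dataset above.
toy = pd.DataFrame({
    "Critic_Score": [76.0, None, 82.0, None],
    "User_Score":   [8.0, 8.3, None, None],
    "Global_Sales": [82.53, 40.24, 35.52, 32.77],
})

# Fraction of null entries per column (mean of the boolean null mask).
missing_share = toy.isnull().mean()
print(missing_share)

# Fill each numeric column with its own median, then count the nulls that remain.
filled = toy.fillna(toy.median(numeric_only=True))
remaining = int(filled.isnull().sum().sum())
print(remaining)
```

A column with a very high share of missing values may be a candidate for dropping rather than filling, since the fill value would then dominate the column.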