
Data Preprocessing

In the data preprocessing stage, we perform operations such as imputing missing values, dropping features that don't carry any useful information, handling object columns, encoding categorical values, and normalizing continuous variables.

Examples:

# Imputing null values in home_ownership with the mode.

data["home_ownership"] = data["home_ownership"].fillna(data["home_ownership"].mode()[0])


# Filling the dti (debt-to-income) column with the median.

data["dti"] = data["dti"].fillna(data["dti"].median())

# Dropping the last_major_derog_none column, because about 95% of its values are null.

data.drop("last_major_derog_none", axis=1, inplace=True)

With conversions like these, we keep the data clean and informative.
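
The intro also mentions normalizing continuous variables; here is a minimal sketch using scikit-learn's MinMaxScaler (num_cols is a hypothetical list of your continuous feature names):

from sklearn.preprocessing import MinMaxScaler

# Hypothetical continuous columns; replace with your dataset's numeric features.
num_cols = ["dti", "annual_inc"]

scaler = MinMaxScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])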

One-Hot Encoding

# Separating the object (string) columns from the rest.

obj_cols = data.select_dtypes(include="object").columns.tolist()

# Once the object columns are separated, we can one-hot encode them.

import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")

# get_feature_names_out replaces the deprecated get_feature_names in newer
# scikit-learn; keeping data's index makes the join below align row-for-row.
enc_df = pd.DataFrame(enc.fit_transform(data[obj_cols]).toarray(),
                      columns=enc.get_feature_names_out(obj_cols), index=data.index)

# Persisting the fitted encoder for reuse at inference time.
with open("One_Hot_Encoder.pkl", "wb") as files:
    pickle.dump(enc, files)

enc_df = data.join(enc_df)

clean_data = enc_df.drop(obj_cols, axis=1)
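
As a usage sketch, the pickled encoder can be reloaded later to transform new data with exactly the same columns (new_data is a hypothetical DataFrame containing the same object columns):

import pickle

with open("One_Hot_Encoder.pkl", "rb") as f:
    enc = pickle.load(f)

# handle_unknown="ignore" means categories unseen during fit encode as all zeros.
new_enc = enc.transform(new_data[obj_cols]).toarray()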

Now that the entire dataset is clean, we can store these features in the feature store.
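
The exact write call depends on which feature store you use; as a minimal stand-in, the cleaned features can be persisted to a Parquet file that a feature store could ingest:

# Stand-in for a real feature-store write: persist the cleaned features to Parquet.
clean_data.to_parquet("clean_features.parquet", index=False)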