Version: 3.2

Data preprocessing

Once the repository has been cloned into your workspace using Git Integration, you'll find a notebook that performs operations such as scraping data from TMDB using an API key.
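The notebook contains the actual scraping code; purely as an illustrative sketch, assuming the standard TMDB v3 endpoints and your own API key (the variable names below are illustrative, not the notebook's own), the idea looks roughly like this:

import requests

api_key = "YOUR_TMDB_API_KEY"

# Fetch the list of genres available in the TMDB database.
genres_url = "https://api.themoviedb.org/3/genre/movie/list"
genres = requests.get(genres_url, params={"api_key": api_key}).json()["genres"]

# For each genre, fetch a page of movies; every result carries an "overview" field.
movies = []
for genre in genres:
    discover_url = "https://api.themoviedb.org/3/discover/movie"
    page = requests.get(discover_url, params={"api_key": api_key, "with_genres": genre["id"]}).json()
    movies.extend(page.get("results", []))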

The scraping step fetches all the genres available in the database along with the movie overviews assigned to those genres. Some movies appear more than once, so we remove the duplicates and keep only the movies that have an overview; a movie without an overview is of no use to us. The notebook takes care of all of this preprocessing.
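A minimal sketch of that deduplication and filtering step, assuming movies is the list of movie dictionaries collected during scraping (again, the names are illustrative):

# Keep each movie only once, and only if it actually has an overview.
seen_ids = set()
movies_with_overviews = []
for movie in movies:
    if movie["id"] not in seen_ids and movie.get("overview"):
        seen_ids.add(movie["id"])
        movies_with_overviews.append(movie)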

After collecting the genres and overviews for the movies, we need to preprocess the overviews, because a machine learning model can't be trained on raw text. That means removing unnecessary punctuation and converting the text into numerical features. We perform this transformation with three different approaches, illustrated with a small sketch after the list below.

  • Count Vectorizer: It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and want to convert each word in each text into a vector (for use in further text analysis).

  • TF–IDF: Also known as Term Frequency–Inverse Document Frequency. It measures how relevant a word in a series or corpus is to a text. The relevance increases proportionally with the number of times a word appears in the text, but is offset by the frequency of that word across the corpus (data).

  • Bag of Words: The Bag-of-Words model, or BoW for short, is a way of extracting features from text for use in modelling, such as with machine learning algorithms. The approach is simple and flexible and can be used in a myriad of ways to extract features from documents.
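As a quick, self-contained illustration of these approaches on a toy corpus (scikit-learn's CountVectorizer doubles as a plain bag-of-words model, and TfidfVectorizer combines the counting with TF-IDF re-weighting):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "a young wizard attends a school of magic",
    "a detective investigates a murder in a small town",
]

# Count Vectorizer / Bag of Words: raw word counts per document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF: the same counts, re-weighted by how rare each word is across the corpus.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray().round(2))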

Let's preprocess the overviews: remove unnecessary punctuation, remove stop words, and then apply the transformation techniques.

# Imports used below: pandas for the DataFrame, scikit-learn for the vectorizers,
# and neattext for stop-word removal.
import pandas as pd
import neattext.functions as nfx
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Making a DataFrame with the overview text and the target columns
# (Y and mlb are the binarized genre labels produced earlier in the notebook).

content = [movie["overview"] for movie in movies_with_overviews]

final_data = pd.concat([pd.DataFrame(content, columns = ["overview"]), pd.DataFrame(Y, columns = mlb.classes_)], axis = 1)

# Removing punctuation from the overviews.

final_data["overview"] = final_data["overview"].apply(lambda x: remove_punctuation(x))

# Removing stop words from the overviews.

final_data["overview"] = final_data["overview"].apply(lambda x: nfx.remove_stopwords(x))

# Count features: keep only words that appear in at least 0.5% and at most 95% of the overviews.
vectorize = CountVectorizer(max_df = 0.95, min_df = 0.005)
X = vectorize.fit_transform(final_data["overview"])

# Re-weight the word counts with TF-IDF.
tfidf_transformer = TfidfTransformer()

X_tfidf = tfidf_transformer.fit_transform(X).toarray()
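The remove_punctuation helper used above is assumed to be defined elsewhere in the notebook; a minimal version based on Python's built-in string module could look like this:

import string

def remove_punctuation(text):
    # Strip every ASCII punctuation character from the overview.
    return text.translate(str.maketrans("", "", string.punctuation))

At this point X_tfidf is a dense feature matrix with one row per movie and one column per vocabulary word, and the genre columns of final_data serve as the multi-label targets for training.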