Data Preprocessing

Once the repository has been cloned into the workspace using Git Integration, you'll find a notebook that performs operations such as scraping data from TMDB using an API key.

The scraping step retrieves all genres available in the database and fetches the movie overviews associated with those genres. Some movies appear more than once, so we remove the duplicates and keep one record per movie with its overview. We also drop records where the movie has no overview at all. The notebook takes care of all of these pre-processing steps.
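The duplicate and missing-overview handling can be sketched with pandas. The records, titles, and column names below are illustrative stand-ins; in the notebook the real data comes from the TMDB API:

```python
import pandas as pd

# Hypothetical sample of scraped movie records (the real data comes from TMDB).
movies = pd.DataFrame({
    "title": ["Inception", "Inception", "Heat", "Unknown Film"],
    "overview": [
        "A thief who steals corporate secrets...",
        "A thief who steals corporate secrets...",   # exact duplicate record
        "A group of professional thieves...",
        None,                                        # no overview available
    ],
})

# Drop exact duplicate records, then drop records with no overview.
movies = movies.drop_duplicates(subset=["title", "overview"])
movies = movies.dropna(subset=["overview"]).reset_index(drop=True)

print(movies["title"].tolist())  # ['Inception', 'Heat']
```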

After collecting the genres and overviews for the movies, we need to preprocess the overviews, because machine learning models cannot be trained on raw text. So we first apply commonly used techniques to clean the data, and once the data is clean we convert the text into numbers/vectors. We apply the following transformations to achieve this:

  • Count Vectorizer: It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and wish to convert each word in each text into a vector (for use in further text analysis).

  • TF-IDF: Short for Term Frequency-Inverse Document Frequency. It measures how relevant a word in a document is to a corpus: the score increases proportionally with the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus.

  • Bag of Words: The Bag-of-Words model, or BoW for short, is a way of extracting features from the text for use in modelling, such as with machine learning algorithms. The approach is very simple and flexible and can be used in a myriad of ways for extracting features from documents.

Let's pre-process the overviews. The steps are:

  1. Removing the unnecessary punctuations.
  2. Removing stop words.
  3. Removing duplicate records.
  4. Removing records where there are no overviews available.
  5. Applying the Transformation Techniques.
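The `remove_punctuation` helper called in the code below isn't defined in this excerpt; a minimal sketch using Python's standard punctuation table might look like this:

```python
import string

def remove_punctuation(text):
    # Strip every character in string.punctuation (e.g. . , ! ? ; :).
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Wait... what?!"))  # Wait what
```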
# Making a DataFrame with the above text content and Target columns.

content = [movie["overview"] for movie in movies_with_overviews]

final_data = pd.concat([pd.DataFrame(content, columns = ["overview"]), pd.DataFrame(Y, columns = mlb.classes_)], axis = 1)

# Removing Punctuations from the Above Overviews.

final_data["overview"] = final_data["overview"].apply(lambda x:remove_punctuation(x))

# Removing stop words from Overviews.

final_data["overview"] = final_data["overview"].apply(lambda x: nfx.remove_stopwords(x))

vectorize = CountVectorizer(max_df = 0.95, min_df = 0.005)

# fit_transform returns a sparse matrix; keep it sparse for the TF-IDF step.
X = vectorize.fit_transform(final_data["overview"])

tfidf_transformer = TfidfTransformer()

X_tfidf = tfidf_transformer.fit_transform(X).toarray()
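As a sanity check (and a one-step shortcut), scikit-learn's `TfidfVectorizer` combines both stages: on the same corpus it produces the same matrix as `CountVectorizer` followed by `TfidfTransformer` with matching parameters. A small demonstration on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

corpus = ["the spy saves the world", "the world ends"]

# Two-step pipeline, as in the notebook above.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# One-step equivalent: TfidfVectorizer = CountVectorizer + TfidfTransformer.
one_step = TfidfVectorizer().fit_transform(corpus).toarray()

print(np.allclose(two_step, one_step))  # True
```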