
Natural Language Processing - Part 3


In this story, we will implement a natural language processing pipeline. This story is part of an ongoing series on NLP. In Part 1 and Part 2, we covered an introduction to NLP and how NLP works. The theory behind the pipeline is covered in Part 2.

We will be building a spam/ham message-filtering pipeline, using the following tools and libraries:

  • Python 3.6
  • Jupyter Notebook
  • Pandas
  • Scikit-Learn
  • NLTK (Natural Language Toolkit)

What is NLTK?

It is the most widely used package for handling NLP with Python. This suite of open-source tools was created to make NLP in Python easier to build. With it, we won't need to make everything from scratch, as it provides essential tools that we can chain together to accomplish our goal.

Instructions to download NLTK are available at http://www.nltk.org/install.html.

Reading in the text data

Usually, you will find text data in a semi-structured or unstructured format, i.e., data with no delimiters, no clear indication of rows, or even binary data.

  • The read_csv method in pandas is used to read the data. The .tsv extension indicates tab-separated values.
  • sep is the text separator. In our case, we use \t, i.e., the tab character.
  • header=None tells pandas that the file has no header row; by default, it would treat the first row as the header.
  • We set the headers ourselves using data.columns. We have used label and messages as column names.
  • The .head() function shows the first five rows of the dataset.
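The steps above can be sketched as follows. The two-row inline sample here is a stand-in; in practice, you would pass the path to your own spam/ham .tsv file to read_csv:

```python
import io
import pandas as pd

# Stand-in for the real .tsv file; replace io.StringIO(sample) with the
# path to your dataset, e.g. pd.read_csv("your_file.tsv", sep="\t", header=None)
sample = "ham\tI've been searching for the right words\nspam\tFree entry in 2 a wkly comp"
data = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# The file has no header row, so we name the columns ourselves
data.columns = ["label", "messages"]

# .head() shows the first five rows (here, both rows)
print(data.head())
```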

1. Exploring the dataset

  • The shape attribute gives the result in (rows, columns) format.
  • To check how many ham and spam messages are present:
    len(data[data['label'] == 'spam'])
    This takes the complete dataset data, filters out the rows that have "label" equal to spam, and then takes the len of the result to get the number of rows.
  • To check for missing values:
    The isnull method looks at all the data present under each column and returns True or False depending on whether a value is missing, and .sum counts all the missing values.
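These exploration steps can be sketched as follows (with a small hand-made DataFrame standing in for the loaded dataset):

```python
import pandas as pd

# Stand-in for the `data` frame loaded in the previous step
data = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "messages": ["see you soon", "WIN a prize now", "on my way"],
})

# shape gives (rows, columns)
print(data.shape)  # (3, 2)

# Filter the rows whose label is "spam", then count them with len()
print(len(data[data["label"] == "spam"]))  # 1

# isnull() flags missing cells as True; .sum() counts them per column
print(data.isnull().sum())
```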

2. Cleaning or Preprocessing of data

  • Removing Punctuation

data['messages'].apply(lambda x: remove_punctuation(x)) will apply the remove_punctuation function to each row of the "messages" column and store the result in the "cleaned_message" column.

In the remove_punctuation function, we iterate over each character of the message, and if it is not part of string.punctuation, we add it to the final list and then turn the list back into a string with the .join method.
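A minimal sketch of this function, using only the standard library:

```python
import string

def remove_punctuation(text):
    # Keep every character that is not in string.punctuation,
    # then join the surviving characters back into one string
    return "".join(ch for ch in text if ch not in string.punctuation)

print(remove_punctuation("Hello, world!"))  # Hello world

# Applied row-wise in pandas:
# data["cleaned_message"] = data["messages"].apply(lambda x: remove_punctuation(x))
```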

  • Tokenization, i.e., splitting the sentence into words.

We are using the re module for tokenization. You can use the tokenize methods of the NLTK library for the same purpose.
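A sketch of regex-based tokenization; here \w+ matches runs of word characters, which is one of several possible patterns:

```python
import re

def tokenize(text):
    # \w+ matches runs of letters/digits/underscores, so punctuation is
    # skipped; lower-casing first keeps the vocabulary consistent
    return re.findall(r"\w+", text.lower())

print(tokenize("I've been searching!"))  # ['i', 've', 'been', 'searching']
```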

  • Remove stopwords

We iterate over each word of the message, and if it is part of the stopword list, we do not add it to the final list. You can see in the results that the first row under tokenized_message has words like 've', 'been', and 'for', which are later removed under msg_nostop.

You can do the above three steps in a single function. I did them in three different functions to give you better context on each function's output and to let you compare the outcome of each step.

All the above steps performed in one function:

  • Lemmatize

You can see that in the new column, goes turned into go, lives turned into life, aids into aid, and so on.

3. Train Test Split

We will use two other columns, msg_length and punct_percentage, which hold the message length and the punctuation percentage. Let's create these two columns first:

We split the data into a training set and a test set before creating our model. As the names signify, one will be used to teach the model, and the other will be used to evaluate it.
In the train_test_split function, we pass the X features (our columns), the labels, and test_size.
test_size is the percentage of the original dataset we want to allocate to the test set; a commonly used value is 20%, i.e., 0.2.
The train_test_split function returns its output as four datasets.
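A sketch of the split, with a small stand-in DataFrame in place of the real features:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in features and labels; in the article, X also carries
# msg_length and punct_percentage alongside the messages
data = pd.DataFrame({
    "label": ["ham", "spam"] * 10,
    "messages": ["see you soon", "WIN a prize now"] * 10,
})
X = data[["messages"]]
y = data["label"]

# 20% of the rows go to the test set; four datasets come back
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))  # 16 4
```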

4. Vectorization

There are many ways to encode text as integers to create feature vectors. We mentioned different types of vectors in our previous part. Let's explore them in a bit more detail by implementing them in this part.

  • Count Vectorization

In the previous steps, we applied the clean_text function with a lambda, but with count vectorization, you can pass the function name to the analyzer parameter itself, and it will apply that function at the same time that it vectorizes the text.
We have used fit_transform because calling fit alone would not change our data; it would only train the vectorizer object to learn which words are in the corpus. If we want to fit the vectorizer and then actually turn our data into feature vectors, we need to call fit_transform, which does the fitting and then transforms our data. So X_counts stores the vectorized version of the data.
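A sketch of this step (stopword removal is omitted from clean_text here for brevity, and the corpus is a stand-in for the real messages):

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    # Strip punctuation, then tokenize into lower-cased words
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.findall(r"\w+", text.lower())

corpus = ["WIN a FREE prize now!!!", "see you at the game", "free entry, win now"]

# Passing our function as the analyzer makes the vectorizer clean
# each document while it vectorizes
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(corpus)

print(X_counts.shape)  # (documents, unique words)
```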

The raw output of the CountVectorizer is what's called a Sparse Matrix. So what is a Sparse Matrix? A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix stores only the locations and values of the non-zero elements. So, to print out the matrix, we have to expand this sparse matrix into a collection of arrays and then create a data frame from that.

Just like we implemented the Count Vectorizer, we can similarly implement the TF-IDF.

The only difference between the TF-IDF vectorizer and the count vectorizer is what's in the actual cells, not the shape of the matrix itself. We will take a sample of the data and apply the same functionality as above to show the difference. Also, we have used toarray() to view the matrix and get_feature_names to label the matrix columns with words instead of integer positions.

In our count vectorizer, the cells held regular integers indicating counts, but here you have decimals instead. I have marked two values: 0.2323 is likely more important than 0.19605. What that means is either that 12 occurs more frequently in the 1st text message than 11 does in the 2nd text message, or that 12 occurs less frequently across all the other text messages than 11 does.

TF-IDF is essentially a count vectorizer that also accounts for the length of the document and how common the word is across the other text messages.

Now that we have understood the implementation of the vectorizers, let's apply them to the training/test set that we created in the previous step (Step 3).

We have only used .fit in the 2nd line, so the vectorizer stored all the words from our training set and will use them to create the columns when we transform both our training and our test set. In other words, all we did was use the training set to fit our vectorizer object, and then we stored that object. In the next two steps, we transformed the data, creating tfidf_train and tfidf_test. We then concatenated this vectorized data back with msg_length and punct_percentage to give us our X features.

We have used reset_index to drop the old index because pandas concatenates on the basis of indexes. The DataFrame created from tfidf_train comes with a brand-new set of indices, while X_train keeps the index from the original dataset. Because the text messages are still in the same order but the indices do not match, we simply drop the old index so that the index for X_train matches the index of this new data frame.

In the earlier step, we saw that the vectorizer on the entire dataset generated over 8,000 features. In other words, it recognized over 8,000 unique words. But we got around 7,000 this time. That's because this vectorizer was fit only on the training data, which means that around 1,000 words appearing only in the test set won't be recognized by the vectorizer and will be ignored.
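A sketch of fitting on the training text only, transforming both sets with the same vectorizer, and concatenating the result back (the stand-in frame here omits msg_length and punct_percentage for brevity):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Stand-in data; in the article's pipeline X_train/X_test also carry
# msg_length and punct_percentage
data = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"] * 5,
    "messages": ["see you soon", "win a free prize",
                 "on my way", "free entry now"] * 5,
})
X_train, X_test, y_train, y_test = train_test_split(
    data[["messages"]], data["label"], test_size=0.2)

# Fit ONLY on the training text, so the vocabulary comes from it alone
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train["messages"])

# Transform both sets with the same fitted vectorizer
tfidf_train = tfidf_vect.transform(X_train["messages"])
tfidf_test = tfidf_vect.transform(X_test["messages"])

# reset_index(drop=True) lines the indices up before concatenating the
# vectorized columns back alongside the other features
X_train_vect = pd.concat(
    [X_train.reset_index(drop=True),
     pd.DataFrame(tfidf_train.toarray())], axis=1)
print(X_train_vect.shape)
```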

5. Modeling

max_depth indicates the maximum depth of each decision tree. n_jobs=-1 parallelizes the process across all available cores. We need to specify pos_label in the score function when the labels are not binary 0/1; ours is spam.
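A sketch of this modeling step with a random forest (the features here are random stand-ins for the vectorized data, and n_estimators=50 is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Toy features/labels standing in for the real vectorized pipeline
rng = np.random.RandomState(0)
X_train = rng.rand(80, 5); y_train = rng.choice(["ham", "spam"], 80)
X_test = rng.rand(20, 5); y_test = rng.choice(["ham", "spam"], 20)

# max_depth caps the depth of each tree; n_jobs=-1 uses every CPU core
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Labels are strings, not 0/1, so pos_label tells the scorer which
# class counts as "positive"
print(precision_score(y_test, y_pred, pos_label="spam", zero_division=0))
print(recall_score(y_test, y_pred, pos_label="spam", zero_division=0))
```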


  • Precision of 100% means that when the model identifies something as spam, it is spam.
  • Recall of 83.1% means that of all the spam that came into your email, 83.1% was placed correctly in the spam folder, while the other 16.9% went into your inbox.
  • Accuracy of 97.4% means that of all the emails that arrived, spam or non-spam, 97.4% were identified correctly as one or the other.

Let's use the above model and try it on a custom text to check whether it is spam or ham.

We take a random message, create a DataFrame with it, and then use our model to predict whether it is ham or spam.
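A sketch of this last step; the tiny training corpus and the custom message below are placeholders, and the custom message must be vectorized with the same fitted vectorizer the model was trained on:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in training set; the real pipeline trains on the full corpus
train_msgs = ["win a free prize now", "free entry win cash",
              "see you at the game", "on my way home"]
train_labels = ["spam", "spam", "ham", "ham"]

tfidf_vect = TfidfVectorizer()
X_train = tfidf_vect.fit_transform(train_msgs)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, train_labels)

# Wrap the custom message in a DataFrame, vectorize it with the SAME
# fitted vectorizer, then predict
custom = pd.DataFrame({"messages": ["win a free prize"]})
X_custom = tfidf_vect.transform(custom["messages"])
print(rf.predict(X_custom))
```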

This brings us to the end of our NLP series. Happy learning!