A simple review of Term Frequency – Inverse Document Frequency

TF-IDF is short for Term Frequency-Inverse Document Frequency. It is a vectorization technique used in the field of Natural Language Processing. Yes, I know, it is a daunting-looking phrase, but trust me, it’s a lot simpler than it sounds.


Uses of TF-IDF

Natural Language Processing, or NLP, is the field of Machine Learning that deals with natural language data such as user comments and reviews, for tasks including but not limited to sentiment analysis and text translation.

The one common obstacle with all of the problem statements associated with NLP is that the input data is textual. A quick reminder for all of you guys from Machine Learning 101 – The algorithm requires numbers! It cannot process textual data. So, what now? How do we proceed?

One innovative technique introduced was Vectorization. It involved associating every sentence with a vector of numerical values and this vector became the input data that was fed into the model. TF-IDF is one such vectorizing technique. Now that we know “the what?” and “the why?”, let’s dive into “the how?”

Prerequisites

There is a need for you guys to know just a little more before diving into the working of TF-IDF. Yes, I could have included it here but that would have made this article lengthy and boring. Don’t worry, it’s a very quick read. It covers all the basics of processing data in textual format. Here’s the link to our article – Processing Textual Data

Working of TF-IDF

TF-IDF is usually applied to a bunch of documents each consisting of a bunch of sentences to understand the significance of a word in a document and the set of documents. For ease of understanding, let us consider 3 documents and say that each of the documents has 1 sentence. Assume that each of these sentences has been passed through the tokenization and normalization techniques mentioned in the prerequisites article.

Document 1: run fast

Document 2: run slow

Document 3: walk fast run fast

Here we first calculate “Term Frequency” and then “Inverse Document Frequency” for each word and multiply these 2 values to obtain the vector for a sentence.

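Let us first make a frequency table that counts, for each unique word, the number of documents it appears in:

| Word | Documents containing the word |
| --- | --- |
| run | 3 |
| fast | 2 |
| slow | 1 |
| walk | 1 |

Frequency Table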

Term Frequency

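For a term t in a document d, it is calculated as follows:

$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$

The formula for calculating the Term Frequency

Term Frequency treats a word's importance within a document as directly proportional to how often it occurs there. On its own, this measure gives equal weight to all terms, whether they are common across the documents or rare.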

| | run | fast | slow | walk |
| --- | --- | --- | --- | --- |
| Sentence 1 | 1/2 | 1/2 | 0 | 0 |
| Sentence 2 | 1/2 | 0 | 1/2 | 0 |
| Sentence 3 | 1/4 | 2/4 | 0 | 1/4 |

The Term Frequency for each term
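As a quick check, the Term Frequency values can be reproduced with a few lines of Python. This is only a minimal sketch: the names `documents`, `vocabulary`, and `term_frequency` are made up for this illustration, and it assumes the sentences are already tokenized and normalized. `Fraction` is used so the values print as fractions, which means 2/4 is shown in reduced form.

```python
from collections import Counter
from fractions import Fraction

# The three example documents, already tokenized and normalized (assumed input).
documents = {
    "Sentence 1": ["run", "fast"],
    "Sentence 2": ["run", "slow"],
    "Sentence 3": ["walk", "fast", "run", "fast"],
}
vocabulary = ["run", "fast", "slow", "walk"]


def term_frequency(tokens):
    """Return {word: occurrences in tokens / total number of tokens}."""
    counts = Counter(tokens)  # words that are absent count as 0
    return {word: Fraction(counts[word], len(tokens)) for word in vocabulary}


tf_table = {name: term_frequency(tokens) for name, tokens in documents.items()}

for name, row in tf_table.items():
    print(name, [str(row[word]) for word in vocabulary])
```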

Inverse Document Frequency

For a term t, IDF is calculated as follows, where N is the total number of documents:

$$\mathrm{IDF}(t) = \log_{10}\left(\frac{N}{\text{number of documents containing } t}\right)$$

The formula for calculating the Inverse Document Frequency

IDF uses a logarithmic function to provide an inverse relationship between a word’s importance and how many documents it appears in. The value returned by the IDF for a particular word decreases as the word shows up in more documents, i.e., the rarer the word across the set of documents, the more important it is to a document that contains it. (log10 here means log to the base 10)

| Word | IDF |
| --- | --- |
| run | log10(3/3) |
| fast | log10(3/2) |
| slow | log10(3/1) |
| walk | log10(3/1) |

The Inverse Document Frequency corresponding to each term
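The IDF values can be sketched in Python along the same lines. The function name `inverse_document_frequency` and the variable names are assumptions made for this example. It counts how many of the 3 documents contain each word and applies log10 to N divided by that count. Note that slow appears in only one document (Document 2), so its ratio is 3/1.

```python
import math

# The three example documents, already tokenized and normalized (assumed input).
documents = [
    ["run", "fast"],
    ["run", "slow"],
    ["walk", "fast", "run", "fast"],
]


def inverse_document_frequency(docs):
    """Return {word: log10(N / number of documents containing the word)}."""
    n_docs = len(docs)
    vocabulary = sorted({word for doc in docs for word in doc})
    idf = {}
    for word in vocabulary:
        # Document frequency: in how many documents does the word occur at least once?
        containing = sum(1 for doc in docs if word in doc)
        idf[word] = math.log10(n_docs / containing)
    return idf


idf = inverse_document_frequency(documents)
for word, value in idf.items():
    print(f"{word}: {value:.4f}")
```

A word that occurs in every document, like run here, gets an IDF of log10(3/3) = 0.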

Final Vectors

Multiplying TF by IDF balances the two measures and yields a single number for each word in each sentence, which lets us represent every sentence as a vector.

| | run | fast | slow | walk |
| --- | --- | --- | --- | --- |
| Sentence 1 | 1/2 x log10(3/3) | 1/2 x log10(3/2) | 0 x log10(3/1) | 0 x log10(3/1) |
| Sentence 2 | 1/2 x log10(3/3) | 0 x log10(3/2) | 1/2 x log10(3/1) | 0 x log10(3/1) |
| Sentence 3 | 1/4 x log10(3/3) | 2/4 x log10(3/2) | 0 x log10(3/1) | 1/4 x log10(3/1) |

Final Vectors corresponding to each sentence
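Putting the two steps together, here is a minimal sketch that evaluates the final vectors numerically. The helper `tf_idf_vectors` is a hypothetical name for this example, and it follows the plain TF x IDF definition used in this article (base-10 log, no smoothing or normalization). It assumes every word in `vocabulary` occurs in at least one document, since the division would fail otherwise.

```python
import math
from collections import Counter

# The three example documents, already tokenized and normalized (assumed input).
documents = [
    ["run", "fast"],
    ["run", "slow"],
    ["walk", "fast", "run", "fast"],
]
vocabulary = ["run", "fast", "slow", "walk"]  # fixes the column order of each vector


def tf_idf_vectors(docs, vocab):
    """Return one TF x IDF vector per document, with columns ordered as in vocab."""
    n_docs = len(docs)
    # IDF: log10(N / number of documents containing the word).
    idf = {
        word: math.log10(n_docs / sum(1 for doc in docs if word in doc))
        for word in vocab
    }
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # words that are absent count as 0
        # TF: occurrences of the word in this document / document length.
        vectors.append([counts[word] / len(doc) * idf[word] for word in vocab])
    return vectors


vectors = tf_idf_vectors(documents, vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(f"Sentence {number}:", [round(value, 4) for value in vector])
```

One thing worth noticing: the run column is 0 in every vector, because run appears in all three documents and therefore has an IDF of log10(3/3) = 0.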

These vectors are then passed on to the appropriate machine learning model where each vector acts as a data point. TF-IDF is just one of the vectorization methods.

There are more methods like Bag of Words and Word2Vec. Refer to the link below:

https://towardsdatascience.com/understanding-nlp-word-embeddings-text-vectorization-1a23744f7223

That’s all for now folks! Happy Learning!
