TF-IDF is short for Term Frequency-Inverse Document Frequency. It is a vectorization technique used in the field of Natural Language Processing. Yes I know, it is a daunting-looking phrase, but trust me, it’s a lot simpler than it sounds.
Uses of TF-IDF
Natural Language Processing, or NLP, is the field in Machine Learning that deals with natural language data such as user comments and reviews, for tasks including but not limited to sentiment analysis and text translation.
The one common obstacle with all of the problem statements associated with NLP is that the input data is textual. A quick reminder for all of you guys from Machine Learning 101 – The algorithm requires numbers! It cannot process textual data. So, what now? How do we proceed?
One innovative technique introduced was Vectorization. It involved associating every sentence with a vector of numerical values and this vector became the input data that was fed into the model. TF-IDF is one such vectorizing technique. Now that we know “the what?” and “the why?”, let’s dive into “the how?”
There is a need for you guys to know just a little more before diving into the working of TF-IDF. Yes, I could have included it here but that would have made this article lengthy and boring. Don’t worry, it’s a very quick read. It covers all the basics of processing data in textual format. Here’s the link to our article – Processing Textual Data
Working of TF-IDF
TF-IDF is usually applied to a bunch of documents each consisting of a bunch of sentences to understand the significance of a word in a document and the set of documents. For ease of understanding, let us consider 3 documents and say that each of the documents has 1 sentence. Assume that each of these sentences has been passed through the tokenization and normalization techniques mentioned in the prerequisites article.
Document 1: run fast
Document 2: run slow
Document 3: walk fast run fast
Here we first calculate “Term Frequency” and then “Inverse Document Frequency” for each word, and multiply these 2 values to get a score for that word. The scores for all the unique words, taken together, form the vector for a sentence.
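Let us first make a frequency table corresponding to all the unique words:
|Word|Document 1|Document 2|Document 3|Documents containing the word|
|---|---|---|---|---|
|run|1|1|1|3|
|fast|1|0|2|2|
|slow|0|1|0|1|
|walk|0|0|1|1|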
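Term Frequency
TF is calculated as follows:
TF(word, document) = (number of times the word appears in the document) / (total number of words in the document)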
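Term Frequency treats a word as more important to a document the more often it appears there. On its own, though, this measure gives equal importance to all the terms, so a word that shows up in every document can score just as high as a word that is specific to one.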
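To make the Term Frequency step concrete, here is a minimal Python sketch for the three example documents. It assumes TF is the word's count divided by the total number of words in the document; the names `documents` and `term_frequency` are made up for this illustration.

```python
from collections import Counter

# The three example documents, already tokenized and normalized.
documents = {
    "Document 1": ["run", "fast"],
    "Document 2": ["run", "slow"],
    "Document 3": ["walk", "fast", "run", "fast"],
}


def term_frequency(tokens):
    """Return {word: count / total number of words} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# One TF dictionary per document.
tf = {name: term_frequency(tokens) for name, tokens in documents.items()}

for name, scores in tf.items():
    print(name, scores)
```

Words that do not appear in a document are simply absent from its dictionary, which is the same as a TF of 0.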
Inverse Document Frequency
IDF is calculated as follows:
IDF(word) = log10(total number of documents / number of documents containing the word)
IDF uses a logarithm to set up an inverse relationship between a word’s importance and how widely it is used. The value returned by the IDF for a particular word shrinks as the number of documents containing that word grows, i.e., the rarer the word is across the set of documents, the more important it is to the documents it does appear in. A word that appears in every document, like “run” here, gets log10(3/3) = 0. (log10 here means log to the base 10.)
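The IDF step can be sketched in Python as well. This sketch assumes IDF is log10 of the total number of documents divided by the number of documents containing the word; the function name `inverse_document_frequency` is made up for this illustration.

```python
import math

# The three example documents, already tokenized and normalized.
documents = [
    ["run", "fast"],
    ["run", "slow"],
    ["walk", "fast", "run", "fast"],
]


def inverse_document_frequency(docs):
    """Return {word: log10(number of docs / number of docs containing word)}."""
    n_docs = len(docs)
    vocabulary = {word for doc in docs for word in doc}
    idf = {}
    for word in vocabulary:
        # Count documents that contain the word at least once.
        containing = sum(1 for doc in docs if word in doc)
        idf[word] = math.log10(n_docs / containing)
    return idf


idf = inverse_document_frequency(documents)

for word in sorted(idf):
    print(word, round(idf[word], 3))
```

Here “run” appears in all 3 documents and gets an IDF of 0, “fast” appears in 2, and “slow” and “walk” appear in only 1 each, so they get the highest value.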
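The product of TF and IDF balances the two: a word scores high when it is frequent within a document but rare across the set of documents. This gives a number that can be associated with each word, and hence enables us to represent the sentence as a vector with one entry per unique word (here in the order run, fast, slow, walk):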
|Sentence|run|fast|slow|walk|
|---|---|---|---|---|
|Sentence 1|1/2 x log10(3/3)|1/2 x log10(3/2)|0 x log10(3/1)|0 x log10(3/1)|
|Sentence 2|1/2 x log10(3/3)|0 x log10(3/2)|1/2 x log10(3/1)|0 x log10(3/1)|
|Sentence 3|1/4 x log10(3/3)|2/4 x log10(3/2)|0 x log10(3/1)|1/4 x log10(3/1)|
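Putting both steps together, the sketch below computes the three sentence vectors as actual numbers. It assumes the word order run, fast, slow, walk and counts the documents containing each word directly from the three example documents (“slow” and “walk” each appear in 1 document). The function name `tf_idf_vectors` is made up for this illustration.

```python
import math
from collections import Counter

# The three example documents, already tokenized and normalized.
documents = [
    ["run", "fast"],
    ["run", "slow"],
    ["walk", "fast", "run", "fast"],
]

# Fixed word order, so every vector has the same layout.
vocabulary = ["run", "fast", "slow", "walk"]


def tf_idf_vectors(docs, vocab):
    """Return one TF x IDF vector per document, ordered like vocab.

    Every word in vocab must occur in at least one document,
    otherwise the IDF division would fail.
    """
    n_docs = len(docs)
    idf = {
        word: math.log10(n_docs / sum(1 for doc in docs if word in doc))
        for word in vocab
    }
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[word] / len(doc) * idf[word] for word in vocab])
    return vectors


vectors = tf_idf_vectors(documents, vocabulary)

for vector in vectors:
    print([round(value, 3) for value in vector])
# [0.0, 0.088, 0.0, 0.0]
# [0.0, 0.0, 0.239, 0.0]
# [0.0, 0.088, 0.0, 0.119]
```

Notice that “run” contributes 0 to every vector: it appears in all the documents, so it does nothing to tell them apart.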
These vectors are then passed on to the appropriate machine learning model where each vector acts as a data point. TF-IDF is just one of the vectorization methods.
There are more methods like Bag of Words and Word2Vec.
That’s all for now folks! Happy Learning!