Placing keywords into website content can feel like arranging furniture in a house. Of course, you need a couch and a kitchen table set, but the arrangement of that furniture is what makes a house feel like a home. Choosing where keywords go within your content works much the same way.

To bring home a sense of great keyword usage for your SEO content strategy, apply a text classification technique to discover your most important keyword choices. Term Frequency/Inverse Document Frequency (TF-IDF) measures how important a word is to one document within a larger set of documents. When applied to web content, it helps marketers identify what their marketing text actually emphasizes and adjust accordingly.

What Is TF-IDF and How Is TF-IDF Calculated?

TF-IDF is a text classification score that highlights how relevant each word in a document is. The relevance is based on how often the word appears in the document, weighed against how many documents in the collection contain it. TF-IDF has been used for large research documents like white papers, with demonstrations using words from large novels.

The TF-IDF score is a product of two separate calculations. The first calculation is the term frequency. Term frequency is a ratio that examines the keyword count against the overall word count. 

The second value is the inverse document frequency. This is a log-scale calculation that compares the total number of documents in the corpus against the number of documents that contain the keyword.

[Image: TF-IDF formula]
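
One common variant of the formula can be written as follows. The symbols are assumptions introduced for this sketch: f_{t,d} is the number of times term t appears in document d, N is the number of documents in the corpus, and n_t is the number of documents that contain t. Other variants use a different log base or add smoothing.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Under this variant, a term that appears in every document has n_t = N, so its IDF is ln(1) = 0 and its TF-IDF score is 0 no matter how often it is used.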

Wikipedia notes variations of the TF-IDF formula. Each variation uses a different frequency measure or adds a weight to the score. But the overall effect is to multiply TF and IDF together to form the TF-IDF score. The magnitude of that score indicates the significance of the keyword’s appearance in the document. If the keyword is common across a site's pages, the TF-IDF will be small (0.02 or so). A keyword that appears on only a few pages will result in a larger TF-IDF value.
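
To make the arithmetic concrete, here is a minimal Python sketch of the two calculations, using only the standard library. The three toy pages and the function names are made up for illustration, and the natural-log IDF is just one of the variations mentioned above, not the only way to compute it.

```python
import math

# Hypothetical toy corpus: three short "pages", already split into words.
pages = {
    "home": ["seo", "tips", "for", "small", "business", "seo"],
    "blog": ["seo", "keyword", "research", "tips"],
    "about": ["our", "seo", "team"],
}

def term_frequency(term, words):
    """Share of a page's words that are the given term."""
    return words.count(term) / len(words)

def inverse_document_frequency(term, corpus):
    """Natural log of (number of pages / number of pages containing the term).

    Assumes the term appears on at least one page.
    """
    containing = sum(1 for words in corpus.values() if term in words)
    return math.log(len(corpus) / containing)

def tf_idf(term, page, corpus):
    """Product of the term frequency on one page and the corpus-wide IDF."""
    return term_frequency(term, corpus[page]) * inverse_document_frequency(term, corpus)

# "seo" appears on every page, so its IDF is ln(3/3) = 0 and the score is 0.
print(tf_idf("seo", "home", pages))
# "keyword" appears only on the blog page, so it scores higher there.
print(tf_idf("keyword", "blog", pages))
```

The contrast between the two printed values is the point: a word used everywhere tells you little about any one page, while a word concentrated on one page stands out.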

How TF-IDF Benefits SEO

Text classification consists of a variety of techniques, but TF-IDF has seen increased usage in marketing. The digitization of commercial text has opened the technique to website pages, landing pages, social media posts, hashtags and even translated text, to identify how frequently a word is used across an entire set of text. In fact, Google, along with other search engines, uses a variation of TF-IDF in its algorithm.

For an SEO strategy, TF-IDF gives marketers a broader overview for adjusting keyword placement within webpage copy or landing page content. As I explained in my post on keyword density, that measure emphasizes a ratio of words within one page, relying on the analyst's judgment to make placement decisions. A TF-IDF value accounts for the appearance of a word across documents.

Thus, marketers gain a sense of where a word appears within content. Imagine identifying content gaps among pages, where current keywords may be better placed on another page that can better rank in the top search results. A placement adjustment can prevent keyword cannibalization between similar page content and avoid keyword stuffing on one page.

Applying R Programming to Find TF-IDF

If you consider the furniture arrangement analogy, you are using TF-IDF to determine whether the keyword relevancy in your pages reflects what you want a search engine to discover and include in query results. So where does a marketer begin?

The first step is to gather the words from the content we want to analyze. This can be done several ways with an open-source programming language such as R or Python (for this example, I am using R). You can read a text file into the language or use an API to access the software containing the words you want to examine. In the example below, I am using a library called readtext to read a text file into an object that the program can recognize and consequently analyze.

[Image: web_content created with readtext]

The object web_content in the example acts as a container, the document part of TF-IDF, with the actual text stored in a column of that object named text. Here is what that text looks like when it is imported.

[Image: imported text]

This text is from a website page, used just to work on the example code. Note that it contains a few backslashes or minor character codes. Characters like that sometimes appear when transferring text from one medium to another.
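
For readers who work in Python rather than R, the import step can be sketched roughly as follows with only the standard library. This is an analog, not the readtext code itself: the sample sentence, the file name page.txt, the doc_id and text field names, and the strip_escape_debris helper are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical sample standing in for exported page copy. It contains a
# literal backslash-n code and a stray backslash left over from the export.
sample = 'Our SEO guide\\n covers keyword research\\ and on-page tips.'

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "page.txt"
    path.write_text(sample, encoding="utf-8")

    # Container with one row per document: an ID plus the raw text.
    web_content = [{"doc_id": path.name, "text": path.read_text(encoding="utf-8")}]

def strip_escape_debris(text):
    """Replace a backslash (plus one trailing lowercase letter, if any) with a space.

    This is a rough heuristic: it would also eat the first letter of a real
    word that directly follows a stray backslash.
    """
    text = re.sub(r"\\[a-z]?", " ", text)
    return re.sub(r"\s+", " ", text).strip()

for row in web_content:
    row["text"] = strip_escape_debris(row["text"])

print(web_content)
```

Keeping an ID next to the text matters later, because the TF-IDF step needs to know which document each word came from.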

The next step is to work with the raw text. The words must be separated from the sentence structure so that the program can identify them. The result is called a "bag of words" (one side note: in white papers, researchers technically call the "bag" a corpus). The separation process applied to the text is called tokenization. Tokenization is a programmatic step that splits the text into individual words so that the body of text can be treated as a "bag of words." The actual process can vary depending on the source text, but most steps involve making all the words lowercase, identifying root words, and removing characters that carry no meaning, such as the .com extensions from social media links and posts.

In this example, the functions in another R library called tidytext, along with built-in functions in R, handle the sorting process. In this case, apply the unnest_tokens() function from the tidytext library to tokenize the text.

[Image: content words from the unnest_tokens() function]
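
As a rough Python counterpart to the tokenization step, the sketch below lowercases the text, strips .com extensions, splits the text into words, and counts them per document. The sample row, the tokenize function, and the doc_id, word and n field names are assumptions for illustration. Root-word identification (stemming) is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Hypothetical input: one row per document, with an ID and its text.
web_content = [
    {"doc_id": "page.txt", "text": "SEO tips: Keyword research for SEO, visit example.com!"},
]

def tokenize(text):
    """Lowercase the text, strip .com extensions, and return the word tokens."""
    text = text.lower()
    text = re.sub(r"\.com\b", " ", text)      # "example.com" becomes "example"
    return re.findall(r"[a-z0-9']+", text)    # keep runs of letters, digits, apostrophes

# One row per (document, word) pair with its count n: the "bag of words".
content_words = [
    {"doc_id": row["doc_id"], "word": word, "n": n}
    for row in web_content
    for word, n in Counter(tokenize(row["text"])).items()
]

for row in content_words:
    print(row)
```

The one-row-per-word layout is the useful part of the design: it already has the word column, document ID column, and count column that a TF-IDF calculation needs.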

The next step is applying the TF-IDF formula. This can be developed as a program with a few ways to calculate the formula. Fortunately, in R programming, the tidytext library provides a ready-made TF-IDF function called bind_tf_idf() so that the user does not have to calculate the formula. Its parameters include the data set being examined, a column that contains the words being examined, a column with a document ID (in case you are combining words from several documents) and a column containing the document term counts. When run, the bind_tf_idf() function yields columns of TF-IDF scores. You can then compare the scores to see which words are emphasized more.

[Image: bind_tf_idf() output]

In this example you can see the TF, IDF, and TF-IDF score. For convenience I added a GitHub gist where you can download the script I created as a starting point.
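
To show what a function of this kind does under the hood, here is a minimal Python sketch that takes rows of word counts and adds TF, IDF and TF-IDF fields. It is not the tidytext implementation: the add_tf_idf name, its parameter names, and the sample counts are assumptions for illustration, and it uses a natural-log IDF with no smoothing.

```python
import math
from collections import defaultdict

# Hypothetical word counts for two pages, one row per (document, word).
content_words = [
    {"doc_id": "home", "word": "seo", "n": 2},
    {"doc_id": "home", "word": "audit", "n": 2},
    {"doc_id": "blog", "word": "seo", "n": 1},
    {"doc_id": "blog", "word": "keyword", "n": 3},
]

def add_tf_idf(rows, term="word", document="doc_id", n="n"):
    """Return copies of the rows with tf, idf and tf_idf fields added."""
    doc_totals = defaultdict(int)       # total words counted per document
    docs_with_term = defaultdict(set)   # documents in which each term appears
    for row in rows:
        doc_totals[row[document]] += row[n]
        docs_with_term[row[term]].add(row[document])
    n_docs = len(doc_totals)

    scored = []
    for row in rows:
        tf = row[n] / doc_totals[row[document]]
        idf = math.log(n_docs / len(docs_with_term[row[term]]))
        scored.append({**row, "tf": tf, "idf": idf, "tf_idf": tf * idf})
    return scored

scored = add_tf_idf(content_words)

# List the words from most to least emphasized.
for row in sorted(scored, key=lambda r: r["tf_idf"], reverse=True):
    print(row["doc_id"], row["word"], round(row["tf_idf"], 3))
```

In this toy data, "seo" appears on both pages and so scores zero, while "keyword" and "audit" each characterize a single page.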

Text Classification Choices Can Lead to Machine Learning for SEO

Marketers who use Python can turn to TfidfVectorizer, a class in the scikit-learn library, to create an analysis similar to the tidytext one in R. To be more precise, either R or Python can also be used to recreate the TF-IDF calculations from scratch.

Creating the TF-IDF formula in a program can take a bit of effort because an analyst must fit the text into the data structures of the programming language.

Yet, whether you choose to create a formula or use a library, you still have a golden opportunity. The major advantage of using either R or Python is making the text classification results easily available for other statistical analysis. TF-IDF can be applied repeatedly, such as comparing documents for similarity or for dynamic tokenization of words. Repeated analyses like these often turn into a machine learning application, in which a framework like PyTorch or TensorFlow can insert probability models into the process. The business opportunity is an accurate means for rapid analysis, comparing thousands of words across a high number of website pages.
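
As one example of a follow-on analysis, the sketch below compares pages for similarity by taking the cosine of their TF-IDF vectors. Cosine similarity is a common choice for this, not something prescribed here, and the page names and scores are hypothetical values made up for illustration.

```python
import math

# Hypothetical TF-IDF scores per page, stored as term -> score.
page_vectors = {
    "pricing": {"plan": 0.30, "monthly": 0.20, "discount": 0.10},
    "plans": {"plan": 0.25, "monthly": 0.15, "support": 0.05},
    "careers": {"hiring": 0.40, "remote": 0.20},
}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term -> score vectors."""
    dot = sum(score * b.get(term, 0.0) for term, score in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Compare every pair of pages once.
names = sorted(page_vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(page_vectors[first], page_vectors[second])
        print(first, second, round(score, 3))
```

Pairs that score close to 1 emphasize nearly the same words, which makes them candidates to review for overlapping keyword targets. Pairs near 0 share no emphasized terms.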

Build a Better SEO Keyword Strategy Across Your Website

Text classification gives you a richer SEO audit of the words that characterize your page content. Exploring word frequency across multiple pages will lead to more decisive SEO insights about where to place the words you want emphasized for a search query. Applying TF-IDF is just one more step that ensures your website or app will find a home in the right search query.