Understanding Keyword Search

At its core, keyword search is about exact matching. When you type a query into a search engine or database, the system looks for documents that contain those exact words. It's a straightforward concept: if your query is "database", the search engine finds all documents where the word "database" appears.

Unlike semantic search, which tries to capture meaning and context, keyword search is literal. It doesn't know that "database" and "data storage" are related concepts; it only finds the exact word you specified.

Let's see this in action with a simple example:

The database stores all user information securely.

Query: database

When the query "database" appears, the search system scans through the text and highlights every occurrence of that exact word. This is the fundamental mechanism of keyword search.
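The highlighting step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine does it: the `highlight` function name, the square-bracket markers, and the case-insensitive matching are all assumptions made for the example.

```python
import re

def highlight(text: str, keyword: str) -> str:
    """Wrap every whole-word occurrence of keyword in [brackets].

    Assumption: matching is case-insensitive and respects word boundaries,
    so "database" does not match inside "databases".
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"[{m.group(0)}]", text)

text = "The database stores all user information securely."
print(highlight(text, "database"))
# The [database] stores all user information securely.
```

Because the match is literal, a sentence that only mentions "data storage" would come back unchanged.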

Real searches often use multiple words. Let's see how that works with a two-word query: "database system".
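One way to picture multi-word matching is to split the query into terms and check each one against the words in a document. The sketch below assumes a document matches only if it contains every query term (AND semantics); real systems may instead accept any term and leave the rest to ranking. The `tokenize` and `matches_all` helpers are hypothetical names for this example.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all(document: str, query: str) -> bool:
    """Assumption: a document matches only if it contains every query term."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))

docs = [
    "The database system stores all user information securely.",
    "Cloud infrastructure provides scalable database solutions for enterprises.",
]
hits = [d for d in docs if matches_all(d, "database system")]
print(hits)
# ['The database system stores all user information securely.']
```

The second sentence contains "database" but not "system", so it is filtered out under this all-terms rule.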

Beyond Matching: Ranking Results

Finding documents that contain your query words is just the first step. In any real search system, multiple documents will match your query. The crucial question becomes: which results are more relevant?

If ten documents all contain your query words, how does the search engine decide which one to show first? Should it be the document where the terms appear most frequently? Or should other factors matter? Let's explore how search systems quantify relevance.

Consider these two documents that both match our query:

Query: database system

Document A

The database system stores all user information securely.

Document B

Our new database system uses a distributed database architecture. This database design improves system performance.

But how do we quantify which document is more relevant?

Measuring Relevance: TF-IDF

To rank search results, we need to measure how relevant a document is to a query. Two key concepts help us do this:

First, we count how many times the query word appears in the document. Intuitively, if "database" appears three times in Document B but only once in Document A, Document B seems more focused on databases. This is called term frequency.
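Counting term frequency is simple enough to sketch directly. The example below uses the two documents from this article; the `term_frequency` helper and its lowercase, alphanumeric tokenization are assumptions for illustration.

```python
import re
from collections import Counter

def term_frequency(document: str) -> Counter:
    """Count how often each lowercase word appears in the document."""
    return Counter(re.findall(r"[a-z0-9]+", document.lower()))

doc_a = "The database system stores all user information securely."
doc_b = (
    "Our new database system uses a distributed database architecture. "
    "This database design improves system performance."
)

tf_a = term_frequency(doc_a)
tf_b = term_frequency(doc_b)
print(tf_a["database"], tf_a["system"])  # 1 1
print(tf_b["database"], tf_b["system"])  # 3 2
```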

Second, we consider how common the word is across all documents. If "database" appears in almost every document in our collection, it's not very distinctive—it won't help us identify which documents are truly about databases. But if it only appears in a few documents, those documents are probably specifically about databases. This is called inverse document frequency.

Together, these create TF-IDF (Term Frequency-Inverse Document Frequency), a fundamental term-weighting scheme in information retrieval.

Let's visualize how this works:

Query: database system

Document A

The database system stores all user information securely.

"database": 1

"system": 1

Document B

Our new database system uses a distributed database architecture. This database design improves system performance.

"database": 3

"system": 2

Entire Document Collection

The database system stores all user information securely.

Machine learning system requires large training datasets and computational resources.

Our new database system uses a distributed database architecture.

Web applications need efficient routing and middleware systems.

The API system returns JSON formatted responses to client requests.

Cloud infrastructure provides scalable database solutions for enterprises.

"database": 3/6 docs

"system": 5/6 docs (counting the plural "systems" in the fourth document as a match)
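The document counts above depend on what counts as a match. The sketch below counts document frequency over the six sentences in two ways: strict exact-word matching, and a crude plural-stripping rule that treats "systems" as "system". Both helpers and the `strip_plural` flag are assumptions for illustration; real engines typically use a proper stemmer or analyzer.

```python
import re

collection = [
    "The database system stores all user information securely.",
    "Machine learning system requires large training datasets and computational resources.",
    "Our new database system uses a distributed database architecture.",
    "Web applications need efficient routing and middleware systems.",
    "The API system returns JSON formatted responses to client requests.",
    "Cloud infrastructure provides scalable database solutions for enterprises.",
]

def tokens(text: str, strip_plural: bool = False) -> set[str]:
    """Return the set of lowercase words in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if strip_plural:
        # Crude assumption: a trailing "s" on a longer word marks a plural.
        words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return set(words)

def document_frequency(term: str, docs: list[str], strip_plural: bool = False) -> int:
    """Count the documents that contain the term at least once."""
    return sum(term in tokens(d, strip_plural) for d in docs)

print(document_frequency("database", collection))                  # 3
print(document_frequency("system", collection))                    # 4
print(document_frequency("system", collection, strip_plural=True)) # 5
```

Under strict exact matching "system" appears in four documents; the 5/6 figure used in this article holds once "systems" is counted as well.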

Calculating the TF-IDF Score

Now we combine these two measurements. The TF-IDF score is calculated by multiplying the term frequency by the inverse document frequency:

TF (Term Frequency) = Number of times the term appears in the document
IDF (Inverse Document Frequency) = log(Total documents / Documents containing the term), using a base-10 logarithm
TF-IDF = TF × IDF

A high TF-IDF score means the term appears frequently in this document but rarely in others—making it a strong signal that this document is relevant to that query term.
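The formula translates almost word for word into code. This is a minimal sketch: the function names are hypothetical, the base-10 logarithm is inferred from the article's own figures (log(6/3) = 0.30), and the sketch assumes the term occurs in at least one document so the division is defined.

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency with a base-10 logarithm.

    Assumes docs_with_term >= 1; a term found in no document would
    cause a division by zero here.
    """
    return math.log10(total_docs / docs_with_term)

def tf_idf(tf: int, total_docs: int, docs_with_term: int) -> float:
    """Term frequency multiplied by inverse document frequency."""
    return tf * idf(total_docs, docs_with_term)

print(round(idf(6, 3), 2))  # 0.3
print(round(idf(6, 5), 2))  # 0.08
print(idf(6, 6))            # 0.0
```

The last line shows the limiting case: a term that appears in every document gets an IDF of zero and contributes nothing to the score.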

Let's calculate the TF-IDF scores for both documents and see which ranks higher:

Document A

For "database": TF × IDF = 1 × log(6/3) = 1 × 0.30 = 0.30

For "system": TF × IDF = 1 × log(6/5) = 1 × 0.08 = 0.08

Document A Total: 0.30 + 0.08 = 0.38

Document B

For "database": TF × IDF = 3 × log(6/3) = 3 × 0.30 = 0.90

For "system": TF × IDF = 2 × log(6/5) = 2 × 0.08 = 0.16

Document B Total: 0.90 + 0.16 = 1.06

Ranking: Document B (1.06) > Document A (0.38)

Document B ranks higher because it has more occurrences of the query terms, resulting in a higher combined TF-IDF score.
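The whole calculation can be reproduced with a short script. This is a sketch under stated assumptions: the document frequencies (3 and 5 out of 6) are taken as given from the article rather than recomputed, the logarithm is base 10, and the `score` function and constant names are hypothetical.

```python
import math
import re
from collections import Counter

TOTAL_DOCS = 6
# Document frequencies as stated in the article (assumed, not recomputed).
DOC_FREQ = {"database": 3, "system": 5}

documents = {
    "Document A": "The database system stores all user information securely.",
    "Document B": (
        "Our new database system uses a distributed database architecture. "
        "This database design improves system performance."
    ),
}

def score(document: str, query: str) -> float:
    """Sum TF x IDF over the query terms, using a base-10 logarithm."""
    counts = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    total = 0.0
    for term in query.lower().split():
        idf = math.log10(TOTAL_DOCS / DOC_FREQ[term])
        total += counts[term] * idf
    return total

query = "database system"
ranking = sorted(documents, key=lambda name: score(documents[name], query), reverse=True)
for name in ranking:
    print(name, round(score(documents[name], query), 2))
# Document B 1.06
# Document A 0.38
```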

Notice how "database" has a much higher TF-IDF score than "system" in both documents. This is because "system" appears in 5 out of 6 documents, so it is too common to be distinctive. The low IDF penalizes common words, so rarer terms like "database" carry more weight in determining relevance.

Document B's total score (1.06) is nearly three times Document A's (0.38), so it ranks first in the search results. This is how keyword search systems quantify "relevance": frequent occurrence of rare terms signals high relevance.

While modern search systems use more sophisticated ranking functions (like BM25, which refines TF-IDF with term-frequency saturation and document-length normalization), understanding TF-IDF gives you the foundation for how keyword-based search systems quantify relevance and rank results.