Building a Simple Search Engine That Actually Works


Why Build Your Own?

Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.

Sometimes you just want something that:

  • Works with your existing database
  • Doesn't require external services
  • Is easy to understand and debug
  • Actually finds relevant results

That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.


The Core Idea

The concept is simple: tokenize everything, store it, then match tokens when searching.

Here's how it works:

  1. Indexing: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
  2. Searching: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
  3. Scoring: We use the stored weights to calculate relevance scores

The magic is in the tokenization and weighting. Let me show you what I mean.


Building Block 1: The Database Schema

We need two simple tables: index_tokens and index_entries.

index_tokens

This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.

// index_tokens table structure
id | name    | weight
---|---------|-------
1  | parser  | 20     // From WordTokenizer
2  | parser  | 5      // From PrefixTokenizer
3  | parser  | 1      // From NGramsTokenizer
4  | parser  | 10     // From SingularTokenizer

Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.

The unique constraint is on (name, weight), so the same token name can exist multiple times with different weights.

index_entries

This table links tokens to documents with field-specific weights.

// index_entries table structure
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 600
2  | 2        | 1             | 1        | 42          | 150

The weight here is the final calculated weight: field_weight × tokenizer_weight × ceil(sqrt(token_length)). For the first row, that's 10 (title field) × 20 (word tokenizer) × 3 (ceil(sqrt(6)) for "parser") = 600. This encodes everything we need for scoring. We'll come back to scoring later in the post.

We add indexes on:

  • (document_type, document_id) - for fast document lookups
  • token_id - for fast token lookups
  • (document_type, field_id) - for field-specific queries
  • weight - for filtering by weight

Why this structure? Simple, efficient, and leverages what databases do best.
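
For reference, here's what the schema could look like as DDL. This is a minimal MySQL-flavoured sketch based on the columns, unique constraint, and indexes described above; the exact column types and index names are my assumptions, not the original migrations:

-- Minimal sketch of the two tables (column types and index names are assumptions)
CREATE TABLE index_tokens (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(64)  NOT NULL,
    weight INT          NOT NULL,
    UNIQUE KEY uniq_name_weight (name, weight)
);

CREATE TABLE index_entries (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    token_id      INT UNSIGNED NOT NULL,
    document_type INT NOT NULL,
    field_id      INT NOT NULL,
    document_id   INT NOT NULL,
    weight        INT NOT NULL,
    KEY idx_document (document_type, document_id),
    KEY idx_token (token_id),
    KEY idx_field (document_type, field_id),
    KEY idx_weight (weight)
);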


Building Block 2: Tokenization

What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like ["parser"], ["par", "pars", "parse", "parser"], or ["par", "ars", "rse", "ser"] depending on which tokenizer we use.

Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.

All tokenizers implement a simple interface:

interface TokenizerInterface
{
    public function tokenize(string $text): array;  // Returns array of Token objects
    public function getWeight(): int;               // Returns tokenizer weight
}

Simple contract, easy to extend.
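
One small piece not shown here is the Token object the tokenizers return. A minimal sketch of what it might look like, inferred from how it's used in this post ($token->value, $token->weight); treat the exact class as an assumption:

// Minimal value object for a token string plus its tokenizer weight.
// Sketch inferred from usage in this post; the real class may differ.
final class Token
{
    public function __construct(
        public readonly string $value,
        public readonly int $weight
    ) {}
}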

Word Tokenizer

This one is straightforward—it splits text into individual words. "parser" becomes just ["parser"]. Simple, but powerful for exact matches.

First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:

class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Normalize: lowercase, remove special chars
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);

Next, we split into words and filter out short ones:

        // Split into words, filter short ones
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);

Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.

Finally, we return unique words as Token objects:

        // Return as Token objects with weight
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}

Weight: 20 (high priority for exact matches)

Prefix Tokenizer

This generates word prefixes. "parser" becomes ["par", "pars", "parse", "parser"] (with min length 4). This helps with partial matches and autocomplete-like behavior.

First, we extract words (same normalization as WordTokenizer):

class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}
    
    public function tokenize(string $text): array
    {
        // Normalize same as WordTokenizer
        $words = $this->extractWords($text);

Then, for each word, we generate prefixes from the minimum length to the full word:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Generate prefixes from min length to full word
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // Use associative array for uniqueness
            }
        }

Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.

Finally, we convert the keys to Token objects:

        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 5 (medium priority)

Why min length? Avoid too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.

N-Grams Tokenizer

This creates character sequences of a fixed length (I use 3). "parser" becomes ["par", "ars", "rse", "ser"]. This catches typos and partial word matches.

First, we extract words:

class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}
    
    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);

Then, for each word, we slide a window of fixed length across it:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Sliding window of fixed length
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }

The sliding window: for "parser" with length 3, we get:

  • Position 0: "par"
  • Position 1: "ars"
  • Position 2: "rse"
  • Position 3: "ser"

Why does this work? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".

Finally, we convert to Token objects:

        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 1 (low priority, but catches edge cases)

Why 3? Balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.

Normalization

All tokenizers do the same normalization:

  • Lowercase everything
  • Remove special characters (keep only alphanumerical)
  • Normalize whitespace (multiple spaces to single space)

This ensures consistent matching regardless of input format.
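
The extractWords() helper that PrefixTokenizer and NGramsTokenizer call isn't shown above. A plausible sketch, assuming it simply applies the same normalization as WordTokenizer and returns the remaining words:

// Sketch of the shared helper: normalize the text, split into words,
// and drop single-character words (same rules as WordTokenizer).
private function extractWords(string $text): array
{
    $text = mb_strtolower(trim($text));
    $text = preg_replace('/[^a-z0-9]/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);

    $words = explode(' ', $text);

    return array_values(array_filter($words, fn($w) => mb_strlen($w) >= 2));
}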


Building Block 3: The Weight System

We have three levels of weights working together:

  1. Field weights: Title vs content vs keywords
  2. Tokenizer weights: Word vs prefix vs n-gram (stored in index_tokens)
  3. Document weights: Stored in index_entries (calculated: field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Final Weight Calculation

When indexing, we calculate the final weight like this:

$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));

For example:

  • Title field: weight 10
  • Word tokenizer: weight 20
  • Token "parser": length 6
  • Final weight: 10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600

Why use ceil(sqrt())? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use ceil() to round up to the nearest integer, keeping weights as whole numbers.
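
To make the diminishing returns concrete, here's how the length multiplier grows for a few token lengths:

// ceil(sqrt(length)) grows slowly: longer tokens win, but not linearly.
foreach ([3, 6, 12, 100] as $length) {
    echo $length . ' => ' . ceil(sqrt($length)) . PHP_EOL;
}
// Output: 3 => 2, 6 => 3, 12 => 4, 100 => 10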

Tuning Weights

You can adjust weights for your use case:

  • Increase field weights for titles if titles are most important
  • Increase tokenizer weights for exact matches if you want to prioritize exact matches
  • Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less

You can see exactly how weights are calculated and adjust them as needed.


Building Block 4: The Indexing Service

The indexing service takes a document and stores all its tokens in the database.

The Interface

Documents that can be indexed implement IndexableDocumentInterface:

interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}

To make a document searchable, you implement these three methods:

class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }
    
    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }
    
    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);
        
        // Add keywords if present
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }
        
        return $fields;
    }
}

Three methods to implement:

  • getDocumentType(): returns the document type enum
  • getDocumentId(): returns the document ID
  • getIndexableFields(): builds fields with weights using the fluent API (see the sketch below)
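
The IndexableFields builder itself isn't shown in this post, but its shape follows from how it's used: a fluent addField() plus getFields() and getWeights() accessors keyed by the field id. Here's a minimal sketch under those assumptions:

// Sketch of the fluent builder, inferred from its usage in this post.
final class IndexableFields
{
    /** @var array<int|string, string> field id value => content */
    private array $fields = [];

    /** @var array<int|string, int> field id value => field weight */
    private array $weights = [];

    public static function create(): self
    {
        return new self();
    }

    public function addField(FieldId $fieldId, string $content, int $weight): self
    {
        $this->fields[$fieldId->value] = $content;
        $this->weights[$fieldId->value] = $weight;

        return $this;
    }

    public function getFields(): array
    {
        return $this->fields;
    }

    public function getWeights(): array
    {
        return $this->weights;
    }
}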

You can index documents:

  • On create/update (via event listeners)
  • Via commands: app:index-document, app:reindex-documents
  • Via cron (for batch reindexing)

How It Works

Here's the indexing process, step by step.

First, we get the document information:

class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. Get document info
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();

The document provides its fields and weights via the IndexableFields builder.

Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:

        // 2. Remove existing index for this document
        $this->removeDocumentIndex($documentType, $documentId);
        
        // 3. Prepare batch insert data
        $insertData = [];

Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.

Now, we process each field. For each field, we run all tokenizers:

        // 4. Process each field
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }
            
            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;
            
            // 5. Run all tokenizers on this field
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);

For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:

                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;
                    
                    // 6. Find or create token in index_tokens
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);
                    
                    // 7. Calculate final weight
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));
                    
                    // 8. Add to batch insert
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }

Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.

Finally, we batch insert everything:

        // 9. Batch insert for performance
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }
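
The batchInsertSearchDocuments() method isn't shown in the post; here's a minimal sketch of a multi-row insert, assuming the same Doctrine DBAL connection used elsewhere in these snippets:

// Sketch: one multi-row INSERT instead of one statement per row.
// For very large documents you'd probably chunk $insertData.
private function batchInsertSearchDocuments(array $insertData): void
{
    $placeholders = [];
    $params = [];

    foreach ($insertData as $row) {
        $placeholders[] = '(?, ?, ?, ?, ?)';
        $params[] = $row['token_id'];
        $params[] = $row['document_type'];
        $params[] = $row['field_id'];
        $params[] = $row['document_id'];
        $params[] = $row['weight'];
    }

    $sql = 'INSERT INTO index_entries (token_id, document_type, field_id, document_id, weight) VALUES '
        . implode(', ', $placeholders);

    $this->connection->executeStatement($sql, $params);
}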

The findOrCreateToken method is straightforward:

    private function findOrCreateToken(string $name, int $weight): int
    {
        // Try to find existing token with same name and weight
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();
        
        if ($result) {
            return (int) $result['id'];
        }
        
        // Create new token
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);
        
        return (int) $this->connection->lastInsertId();
    }
}

Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.

The key points:

  • We remove old index first (handles updates)
  • We batch insert for performance (one query instead of many)
  • We find or create tokens (avoids duplicates)
  • We calculate final weight on the fly

Building Block 5: The Search Service

The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.

How It Works

Here's the search process, step by step.

First, we tokenize the query using all tokenizers:

class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. Tokenize query using all tokenizers
        $queryTokens = $this->tokenizeQuery($query);
        
        if (empty($queryTokens)) {
            return [];
        }

If the query produces no tokens (e.g., only special characters), we return empty results.

Why Tokenize the Query Using the Same Tokenizers?

Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.

Example:

  • Indexing with PrefixTokenizer creates tokens: "par", "pars", "parse", "parser"
  • Searching with only WordTokenizer creates token: "parser"
  • We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
  • Result: Incomplete matches, missing relevant documents!

The solution: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.

This is why the SearchService and SearchIndexingService both receive the same set of tokenizers.
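
The tokenizeQuery() helper isn't shown either; a plausible sketch, assuming it just runs every registered tokenizer over the query and merges the results:

// Sketch: run all registered tokenizers over the query and collect the tokens.
// Deduplication by token value happens in the next step of search().
private function tokenizeQuery(string $query): array
{
    $tokens = [];

    foreach ($this->tokenizers as $tokenizer) {
        foreach ($tokenizer->tokenize($query) as $token) {
            $tokens[] = $token;
        }
    }

    return $tokens;
}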

Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:

        // 2. Extract unique token values
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));

Why extract values? We search by token name, not by weight. We need the unique token names to search for.

Then, we sort tokens by length (longest first). This prioritizes specific matches:

        // 3. Sort tokens (longest first - prioritize specific matches)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));

Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.

We also limit the token count to prevent DoS attacks with huge queries:

        // 4. Limit token count (prevent DoS with huge queries)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }

Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).

Now, we execute the optimized SQL query. The executeSearch() method builds the SQL query and executes it:

        // 5. Execute optimized SQL query (start with minimum token weight 10)
        $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

Inside executeSearch(), we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:

private function executeSearch(DocumentType $documentType, array $tokenValues, int $tokenCount, ?int $limit, int $minTokenWeight): array
{
    // Build parameter placeholders for token values
    $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));
    
    // Build the SQL query (shown in full in "The SQL Query" section below)
    $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";
    
    // Build parameters array
    $params = [
        $documentType->value,  // document_type
        ...$tokenValues,       // token values for IN clause
        $documentType->value,  // for subquery
        ...$tokenValues,       // token values for subquery
        $minTokenWeight,      // minimum token weight
        // ... more parameters
    ];
    
    // Execute query with parameter binding
    $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();
    
    // Filter out results with low normalized scores (below threshold)
    $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);
    
    // Convert to SearchResult objects
    return array_map(
        fn($result) => new SearchResult(
            documentId: (int) $result['document_id'],
            score: (float) $result['score']
        ),
        $results
    );
}

The SQL query does the heavy lifting: finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.

The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).

We'll see the full query in the next section.

The main search() method then returns the results:

        // 6. Return results
        return $results;
    }
}

The Scoring Algorithm

The scoring algorithm balances multiple factors. Let's break it down step by step.

The base score is the sum of all matched token weights:

SELECT 
    sd.document_id,
    SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
GROUP BY sd.document_id

  • sd.weight: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Why not multiply by st.weight? The tokenizer weight is already included in sd.weight during indexing. The st.weight from index_tokens is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).

This gives us the raw score. But we need more than that.

We add a token diversity boost. Documents matching more unique tokens score higher:

(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score

Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.

We also add an average weight quality boost. Documents with higher quality matches score higher:

(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score

Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.

We apply a document length penalty. Prevents long documents from dominating:

base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))

Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.

Finally, we normalize by dividing by the maximum score:

score / GREATEST(1.0, max_score) as normalized_score

This gives us a 0-1 range, making scores comparable across different queries.

The full formula looks like this:

SELECT 
    sd.document_id,
    (
        SUM(sd.weight) *                                  -- Base score
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * -- Token diversity boost
        (1.0 + LOG(1.0 + AVG(sd.weight))) /              -- Average weight quality boost
        (1.0 + LOG(1.0 + doc_token_count.token_count))   -- Document length penalty
    ) / GREATEST(1.0, max_score) as score                -- Normalization (max_score comes from a subquery omitted here for brevity)
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
    AND sd.document_id IN (
        SELECT DISTINCT document_id
        FROM index_entries sd2
        INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
        WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ?  -- Ensure at least one token with meaningful weight
    )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?

Why the subquery with st2.weight >= ?? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.

Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.

If no results are found with a minimum token weight of 10, we retry with a minimum weight of 1 (a fallback for edge cases), as sketched below.
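
That fallback is just a second call with a lower threshold; a minimal sketch, assuming the executeSearch() signature shown earlier:

// Sketch of the fallback: require a token with tokenizer weight >= 10 first,
// then relax to >= 1 if nothing matched.
$results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

if (empty($results)) {
    $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 1);
}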

Converting IDs to Documents

The search service returns SearchResult objects with document IDs and scores:

class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}

But we need actual documents, not just IDs. We convert them using repositories:

// Perform search
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// Get document IDs from search results (preserving order)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// Get documents by IDs (preserving order from search results)
$documents = $this->documentRepository->findByIds($documentIds);

Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.

The repository method handles the conversion:

public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }
    
    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)')  // Preserve order from IDs array
        ->getQuery()
        ->getResult();
}

The FIELD() function preserves the order from the IDs array, so documents appear in the same order as search results.
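
One caveat: FIELD() is MySQL-specific, and in Doctrine DQL it usually requires registering a custom function (it's not built in). If that's not an option, you can restore the relevance order in PHP instead. A minimal sketch; getId() is an assumed getter on the document entity:

// Alternative sketch: fetch without the FIELD() ordering and restore the
// relevance order in PHP using the IDs returned by the search service.
$byId = [];
foreach ($documents as $document) {
    $byId[$document->getId()] = $document;
}

$orderedDocuments = [];
foreach ($documentIds as $id) {
    if (isset($byId[$id])) {
        $orderedDocuments[] = $byId[$id];
    }
}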


The Result: What You Get

What you get is a search engine that:

  • Finds relevant results quickly (leverages database indexes)
  • Handles typos (n-grams catch partial matches)
  • Handles partial words (prefix tokenizer)
  • Prioritizes exact matches (word tokenizer has highest weight)
  • Works with your existing database (no external services)
  • Is easy to understand and debug (everything is transparent)
  • Gives you full control over behavior (adjust weights, add tokenizers, modify scoring)

Extending the System

Want to add a new tokenizer? Implement TokenizerInterface:

class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Your stemming logic here
        // Return array of Token objects
    }
    
    public function getWeight(): int
    {
        return 15; // Your weight
    }
}

Register it in your services configuration, and it's automatically used for both indexing and searching.

Want to add a new document type? Implement IndexableDocumentInterface:

class Comment implements IndexableDocumentInterface
{
    // getDocumentId() and getDocumentType() omitted here - same idea as in the Post example
    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}

Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.


Conclusion

So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.

The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.

You own it, you control it, you can debug it. And that's worth a lot.


Learning Experience


And more learning.


Noop Functions vs Optional Chaining: A Performance Deep Dive


Discover why noop functions are significantly faster than optional chaining in JavaScript!

Hi Folks,

This week I want to talk about something that might surprise you: the performance cost of optional chaining in JavaScript. A question came up recently about whether using a noop function pattern is faster than optional chaining, and the answer might make you rethink some of your coding patterns.

After a pull request review I did, Simone Sanfratello created a comprehensive benchmark to verify some of my thinking on this topic, and the results were eye-opening.

The Setup

Let's start with a simple scenario. You have two approaches to handle optional function calls:

// Approach 1: Noop function
function noop() {}
function testNoop() {
  noop();
}

// Approach 2: Optional chaining
const a = {}
function testOptionalChaining() {
  a.b?.fn?.();
}

Both accomplish the same goal: they execute safely without throwing errors. But how do they compare performance-wise?

The Numbers Don't Lie

Simone and I ran comprehensive benchmarks with 5 million iterations to get precise measurements. The results were striking:

Test Case                            | Ops/Second  | Relative to Noop
-------------------------------------|-------------|-----------------
Noop Function Call                   | 939,139,797 | Baseline
Optional Chaining (empty object)     | 134,240,361 | 7.00x slower
Optional Chaining (with method)      | 149,748,151 | 6.27x slower
Deep Optional Chaining (empty)       | 106,370,022 | 8.83x slower
Deep Optional Chaining (with method) | 169,510,591 | 5.54x slower

Yes, you read that right. Noop functions are 5.5x to 8.8x faster than optional chaining operations.

Why Does This Happen?

The performance difference comes down to what the JavaScript engine needs to do:

Noop function: Simple function call overhead. The V8 engine optimizes this extremely well - it's just a jump to a known address and back. In fact, V8 will inline trivial functions like noop, making them essentially zero-overhead. The function call completely disappears in the optimized code.

Optional chaining: Property lookup, null/undefined check, potentially multiple checks for chained operations, and then the function call. Each ?. adds overhead that V8 can't optimize away because it has to perform the null/undefined checks at runtime.

The deeper your optional chaining, the worse it gets. Triple chaining like a?.b?.c?.fn?.() is about 1.17x slower than single-level optional chaining.

A Real-World Pattern: Fastify's Logger

This is exactly why Fastify uses the abstract-logging module. When no logger is provided, instead of checking logger?.info?.() throughout the codebase, Fastify provides a noop logger object with all the logging methods as noop functions.

// Instead of this everywhere in the code:
server.logger?.info?.('Request received');
server.logger?.error?.('Something went wrong');

// Fastify does this:
const logger = options.logger || require('abstract-logging');
// Now just call it directly:
server.logger.info('Request received');
server.logger.error('Something went wrong');

This is an important technique: provide noops upfront rather than check for existence later. V8 inlines these noop functions, so when logging is disabled, you pay essentially zero cost. The function call is optimized away completely. But if you use optional chaining, you're stuck with the runtime checks every single time, and V8 can't optimize those away.

The TypeScript Trap

One of the reasons we see so much unnecessary optional chaining in modern codebases is TypeScript. TypeScript's type system encourages defensive coding by marking properties as potentially undefined, even when your runtime guarantees they exist. This leads developers to add ?. everywhere "just to be safe" and satisfy the type checker.

Consider this common pattern:

interface Config {
  hooks?: {
    onRequest?: () => void;
  }
}

function processRequest(config: Config) {
  config.hooks?.onRequest?.(); // Is this really needed?
}

If you know your config object always has hooks defined at runtime, you're paying the optional chaining tax unnecessarily. TypeScript's strictNullChecks pushes you toward this defensive style, but it comes at a performance cost. The type system can't know your runtime invariants, so it forces you to check things that might never actually be undefined in practice.

The solution? Use type assertions or better type modeling when you have runtime guarantees. Here's how:

// Instead of this:
config.hooks?.onRequest?.();

// Do this if you know hooks always exists:
config.hooks!.onRequest?.();

// Or even better, fix the types to match reality:
interface Config {
  hooks: {
    onRequest?: () => void;
    onResponse?: () => void;
  }
}

// Now you can write:
config.hooks.onRequest?.();

// Or if you control both and know onRequest exists, use a noop:
const onRequest = config.hooks.onRequest || noop;
onRequest();

Don't let TypeScript's pessimistic type system trick you into defensive code you don't need.

The Real-World Context

Before you rush to refactor all your optional chaining, let me add some important context:

Even the "slowest" optional chaining still executes at 106+ million operations per second. For most applications, this performance difference is completely negligible. You're not going to notice the difference unless you're doing this in an extremely hot code path.

Memory usage is also identical across both approaches - no concerns there.

My Recommendation

Don't prematurely optimize. Write your code with optional chaining where it makes sense for safety and readability. For most Node.js applications, including web servers and APIs, optional chaining is perfectly fine. The safety and readability benefits far outweigh the performance cost in 99% of cases.

However, noop functions make sense when you're in a performance-critical hot path or every microsecond counts. If you control the code and can guarantee the function exists, skipping the optional chaining overhead is a clear win. Think high-frequency operations, tight loops, or code that runs thousands of times per request. Even at a few thousand calls per request, that 5-8x performance difference starts to add up.

If profiling shows that a specific code path is a bottleneck, then consider switching to noop functions or other optimizations. Use optional chaining for dealing with external data or APIs where you don't control the structure, and use it in normal business logic where code readability and safety are priorities.

Remember: readable, maintainable code is worth more than micro-optimizations in most cases. But when those microseconds matter, now you know the cost.

Thanks to Simone Sanfratello for creating the benchmarks that confirmed these performance characteristics!


apps.apple.com: App Store web version



A quote from @belligerentbarbies


I'm worried that they put co-pilot in Excel because Excel is the beast that drives our entire economy and do you know who has tamed that beast?

Brenda.

Who is Brenda?

She is a mid-level employee in every finance department, in every business across this stupid nation and the Excel goddess herself descended from the heavens, kissed Brenda on her forehead and the sweat from Brenda's brow is what allows us to do capitalism. [...]

She's gonna birth that formula for a financial report and then she's gonna send that financial report to a higher up and he's gonna need to make a change to the report and normally he would have sent it back to Brenda but he's like oh I have AI and AI is probably like smarter than Brenda and then the AI is gonna fuck it up real bad and he won't be able to recognize it because he doesn't understand Excel because AI hallucinates.

You know who's not hallucinating?

Brenda.

@belligerentbarbies, on TikTok


A new personal best today!

