emrox's blurblog

Ideas by Doug
Monday December 15^th, 2025 at 5:24 AM

Savage Chickens – Cartoons on Sticky Notes by Doug Savage

Ideas

And more gifts.

Read the whole story

emrox

5 days ago

reply

Hamburg, Germany

Manual: Spaces
Tuesday December 9^th, 2025 at 3:15 AM

Space (whitespace) is a whole group of glyphs, one of the most important and frequently-used. Any computer user knows space as the widest key on their keyboard, however the notion itself is much bigger and comprises multiple important typographic terms and ideas.

Space in general is a blank unprinted area, a counterform that separates letters, words, lines etc. In typography, there are several types of spaces: sinkage (space on a page above a textblock), indent (space before the paragraph), leading (vertical space), word spacing, and letter spacing. In this article, we will primarily focus on word spacing, i.e. the space as a glyph.

European languages did not use word spacing for a long time, it was not until the 7th century that word spacing entered Latin script. In the age of metal type, the space was a material, tangible object — a piece of metal that left no print. In the pre-digital era, most text blocks were justified, which required several spaces of different width. Those types of spacing were defined by the notion of em (or point size), which is height of the piece of metal litera

Diagram of a cast metal sort, c is point size

used for printing a character. For example, one em in a 12-point typeface is 12 points, whereas its en (half-em) spaces’ width is 6pt, third space (of an em) equals 4pt, and so on.

Whitespace characters in Gauge. Widths and correlations between spaces differ depending on the typeface

These types of spaces are still existent in the digital age, but they are mostly used by advanced typographers. Messengers, text editors, and other programs and applications most typically use only regular space.

Word space

Standard space, word space, space per se, is the symbol typed using the widest key on the keyboard.

In metal type, the size of standard space varied depending on the typographic tradition, in most cases the space was rather wide.

As a standard word space, metal composition used an en space, half the height of the point size, or em-square (in Cyrillic typography), while Latin space was equal to the third of the em space. Living Typography (2012)

In the early digitised fonts one often sees excessively wide spaces; probably, it was an attempt to imitate en space, or three-per-em space, which were used as the main spacing material in metal type. Such a space width can affect the typesetting rhythm and would seem redundant in modern typography.

Wide spacing is both physiologically unnecessary and makes the whole typeset structure reticulate, aesthetically ruining the page’s layout. If for some reason you can’t stick to en space size in this particular line, it’s better to scale down spacing using three-per-em spaces (that equal to the third of an em), or spaces of 3, or even 2 points. M. I. Schelkunov History, Technique, Art of Printing (1926)

A wide word spacing seems weird to an eye of the modern reader, and it is way too visible in texts

Today, word space width is specified by the typeface’s designer themselves, and it is one of the defining moments in designing a typeface, along with spacing, — texture and rhythm of the typeset are heavily dependent on word space width.

Many modern typographers are seeking to subject the space width to certain rules. For example, some type designers claim that the space should be equal to the bounding box of lowercase letter i. However, this rule can’t be universal: specifically, it definitely won’t work for typefaces where letter i is of unconventional design and proportions. In super large point sizes, spacing and word spaces are often intentionally reduced, as in such cases even the bounding box of the i can be too wide.

It used to be a rule of thumb for headline settings to leave a space between words that is just wide enough to fit in a lowercase i. For comfortable reading of long lines, the space between words should be much wider. Erik Spiekermann Stop stealing sheep & find out how type works (1993)

Depending on whether your typeface is serif or sans serif, it makes sense to take, or not to take, in consideration sidebearings of the glyph. It can be very different depending on style, too: with wide and light weights, there will be more unprinted area than with narrow and heavy weights, and this also applies to the space width.

There is no question but that wordspaces may not be too large, or that the line must appear to be an even, well-balanced whole. What applies to letterspaces also applies to wordspaces: they too are a function of the counters of the individual letters: the smaller these are, the smaller the wordspaces; the larger the counters, the larger the wordspaces. Jost Hochuli Detail in Typography (2008)

Blank space between words should be such as to ensure that words are visibly separated from each other — if spacing is wider, there will be holes between words, if smaller, it will be difficult to tell one word from another. You can’t measure space with a ruler, as everything depends on specific design or typeface.

Word spaces as set in Kazimir Text. The space width is good: words are separated from one another, the hierarchy of white space is maintained

If you increase word spacing, word spaces would conflict with leading, which makes it hard to navigate through the text

If you decrease the width of word space, it will affect legibility, as the words will blend together

Using double spaces is a technique inherited from the age of typewriters. It is strongly advisable to check a document for double spaces and replace those by single spaces.

Some of the recommendations learned by the educated typist are still now acquired habits wrongly used in digital documents; for instance, the use of three spaces after a period or two after the comma. There was just one space width available in the typewriter, so words and sentences were separated by the same distance. The double space was used to differentiate sentences and improve the readability of the text. María Ramos Silva Type design for typewriters: Olivetti (2015)

Additional spacing after a period is a questionable method in terms of readability. It can be assumed that in the age of typewriters additional space could have better separated sentences from one another in the context of monowidth typeface, yet monowidth period and space already form a larger gap than any space within the sentence. Since typewriters, typesetting tools have significantly improved over time, and today nobody will typeset in a monowidth typeface, unless it is absolutely necessary. So, currently, the use of double spaces is considered mauvais ton, i.e. bad manners, regardless of typeface.

American lawyer Matthew Butterick wrote a book on typography for lawyers, writers, and anyone who works with text. In the US, it is still very common among the older generation to use double spaces, so Matthew dedicated two entire chapters of his Practical Typography to this issue. Butterick tried to convince his audience by imaginary dialogues:

“If you approve of smaller word spaces in some situations, why do you insist on only one space between sentences, where a larger gap might be useful?” Because you’re already getting a larger gap. A sentence-ending word space typically appears next to a period. A period is mostly white space. So visually, the space at the end of a sentence already appears larger than a single word space. No need to add another. Matthew Butterick Butterick’s Practical Typography (2013)

Non-breaking Space

Non-breaking space is a space character that prevents an automatic line break at its position. For instance, in Russian and a number of other Central and Eastern European languages, non-breaking space serves to stick together a preposition and a word next to it, numbers and units of measurements, name and surname, etc.

Non-breaking space is supported by almost any text editing program, graphic design software, or browser, along with a standard space, so one shouldn’t forget to utilise it according to the typesetting rules of any given language.

In Russian language, non-breaking space shall connect the dash and its previous word (except for direct speech), prepositions with following words, initials with surname, abbreviations (such as i.e.), numero sign with numbers, numbers and units of measurements.

In English it is considered good manners to stick together not prepositions, but pronouns and articles with the following word. However, this rule is often neglected, especially when it comes to newspapers and magazines.

Professional typesetting software have spaces of non-standard widths. In InDesign, all additional spaces — em space, en space, thin space, etc. — are non-breaking.

Additional spaces

Standard space is used everywhere; it is supported by any word, text, or code processing app. Non-breaking space is supported almost anywhere as well. However, computer typesetting still possesses a number of spaces dating back to metal type, allowing for finer adjustment of white space if necessary.

If a font supports additional spaces, those can be fetched via glyphs palette or using clipboard. Most graphic software do not support those spaces; for example, Adobe Illustrator 2020 includes only four additional spaces: em space, en space, thin space, and hair space.

And there is a reason for that: neither Illustrator, nor Photoshop were designed for advanced typesetting and laying out books. However, in InDesign you can easily set any kind of space, and a skilled typographer will use those.

Em Space

A space equal to the height of the em square (point size.) In early serifs, the metal face of the capital М tended to be square — probably, thus the English name. Metal type often used em space as paragraph indent.

En Space

Half of the width of an em. Russian-language metal type composition considered it the main type of space, even though in word spacing, especially if the text is aligned to the left or right, it is excessively wide.

Three-per-em Space, Third Space

One third of an em space. Historically considered as the main space in Western European typography.

The first obligation of a good typesetter is to achieve a compact line image, something best accomplished by using three-to-em or three-space word spacing. In former times even roman was set much tighter than we do it today; the specimen sheet that contains the original of Garamond’s roman of 1592, printed in 14-point, shows a word spacing in all lines of 2 points only, which is one-seventh of an em! This means that we cannot call three-to-em word spacing particularly tight. Jan Tschichold The Form Of The Book (1975)

Quarter Space

One fourth of an em space. Some authors believe quarter space to be the primary word space.

For a normal text face in a normal text size, a typical value for the word space is a quarter of an em, which can be written M/4. (A quarter of an em is typically about the same as, or slightly more than, the set-width of the letter t.) Robert Bringhurst The Elements of Typographic Style (1992)

Thin Space

⅕ of an em space. It is common that thin space equals about half the standard one, which is why thin space is used where standard word space would be too wide. For example, thin space is often utilised for spacing a dash in cases where standard space is too wide. Thin space is also used for spacing initials, from each other and from the surname:

Standard space in Spectral is too wide to be used for spacing initials and dashes

Thin spaces look more neat, better connecting initials with a surname and two parts of a sentence with each other

French typographic tradition prescribes the use of either thin or hair spaces to space any two-part symbols: exclamation mark, question mark, semicolon, etc.

Regardless of the language, such glyphs as question mark and exclamation mark typically are very visible in lowercase, but they can get lost in an all-caps typeset — in this case, one should finely space them.

Sixth Space

The sixth space is used when the thin space is too large.

Hair Space

The narrowest of spaces. In metal type, it was equal to 1/10 of an em space, in the digital age it is mostly 1/ 24 of an em. It might be useful if a certain typeface’s punctuation marks have too tight sidebearings, but a thin space would be too wide. For example, you can use hair space to space dashes instead of thin one — everything depends on the sidebearings and the design of the particular typeface.

You should keep in mind that after you change font, selected space glyphs will remain, but their width can change, — and this will affect the texture.

Isn’t it ridiculous when a punctuation mark, relating to the entire preceding phrase, is tied to one last word of the said phrase? And, vice versa, how unfortunately it looks when there is a large gap between this mark and the previous word. As a matter of fact, it is about time type foundry workers started thinking about it and cast the punctuation mark with an extra sidebearing on its left. However, typefounders are not always, or rather rarely, that forethoughtful, and also they are used to cast all letters without generous sidebearings. During punching of matrices, the beauty of spacing punctuation marks is also barely remembered. Therefore, it is your burden and responsibility to fix this and even more it is the one of compositors. These latter dislike 1-pt spaces, however it is this very thin space that can save the typeset beauty in these situations. That is why, with punctuation marks , ;. … : ! ? you should insist on putting 1-pt (hair) space before those symbols — but only when those don’t have an extra sidebearing on their left. If you are in charge of purchasing a typeface for the printing establishment, regard this issue when ordering typefaces, make the foundry give consideration to the beauty of their work and this particular detail. M. I. Schelkunov History, Technique, Art of Printing (1926)

Spacing in justified texts

Full justification — that is, alignment of text to its both margins, — is still commonly used in books and magazines. When the text is justified, the width of word spaces is not constant, it is changing to distribute words to the entire width of the line. In this situation, the uniformity of spacing could be even more important than the very width of these spaces: evenly large spaces in the entire page are better than large spaces in only one line. That is why, no matter how optimised the typeface’s word spacing in terms of its width is, it will not be enough for typesetting a justified text. While in metal type all spaces were set manually, and a typesetter knew what space they should add for even typesetting, nowadays it’s a computer that defines the length of spaces for justified texts. The algorithm divides the remaining space into equal parts and adds them to regular spaces. In doing so, the algorithm ignores letters, syntax, and punctuation, which is why when typesetting justified texts one should always double-check and adjust spacing manually.

In Indesign, it is possible to set minimum and maximum word spacing width for fully justified text typesetting: the width of standard space is used as a basis 100 %, maximum is normally about 120 %, minimum is about 80 %.

If the text is justified, a reasonable minimum word space is a fifth of an em (M/5), and M/4 is a good average to aim for. A reasonable maximum in justified text is M/2. If it can be held to M/3, so much the better. But for loosely fitted faces, or text set in a small size, M/3 is often a better average to aim for, and a better minimum is M/4. In a line of widely letterspaced capitals, a word space of M/2 or more may be required. Robert Bringhurst The Elements of Typographic Style (1992)

Robert Bringhurst recommends choosing appropriate spaces based on an em. However, space is a relative value, so in justified texts you should consider not the width of some abstract em, but rather the width of space in particular font.

The optimal word space width in justified texts is ephemeral and changes depending on typeface, point size, line width, line spacing, and many other factors. That is why in Indesign you can’t set maximum and minimum values once and for all cases — you will have to choose the best possible options manually.

In setting justified texts, standard word space width becomes a fluctuating value. The fixed width space and all additional spaces with constant width can help better control the setting.

The more even are the gaps between words, the better <…>. In no case shall you allow a considerable disparity in space widths, while an insignificant difference won’t ruin the beauty of typesetting. Pyotr Kolomnin A Concise Account of Typography (1899)

Figure Space

Figure space, or numeric space, is used for typesetting tables and sheets. If a typeface is fitted with tabular figures, its figure space will be equal to the width of tabular figures. Figure space is a non-breaking one.

Normally, figure space is significantly wider than standard space, it will be helpful when you need to even a large amount of multi-digit numbers

Punctuation Space

In most cases, the width of this space is equal to the glyph width of a period or a colon. May be of use in making up numbers in tables where digits are defined by a spacing element instead of period or colon.

Narrow No-break Space

A thin space that prevents an automatic line break. The name of this symbol in Unicode causes additional confusion: Narrow in this case is the same thing as Thin, and Narrow Space has the same width as Thin Space does.

In some applications, such as InDesign, the simple regular thin space is non-breaking by default and is called with Thin Space. In other cases it’s a separate symbol, for example, the Web uses Narrow No-break Space.

Spaces in layout

The distribution of white space in text setting is a highly important factor, responsible for the neat design and the content’s clear structure. Many designers keep in mind correlation between point size, line width, and margins, but some tend to forget that word spacing is an equivalent factor of these relations.

Body text font, designed for smaller sizes, would require smaller spacing and word spaces if used to set a large headline. The point size gets more important in determining spacing and white unprinted area in general, than whether it is a text typeface or a display one.

It is also necessary to consider spacing when you’re dealing with particular elements of the text. For instance, small-caps or all-caps fragments quite often should be additionally spaced. Manual spacing is sometimes necessary in bold or italic styles, or even if no additional styles are applied at all.

Small-caps spacing in Charter is too tight by default, more white space is needed

In William, small caps are taken care of, this generous spacing doesn’t require additional adjustment

A text set in a quality typeface sometimes needs manual adjustment: standard word space in Guyot is clearly not enough for the of ‘i’ combination

White spaces in software

Most typically, in non-professional software and web services there are only standard and non-breaking spaces available. You might be able to set additional symbols using clipboard almost anywhere where Unicode is supported. That said, you have to check everytime: for example, at the time of writing this piece, Facebook allows for inserting additional symbols in its input field, but automatically replaces them while posting.

Speaking of the Web, additional spaces are available as HTML special characters: if you use them, your source code might become a bit cluttered, but that would allow you to control the placing of each non-standard space. Please note that different browsers might render spacing differently, and not so long ago some of them even ignored additional spaces, replacing them by regular ones. You should check on the correct display of additional spaces where you use it.

Two industry standards for text formatting and typesetting, InDesign and Quark Xpress, support all kinds of spaces. Today, type designers usually include at least thin and hair spaces. Their width might vary from one typeface to another — but the typographer, at least, has more control over the word spacing.

In InDesign, an additional space not included in the typeface would still be visible, but its width would be defined by the software with no regard to what kind of typeface it is. For example, hair space in 24pt size will be 1pt — both in a display face with tight spacing and in a text face with loose spacing.

Spaces calculated this way are not always suitable for your task. Depending on the typeface, the additional space width suggested by InDesign can be insufficient or excessive. And if you export the text with such spaces from InDesign to Figma, their width will most likely change — every software may have its own algorithms for calculating these values.

Be vigilant and trust your eye: it is not mathematical values that matter, but a convincing, reasonable relationship between the black and the white.

These dashes are spaced by hair spaces provided by the typeface

The typefaces above have no hair space, therefore its width is set automatically

With x-height and spacing that Arno Pro and RIA Text have, the InDesign’s hair space is good enough. Whereas in IBM Plex we perhaps should put thin space instead of a hair one

Whitespace characters are among the most important typographic elements. Alongside sidebearings, they define text rhythm and organise blocks of information. Disregard for white spaces can ruin relations between them: line and word spacing, word spacing and column-gap. In such case the reader wouldn’t be able to easily track the line and would have to put additional unless this is your intended goal, you should always consider how different sorts of white space work with each other.

Summary table

Non-breaking space	MacOS: Alt + Space Windows: Alt+0160 Unicode: U00A0 HTML:   Indesign: Type → Insert White Space → Nonbreaking Space or Alt + Cmnd + X in case you need a space of non-changing width, in a justified text layout: Type → Insert White Space → Nonbreaking Space (Fixed Width)
Thin space	Unicode: U2009 HTML:   Indesign: Type → Insert White Space → Thin Space or Shift + Alt + Cmnd + M
Thin non-breaking space (for Web)	Unicode: U202F HTML:
Em space	Unicode: U2003 HTML: &emsp; Indesign: Type → Insert White Space → Em Space
En space	Unicode: U2002 HTML: &ensp; Indesign: Type → Insert White Space → En Space
Third space	Unicode: U2004 HTML: &emsp13; Indesign: Type → Insert White Space → Third Space
Quarter space	Unicode: U2005 HTML: &emsp14; Indesign: Type → Insert White Space → Quarter Space
Sixth space	Unicode: U2006 HTML:   Indesign: Type → Insert White Space → Sixth Space
Hair space	Unicode: U200A HTML: &hairsp; Indesign: Type → Insert White Space → Hair Space
Figure space	Unicode: U2007 HTML: &numsp; Indesign: Type → Insert White Space → Figure Space
Punctuation space	Unicode: U2008 HTML: &puncsp; Indesign: Type → Insert White Space → Punctuation Space

References

In English

Kirill Belyayev, Whitespaces and zero width characters with buttons for copying to clipboard, short mnemonics and usage comments
Robert Bringhurst, The Elements of Typographic Style
Matthew Butterick, Butterick’s Practical Typography
Jost Hochuli, Detail in Typography
Yves Peters, Adventures in Space (fontshop.com)
María Ramos Silva, Type design for typewriters: Olivetti
Erik Spiekermann, Stop stealing sheep & find out how type works
Jan Tschichold, The Form Of The Book
Martin Wichary, Space Yourself (smashingmagazine.com)

In Russian

Pyotr Kolomnin, A Concise Account of Typography
Alexandra Korolkova, Living Typography
M. I. Schelkunov, History, Technique, Art of Printing
Alexei Yozhikov, (Nearly) Everything You Need To Know About Whitespace (habr.com)

Read the whole story

emrox

11 days ago

reply

Hamburg, Germany

The Greatest by Doug
Tuesday November 25^th, 2025 at 5:46 AM

Savage Chickens – Cartoons on Sticky Notes by Doug Savage

The Greatest

And more job satisfaction.

Read the whole story

emrox

25 days ago

reply

Hamburg, Germany

The Optimist by Doug
Thursday November 20^th, 2025 at 4:47 AM

Savage Chickens – Cartoons on Sticky Notes by Doug Savage

The Optimist

And more optimism.

Read the whole story

emrox

30 days ago

reply

Hamburg, Germany

Building a Simple Search Engine That Actually Works
Monday November 17^th, 2025 at 5:42 AM

Karboosx

Why Build Your Own?

Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.

Sometimes you just want something that:

Works with your existing database
Doesn't require external services
Is easy to understand and debug
Actually finds relevant results

That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.

The Core Idea

The concept is simple: tokenize everything, store it, then match tokens when searching.

Here's how it works:

Indexing: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
Searching: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
Scoring: We use the stored weights to calculate relevance scores

The magic is in the tokenization and weighting. Let me show you what I mean.

Building Block 1: The Database Schema

We need two simple tables: index_tokens and index_entries.

index_tokens

This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.

// index_tokens table structure
id | name    | weight
---|---------|-------
1  | parser  | 20     // From WordTokenizer
2  | parser  | 5      // From PrefixTokenizer
3  | parser  | 1      // From NGramsTokenizer
4  | parser  | 10     // From SingularTokenizer

Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.

The unique constraint is on (name, weight), so the same token name can exist multiple times with different weights.

index_entries

This table links tokens to documents with field-specific weights.

// index_entries table structure
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 2000
2  | 2        | 1             | 1        | 42          | 500

The weight here is the final calculated weight: field_weight × tokenizer_weight × ceil(sqrt(token_length)). This encodes everything we need for scoring. We will talk about scoring later in the post.

We add indexes on:

(document_type, document_id) - for fast document lookups
token_id - for fast token lookups
(document_type, field_id) - for field-specific queries
weight - for filtering by weight

Why this structure? Simple, efficient, and leverages what databases do best.

Building Block 2: Tokenization

What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like ["parser"], ["par", "pars", "parse", "parser"], or ["par", "ars", "rse", "ser"] depending on which tokenizer we use.

Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.

All tokenizers implement a simple interface:

interface TokenizerInterface
{
    public function tokenize(string $text): array;  // Returns array of Token objects
    public function getWeight(): int;               // Returns tokenizer weight
}

Simple contract, easy to extend.

Word Tokenizer

This one is straightforward—it splits text into individual words. "parser" becomes just ["parser"]. Simple, but powerful for exact matches.

First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:

class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Normalize: lowercase, remove special chars
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);

Next, we split into words and filter out short ones:

        // Split into words, filter short ones
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);

Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.

Finally, we return unique words as Token objects:

        // Return as Token objects with weight
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}

Weight: 20 (high priority for exact matches)

Prefix Tokenizer

This generates word prefixes. "parser" becomes ["par", "pars", "parse", "parser"] (with min length 4). This helps with partial matches and autocomplete-like behavior.

First, we extract words (same normalization as WordTokenizer):

class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}
    
    public function tokenize(string $text): array
    {
        // Normalize same as WordTokenizer
        $words = $this->extractWords($text);

Then, for each word, we generate prefixes from the minimum length to the full word:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Generate prefixes from min length to full word
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // Use associative array for uniqueness
            }
        }

Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.

Finally, we convert the keys to Token objects:

        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 5 (medium priority)

Why min length? Avoid too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.

N-Grams Tokenizer

This creates character sequences of a fixed length (I use 3). "parser" becomes ["par", "ars", "rse", "ser"]. This catches typos and partial word matches.

First, we extract words:

class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}
    
    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);

Then, for each word, we slide a window of fixed length across it:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Sliding window of fixed length
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }

The sliding window: for "parser" with length 3, we get:

Position 0: "par"
Position 1: "ars"
Position 2: "rse"
Position 3: "ser"

Why this works? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".

Finally, we convert to Token objects:

        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 1 (low priority, but catches edge cases)

Why 3? Balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.

Normalization

All tokenizers do the same normalization:

Lowercase everything
Remove special characters (keep only alphanumerical)
Normalize whitespace (multiple spaces to single space)

This ensures consistent matching regardless of input format.

Building Block 3: The Weight System

We have three levels of weights working together:

Field weights: Title vs content vs keywords
Tokenizer weights: Word vs prefix vs n-gram (stored in index_tokens)
Document weights: Stored in index_entries (calculated: field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Final Weight Calculation

When indexing, we calculate the final weight like this:

$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));

For example:

Title field: weight 10
Word tokenizer: weight 20
Token "parser": length 6
Final weight: 10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600

Why use ceil(sqrt())? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use ceil() to round up to the nearest integer, keeping weights as whole numbers.

Tuning Weights

You can adjust weights for your use case:

Increase field weights for titles if titles are most important
Increase tokenizer weights for exact matches if you want to prioritize exact matches
Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less

You can see exactly how weights are calculated and adjust them as needed.

Building Block 4: The Indexing Service

The indexing service takes a document and stores all its tokens in the database.

The Interface

Documents that can be indexed implement IndexableDocumentInterface:

interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}

To make a document searchable, you implement these three methods:

class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }
    
    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }
    
    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);
        
        // Add keywords if present
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }
        
        return $fields;
    }
}

Three methods to implement:

getDocumentType(): returns the document type enum
getDocumentId(): returns the document ID
getIndexableFields(): builds fields with weights using fluent API

You can index documents:

On create/update (via event listeners)
Via commands: app:index-document, app:reindex-documents
Via cron (for batch reindexing)

How It Works

Here's the indexing process, step by step.

First, we get the document information:

class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. Get document info
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();

The document provides its fields and weights via the IndexableFields builder.

Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:

        // 2. Remove existing index for this document
        $this->removeDocumentIndex($documentType, $documentId);
        
        // 3. Prepare batch insert data
        $insertData = [];

Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.

Now, we process each field. For each field, we run all tokenizers:

        // 4. Process each field
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }
            
            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;
            
            // 5. Run all tokenizers on this field
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);

For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:

                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;
                    
                    // 6. Find or create token in index_tokens
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);
                    
                    // 7. Calculate final weight
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));
                    
                    // 8. Add to batch insert
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }

Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.

Finally, we batch insert everything:

        // 9. Batch insert for performance
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }

The findOrCreateToken method is straightforward:

    private function findOrCreateToken(string $name, int $weight): int
    {
        // Try to find existing token with same name and weight
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();
        
        if ($result) {
            return (int) $result['id'];
        }
        
        // Create new token
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);
        
        return (int) $this->connection->lastInsertId();
    }
}

Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.

The key points:

We remove old index first (handles updates)
We batch insert for performance (one query instead of many)
We find or create tokens (avoids duplicates)
We calculate final weight on the fly

Building Block 5: The Search Service

The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.

How It Works

Here's the search process, step by step.

First, we tokenize the query using all tokenizers:

class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. Tokenize query using all tokenizers
        $queryTokens = $this->tokenizeQuery($query);
        
        if (empty($queryTokens)) {
            return [];
        }

If the query produces no tokens (e.g., only special characters), we return empty results.

Why Tokenize the Query Using the Same Tokenizers?

Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.

Example:

Indexing with PrefixTokenizer creates tokens: "par", "pars", "parse", "parser"
Searching with only WordTokenizer creates token: "parser"
We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
Result: Incomplete matches, missing relevant documents!

The solution: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.

This is why the SearchService and SearchIndexingService both receive the same set of tokenizers.

Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:

        // 2. Extract unique token values
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));

Why extract values? We search by token name, not by weight. We need the unique token names to search for.

Then, we sort tokens by length (longest first). This prioritizes specific matches:

        // 3. Sort tokens (longest first - prioritize specific matches)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));

Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.

We also limit the token count to prevent DoS attacks with huge queries:

        // 4. Limit token count (prevent DoS with huge queries)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }

Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).

Now, we execute the optimized SQL query. The executeSearch() method builds the SQL query and executes it:

        // 5. Execute optimized SQL query
        $results = $this->executeSearch($documentType, $tokenValues, $limit);

Inside executeSearch(), we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:

private function executeSearch(DocumentType $documentType, array $tokenValues, int $tokenCount, ?int $limit, int $minTokenWeight): array
{
    // Build parameter placeholders for token values
    $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));
    
    // Build the SQL query (shown in full in "The SQL Query" section below)
    $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";
    
    // Build parameters array
    $params = [
        $documentType->value,  // document_type
        ...$tokenValues,       // token values for IN clause
        $documentType->value,  // for subquery
        ...$tokenValues,       // token values for subquery
        $minTokenWeight,      // minimum token weight
        // ... more parameters
    ];
    
    // Execute query with parameter binding
    $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();
    
    // Filter out results with low normalized scores (below threshold)
    $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);
    
    // Convert to SearchResult objects
    return array_map(
        fn($result) => new SearchResult(
            documentId: (int) $result['document_id'],
            score: (float) $result['score']
        ),
        $results
    );
}

The SQL query does the heavy lifting: finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.

The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).

We'll see the full query in the next section.

The main search() method then returns the results:

        // 5. Return results
        return $results;
    }
}

The Scoring Algorithm

The scoring algorithm balances multiple factors. Let's break it down step by step.

The base score is the sum of all matched token weights:

SELECT 
    sd.document_id,
    SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
GROUP BY sd.document_id

sd.weight: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Why not multiply by st.weight? The tokenizer weight is already included in sd.weight during indexing. The st.weight from index_tokens is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).

This gives us the raw score. But we need more than that.

We add a token diversity boost. Documents matching more unique tokens score higher:

(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score

Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.

We also add an average weight quality boost. Documents with higher quality matches score higher:

(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score

Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.

We apply a document length penalty. Prevents long documents from dominating:

base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))

Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.

Finally, we normalize by dividing by the maximum score:

score / GREATEST(1.0, max_score) as normalized_score

This gives us a 0-1 range, making scores comparable across different queries.

The full formula looks like this:

SELECT 
    sd.document_id,
    (
        SUM(sd.weight) *                                  -- Base score
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * -- Token diversity boost
        (1.0 + LOG(1.0 + AVG(sd.weight))) /              -- Average weight quality boost
        (1.0 + LOG(1.0 + doc_token_count.token_count))   -- Document length penalty
    ) / GREATEST(1.0, max_score) as score                -- Normalization
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
    AND sd.document_id IN (
        SELECT DISTINCT document_id
        FROM index_entries sd2
        INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
        WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ?  -- Ensure at least one token with meaningful weight
    )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?

Why the subquery with st2.weight >= ?? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.

Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.

If no results with weight 10, we retry with weight 1 (fallback for edge cases).

Converting IDs to Documents

The search service returns SearchResult objects with document IDs and scores:

class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}

But we need actual documents, not just IDs. We convert them using repositories:

// Perform search
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// Get document IDs from search results (preserving order)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// Get documents by IDs (preserving order from search results)
$documents = $this->documentRepository->findByIds($documentIds);

Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.

The repository method handles the conversion:

public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }
    
    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)')  // Preserve order from IDs array
        ->getQuery()
        ->getResult();
}

The FIELD() function preserves the order from the IDs array, so documents appear in the same order as search results.

The Result: What You Get

What you get is a search engine that:

Finds relevant results quickly (leverages database indexes)
Handles typos (n-grams catch partial matches)
Handles partial words (prefix tokenizer)
Prioritizes exact matches (word tokenizer has highest weight)
Works with existing database (no external services)
Easy to understand and debug (everything is transparent)
Full control over behavior (adjust weights, add tokenizers, modify scoring)

Extending the System

Want to add a new tokenizer? Implement TokenizerInterface:

class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Your stemming logic here
        // Return array of Token objects
    }
    
    public function getWeight(): int
    {
        return 15; // Your weight
    }
}

Register it in your services configuration, and it's automatically used for both indexing and searching.

Want to add a new document type? Implement IndexableDocumentInterface:

class Comment implements IndexableDocumentInterface
{
    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}

Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.

Conclusion

So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.

The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.

You own it, you control it, you can debug it. And that's worth a lot.

Read the whole story

emrox

33 days ago

reply

Hamburg, Germany

Learning Experience by Doug
Tuesday November 11^th, 2025 at 3:23 AM

Savage Chickens – Cartoons on Sticky Notes by Doug Savage

Learning Experience

And more learning.

Read the whole story

emrox

39 days ago

reply

Hamburg, Germany

Ideas by Doug Monday December 15th, 2025 at 5:24 AM

Manual: Spaces Tuesday December 9th, 2025 at 3:15 AM

Word space

Non-breaking Space

Additional spaces

Em Space

En Space

Three-per-em Space, Third Space

Quarter Space

Thin Space

Sixth Space

Hair Space

Spacing in justified texts

Figure Space

Punctuation Space

Narrow No-break Space

Spaces in layout

White spaces in software

Summary table

References

In English

In Russian

The Greatest by Doug Tuesday November 25th, 2025 at 5:46 AM

The Optimist by Doug Thursday November 20th, 2025 at 4:47 AM

Building a Simple Search Engine That Actually Works Monday November 17th, 2025 at 5:42 AM

Why Build Your Own?

The Core Idea

Building Block 1: The Database Schema

index_tokens

index_entries

Building Block 2: Tokenization

Word Tokenizer

Prefix Tokenizer

N-Grams Tokenizer

Normalization

Building Block 3: The Weight System

Final Weight Calculation

Tuning Weights

Building Block 4: The Indexing Service

The Interface

How It Works

Building Block 5: The Search Service

How It Works

Why Tokenize the Query Using the Same Tokenizers?

The Scoring Algorithm

Converting IDs to Documents

The Result: What You Get

Extending the System

Conclusion

Learning Experience by Doug Tuesday November 11th, 2025 at 3:23 AM

Ideas by Doug
Monday December 15^th, 2025 at 5:24 AM

Manual: Spaces
Tuesday December 9^th, 2025 at 3:15 AM

The Greatest by Doug
Tuesday November 25^th, 2025 at 5:46 AM

The Optimist by Doug
Thursday November 20^th, 2025 at 4:47 AM

Building a Simple Search Engine That Actually Works
Monday November 17^th, 2025 at 5:42 AM

Learning Experience by Doug
Tuesday November 11^th, 2025 at 3:23 AM