
Databases in 2025: A Year in Review


Another year passes. I was hoping to write more articles instead of just these end-of-the-year screeds, but I almost died in the spring semester, and it sucked up my time. Nevertheless, I will go through what I think are the major trends and happenings in databases over the last year.

There were many exciting and unprecedented developments in the world of databases. Vibe coding entered the vernacular. The Wu-Tang Clan announced their time capsule project. And rather than raising one massive funding round this year instead of going public, Databricks raised two.

Meanwhile, other events were expected and less surprising. Redis Ltd. switched their license back one year after their rugpull (I called this shot last year). SurrealDB reported great benchmark numbers because they weren't flushing writes to disk and lost data. And Coldplay can break up your marriage. Astronomer did make some pretty good lemonade on that last one though.

Before I begin, I want to address the question I get every year in the comments about these articles. People always ask why I don't mention a particular system, database, or company in my analysis. I can only write about so many things, and unless something interesting/notable happened in the past year, then there is nothing to really discuss. But not all notable database events are appropriate for me to opine about. For example, the recent attempt to unmask the AvgDatabase CEO is fair game, but the MongoDB suicide lawsuit is decidedly not.

With that out of the way, let's do this. These articles are getting longer each year, so I apologize in advance.


The Dominance of PostgreSQL Continues

I first wrote about how PostgreSQL was eating the database world in 2021. That trend continues unabated, as many of the most interesting developments in the database world are happening once again with PostgreSQL. The DBMS's latest version (v18) dropped in September 2025. The most prominent feature is the new asynchronous I/O storage subsystem, which will finally put PostgreSQL on the path to dropping its reliance on the OS page cache. It also added support for skip scans: queries can still use multi-key B+Tree indexes even if they do not filter on the leading key(s) (i.e., the index prefix). There are some additional improvements to the query optimizer (e.g., removing superfluous self-joins).
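
To make the skip-scan part concrete, here is a minimal sketch (Python with psycopg against a hypothetical orders table; the table, index, and DSN are my own illustration, not from the release notes). The index leads with region, but the query filters only on status; whether the planner actually picks a skip scan still depends on the data and its statistics.

    import psycopg

    with psycopg.connect("dbname=demo") as conn, conn.cursor() as cur:
        # hypothetical table with a two-column index that leads with "region"
        cur.execute("CREATE TABLE IF NOT EXISTS orders (region text, status text, total numeric)")
        cur.execute("CREATE INDEX IF NOT EXISTS orders_region_status_idx ON orders (region, status)")

        # The filter only touches the *second* index column. Before v18 this
        # usually meant a sequential scan or a full index scan; with skip scans
        # the planner can walk the (region, status) B+Tree by skipping across
        # the distinct values of the leading "region" column.
        cur.execute("EXPLAIN SELECT count(*) FROM orders WHERE status = 'shipped'")
        for (line,) in cur.fetchall():
            print(line)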

Savvy database connoisseurs will be quick to point out that these are not groundbreaking features and that other DBMSs have had them for years. PostgreSQL is the only major DBMS still relying on the OS page cache. And Oracle has supported skip scans since 2002 (v9i)! Why, then, am I claiming that the hottest action in databases for 2025 happened with PostgreSQL?

The reason is that most of the database energy and activity is going into PostgreSQL companies, offerings, projects, and derivative systems.

Acquisitions + Releases:

In the last year, the hottest data start-up (Databricks) paid $1b for a PostgreSQL DBaaS company (Neon). Next, one of the biggest database companies in the world (Snowflake) paid $250m for another PostgreSQL DBaaS company (CrunchyData). Then, one of the biggest tech companies on the planet (Microsoft) launched a new PostgreSQL DBaaS (HorizonDB). Neon and HorizonDB follow Amazon Aurora's original high-level architecture from the 2010s, with a single primary node separating compute and storage. For now, Snowflake's PostgreSQL DBaaS uses the same core architecture as standard PostgreSQL because they built on Crunchy Bridge.

Distributed PostgreSQL:

All of the services I listed above are single-primary node architectures. That is, applications send writes to a primary node, which then sends those changes to secondary replicas. But in 2025, there were two announcements on new projects to create scale-out (i.e., horizontal partitioning) services for PostgreSQL. In June 2025, Supabase announced that it had hired Sugu, the Vitess co-creator and former PlanetScale co-founder/CTO, to lead the Multigres project to create sharding middleware for PostgreSQL, similar to how Vitess shards MySQL. Sugu left PlanetScale in 2023 and had to lie back in the cut for two years. He is now likely clear of any legal issues and can make things happen at Supabase. You know it is a big deal when a database engineer joins a company, and the announcement focuses more on the person than the system. The co-founder/CTO of SingleStore joined Microsoft in 2024 to lead HorizonDB, but Microsoft (incorrectly) did not make a big deal about it. Sugu joining Supabase is like Ol' Dirty Bastard (RIP) getting out on parole after two years and then announcing a new record deal on the first day of his release.

One month after the Multigres news dropped, PlanetScale announced its own Vitess-for-PostgreSQL project, Neki. PlanetScale launched its initial PostgreSQL DBaaS in March 2025, but the core architecture is single-node stock PostgreSQL with pgBouncer.

Commercial Landscape:

With Microsoft's introduction of HorizonDB in 2025, all major cloud vendors now have serious projects for their own PostgreSQL offerings. Amazon has offered Aurora PostgreSQL since 2017. Google put out AlloyDB in 2022. Even the old flip-phone IBM has had its cloud version of PostgreSQL since 2018. Oracle released its PostgreSQL service in 2023, though there is a rumor that its in-house PostgreSQL team was collateral damage in its MySQL OCI layoffs in September 2025. ServiceNow launched its RaptorDB service in 2024, based on its 2021 acquisition of Swarm64.

Yes, I know Microsoft bought Citus in 2019. Citus was rebranded as Azure Database for PostgreSQL Hyperscale in 2019 and was then renamed to Azure Cosmos DB for PostgreSQL in 2022. But then there is Azure Database for PostgreSQL with Elastic Clusters that also uses Citus, but it is not the same as the Citus-powered Azure Cosmos DB for PostgreSQL. Wait, I might be wrong about this. Microsoft discontinued Azure PostgreSQL Single Server in 2023, but kept Azure PostgreSQL Flexible Server. It is sort of like how Amazon could not resist adding "Aurora" to the DSQL's name. Either way, at least Microsoft was smart enough to keep the name for their new system to just "Azure HorizonDB" (for now).

There are still a few independent (ISV) PostgreSQL DBaaS companies. Supabase is likely the largest of these by the number of instances. Others include YugabyteDB, TigerData (née Timescale), PlanetScale, Xata, pgEdge, and Nile. Other systems provide a Postgres-compatible front-end, but the back-end systems are not derived from PostgreSQL (e.g., CockroachDB, CedarDB, Spanner). Xata built its original architecture on Amazon Aurora, but this year, it announced it is switching to its own infrastructure. Tembo dropped its hosted PostgreSQL offering in 2025 to pivot to a coding agent that can do some database tuning. ParadeDB has yet to announce its hosted service. Hydra and PostgresML went bust in 2025 (see below), so they're out of the game. There are also hosting companies that offer PostgreSQL DBaaS alongside other systems, such as Aiven and Tessell.

Andy's Take:

It is not clear who the next major buyer will be after Databricks and Snowflake bought PostgreSQL companies. Again, every major tech company already has a Postgres offering. EnterpriseDB is the oldest PostgreSQL ISV, but missed out on the two most significant PostgreSQL acquisitions in the last five years. But they can ride along with Bain Capital's jock for a while, I guess, or hope that HPE buys them, even though that partnership is from eight years ago. This M&A landscape is reminiscent of OLAP acquisitions in the late 2000s, when Vertica was the last one waiting at the bus stop after AsterData, Greenplum, and DATAllegro were acquired.

The development of the two competing distributed PostgreSQL projects (Multigres, Neki) is welcome news. These projects are not the first time somebody has attempted this: Greenplum, ParAccel, and Citus have been around for two decades for OLAP workloads. Yes, Citus supports OLTP workloads, but they started in 2010 with a focus on OLAP. For OLTP, 15 years ago, the NTT RiTaDB project joined forces with GridSQL to create Postgres-XC. Developers from Postgres-XC founded StormDB, which Translattice later acquired in 2013. Postgres-X2 was an attempt to modernize XC, but the developers abandoned that effort. Translattice open-sourced StormDB as Postgres-XL, but the project has been dormant since 2018. YugabyteDB came out in 2016 and is probably the most widely deployed sharded PostgreSQL system (and remains open-source!), but it is a hard fork, so it is only compatible with PostgreSQL v15. Amazon announced its own sharded PostgreSQL (Aurora Limitless) in 2024, but it is closed source.

The PlanetScale squad has no love for the other side and throws hands at Neon and Timescale. Database companies popping off at each other is nothing new (see Yugabyte vs. CockroachDB). I suspect we will see more of this in the future as the PostgreSQL wars heat up. I suggest that these smaller companies call out the big cloud vendors and not fight with each other.

MCP For Every Database!

If 2023 was the year every DBMS added a vector index, then 2025 was the year that every DBMS added support for Anthropic's Model Context Protocol (MCP). MCP is a standardized client-server JSON-RPC interface that lets LLMs interact with external tools and data sources without requiring custom glue code. An MCP server acts as middleware in front of a DBMS and exposes a listing of tools, data, and actions it provides. An MCP client (e.g., an LLM host such as Claude or ChatGPT) discovers and uses these tools to extend its models' capabilities by sending requests to the server. In the case of databases, the MCP server converts these queries into the appropriate database query (e.g., SQL) or administrative command. In other words, MCP is the middleman who keeps the bricks counted and the cream straight, so the database and LLMs trust each other enough to do business.
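
To give a rough idea of the shape of that middleman role, here is a toy sketch of the database side of an MCP-style exchange. The tool and field names are illustrative rather than any vendor's actual server, and SQLite stands in for the real DBMS; a production MCP server speaks JSON-RPC 2.0 over stdio or HTTP and adds schema discovery, authentication, and error handling.

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")          # stand-in for the real DBMS
    db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

    def handle(request: dict) -> dict:
        """Translate an MCP-style JSON-RPC request into a database query."""
        if request["method"] == "tools/list":
            tools = [{"name": "run_query",
                      "description": "Run a read-only SQL query",
                      "inputSchema": {"type": "object",
                                      "properties": {"sql": {"type": "string"}}}}]
            return {"jsonrpc": "2.0", "id": request["id"], "result": {"tools": tools}}
        if request["method"] == "tools/call":
            sql = request["params"]["arguments"]["sql"]
            rows = db.execute(sql).fetchall()
            return {"jsonrpc": "2.0", "id": request["id"],
                    "result": {"content": [{"type": "text", "text": json.dumps(rows)}]}}
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "unknown method"}}

    # What an LLM host might send after discovering the tool:
    print(handle({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                  "params": {"name": "run_query",
                             "arguments": {"sql": "SELECT * FROM users"}}}))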

Anthropic announced MCP in November 2024, but it really took off in March 2025 when OpenAI announced it would support MCP in its ecosystem. Over the next few months, every DBMS vendor released MCP servers for all system categories: OLAP (e.g., ClickHouse, Snowflake, Firebolt, Yellowbrick), SQL (e.g., YugabyteDB, Oracle, PlanetScale), and NoSQL (e.g., MongoDB, Neo4j, Redis). Since there is no official Postgres MCP server, every Postgres DBaaS has released its own (e.g., Timescale, Supabase, Xata). The cloud vendors released multi-database MCP servers that can talk to any of their managed database services (e.g., Amazon, Microsoft, Google). Allowing a single gateway to talk to heterogeneous databases is almost, but not quite, a holy-grail federated database. As far as I know, each request in these MCP servers targets only a single database at a time, so the application is responsible for performing joins across sources.

Beyond the official vendor MCP implementations, there are hundreds of rando MCP server implementations for nearly every DBMS. Some of them attempt to support multiple systems (e.g., DBHub, DB MCP Server). DBHub put out a good overview of PostgreSQL MCP servers.

An interesting feature that has proven helpful for agents is database branching. Although not specific to MCP servers, branching allows agents to test database changes quickly without affecting production applications. Neon reported in July 2025 that agents create 80% of their databases. Neon was designed from the beginning to support branching (Nikita showed me an early demo when the system was still called "Zenith"), whereas other systems have added branching support later. See Xata's recent comparison article on database branching.
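
The appeal for agents is easy to see in pseudocode form. The client class and method names below are invented for illustration (every vendor exposes branching through its own API or CLI); the point is only the workflow: fork, experiment, validate, throw away.

    class BranchingDatabase:
        """Toy stand-in for a copy-on-write branching DBaaS API (names invented)."""

        def __init__(self):
            self.branches = {"main": []}

        def create_branch(self, name, parent="main"):
            # real systems do this as a cheap copy-on-write fork of storage
            self.branches[name] = list(self.branches[parent])

        def run(self, branch, statement):
            self.branches[branch].append(statement)

        def drop_branch(self, name):
            del self.branches[name]

    db = BranchingDatabase()
    db.create_branch("agent-test-42")                      # fork schema + data
    db.run("agent-test-42", "ALTER TABLE users ADD COLUMN plan text")
    db.run("agent-test-42", "UPDATE users SET plan = 'free'")
    # ...the agent validates the migration here, then promotes or discards it...
    db.drop_branch("agent-test-42")                        # "main" never changed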

Andy's Take:

On one hand, I'm happy that there is now a standard for exposing databases to more applications. But nobody should trust an application with unfettered database access, whether it is via MCP or the system's regular API. And it remains good practice only to grant minimal privileges to accounts. Restricting accounts is especially important with unmonitored agents that may start going wild all up in your database. This means that lazy practices like giving admin privileges to every account or using the same account for every service are going to get wrecked when the LLM starts popping off. Of course, if your company leaves its database open to the world while you cause the stock price of one of the wealthiest companies to drop by $600b, then rogue MCP requests are not your top concern.

From my cursory examination of a few MCP server implementations, they are simple proxies that translate the MCP JSON requests into database queries. There is no deep introspection to understand what the request aims to do and whether it is appropriate. Somebody is going to order 18,000 water cups in your application, and you need to make sure it doesn't crash your database. Some MCP servers have basic protection mechanisms (e.g., ClickHouse only allows read-only queries). DBHub provides a few additional protections, such as capping the number of returned records per request and implementing query timeouts. Supabase's documentation offers best-practice guidelines for MCP agents, but they rely on humans to follow them. And of course, if you rely on humans to do the right thing, bad things will happen.
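
Here is a sketch of what those guardrails look like when bolted onto a proxy, assuming a PostgreSQL backend via psycopg. The allow-list, row cap, and timeout values are illustrative rather than any vendor's defaults, and a naive statement-prefix check is no substitute for proper roles and privileges.

    import psycopg

    MAX_ROWS = 1_000
    TIMEOUT_MS = 5_000
    READ_ONLY_PREFIXES = ("select", "show", "explain")

    def guarded_query(dsn: str, sql: str):
        # crude read-only check; real privilege separation belongs in the DBMS roles
        if not sql.lstrip().lower().startswith(READ_ONLY_PREFIXES):
            raise PermissionError("only read-only statements are allowed")
        with psycopg.connect(dsn) as conn, conn.cursor() as cur:
            # bound how long a runaway agent query can hold resources
            cur.execute(f"SET statement_timeout = {int(TIMEOUT_MS)}")
            cur.execute(sql)
            return cur.fetchmany(MAX_ROWS)   # cap the result size per request

    # guarded_query("dbname=prod user=mcp_readonly", "SELECT * FROM orders LIMIT 10")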

Enterprise DBMSs already have automated guardrails and other safety mechanisms that open-source systems lack, and thus, they are better prepared for an agentic ecosystem. For example, IBM Guardium and Oracle Database Firewall identify and block anomalous queries. I am not trying to shill for these big tech companies. I know we will see more examples in the future of agents ruining lives, like accidentally dropping databases. Combining MCP servers with proxies (e.g., connection pooling) is an excellent opportunity to introduce automated protection mechanisms.

MongoDB, Inc. v. FerretDB Inc.

MongoDB has been the NoSQL stalwart for nearly two decades now. FerretDB was launched in 2021 by Percona's top brass to provide a middleware proxy that converts MongoDB queries into SQL for a PostgreSQL backend. This proxy allows MongoDB applications to switch over to PostgreSQL without rewriting queries.
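
The translation idea itself is straightforward to sketch (this is a toy illustration of the general approach, not FerretDB's actual implementation, which handles the full MongoDB wire protocol and operator set): a MongoDB-style equality filter maps cleanly onto a containment predicate over a PostgreSQL JSONB column.

    import json

    def find_to_sql(collection: str, filt: dict) -> tuple[str, list]:
        """Translate {"field": value, ...} into SQL over a JSONB "doc" column."""
        # Equality-only filters map onto JSONB containment; operators such as
        # $gt or $in need per-operator handling in a real translator.
        sql = f"SELECT doc FROM {collection} WHERE doc @> %s"
        return sql, [json.dumps(filt)]

    sql, params = find_to_sql("users", {"status": "active", "age": 30})
    print(sql)      # SELECT doc FROM users WHERE doc @> %s
    print(params)   # ['{"status": "active", "age": 30}']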

They coexisted for a few years before MongoDB sent FerretDB a cease-and-desist letter in 2023, alleging that FerretDB infringes MongoDB's patents, copyrights, and trademarks, and that it violates MongoDB's license for its documentation and wire protocol specification. This letter became public in May 2025 when MongoDB went nuclear on FerretDB by filing a federal lawsuit over these issues. Part of their beef is that FerretDB is out on the street, claiming they have a "drop-in replacement" for MongoDB without authorization. MongoDB's court filing has all the standard complaints about (1) misleading developers, (2) diluting trademarks, and (3) damaging their reputation.

The story is further complicated by Microsoft's announcement that it donated its MongoDB-compatible DocumentDB to the Linux Foundation. The project website mentions that DocumentDB is compatible with the MongoDB drivers and that it aims to "build a MongoDB compatible open source document database". Other major database vendors, such as Amazon and Yugabyte, are also involved in the project. From a cursory glance, this language seems similar to what MongoDB is accusing FerretDB of doing.

Andy's Take:

I could not find an example of a database company suing another one for replicating their API. The closest is Oracle suing Google for using a clean-room copy of the Java API in Android. The Supreme Court ultimately ruled in favor of Google on fair use grounds, and the case affected how re-implementation is treated legally.

I don't know how the lawsuit will play out if it ever goes to trial. A jury of random people off the street may not be able to comprehend the specifics of MongoDB's wire protocol, but they are definitely going to understand that the original name of FerretDB was MangoDB. It is going to be challenging to convince a jury that you were not trying to divert customers when you changed one letter in the other company's name. Never mind that it is not even an original name: there is already another DBMS called MangoDB that writes everything to /dev/null as a joke.

And while we are on the topic of database system naming, Microsoft's choice of "DocumentDB" is unfortunate. There are already Amazon DocumentDB (which, by the way, is also compatible with MongoDB, but Amazon probably pays for that), InterSystems DocDB, and Yugabyte DocDB. Microsoft's original name for "Cosmos DB" was also DocumentDB back in 2016.

Lastly, MongoDB's court filing claims they "...pioneered the development of 'non-relational' databases". This statement is incorrect. The first general-purpose DBMSs were non-relational because the relational model had not yet been invented. General Electric's Integrated Data Store (1964) used a network data model, and IBM's Information Management System (1966) used a hierarchical data model. MongoDB is also not the first document DBMS. That title goes to the object-oriented DBMSs from the late 1980s (e.g., Versant) or the XML DBMSs from the 2000s (e.g., MarkLogic). MongoDB is the most successful of these approaches by a massive margin (except maybe IMS).

File Format Battleground

File formats are an area of data systems that has been mostly dormant for the last decade. In 2011, Meta released a column-oriented format for Hadoop called RCFile. Two years later, Meta refined RCFile and announced the PAX-based ORC (Optimized Row Columnar) format. A month after ORC's release, Twitter and Cloudera released the first version of Parquet. Nearly 15 years later, Parquet is the dominant open-source file format.

In 2025, there were five new open-source file formats released vying to dethrone Parquet:

These new formats joined the other formats released in 2024:

SpiralDB made the biggest splash this year with their announcement of donating Vortex to the Linux Foundation and the establishment of their multi-organization steering committee. Microsoft quietly killed off Amudai (or at least closed-sourced it) at some point at the end of 2025. The other projects (FastLanes, F3, AnyBlox) are academic prototypes. AnyBlox won the VLDB Best Paper award this year.

This fresh competition has lit a fire in the Parquet developer community to modernize its features. See this in-depth technical analysis of the columnar file format landscape by the Parquet PMC Chair, Julien Le Dem.

Andy's Take:

The main problem with Parquet is not inherent in the format itself. The specification can evolve, and it has. Nobody expects organizations to rewrite petabytes of legacy files to update them to the latest Parquet version. The problem is that there are so many implementations of reader/writer libraries in different languages, each supporting a distinct subset of the specification. Our analysis of Parquet files in the wild found that 94% of them use only v1 features from 2013, even though their creation timestamps are after 2020. This lowest common denominator means that if someone creates a Parquet file using v2 features, it is unclear whether the reading system's library will be able to handle them.

I worked on the F3 file format with brilliant people at Tsinghua (Xinyu Zeng, Huanchen Zhang), CMU (Martin Prammer, Jignesh Patel), and Wes McKinney. Our focus is on solving this interoperability problem by providing both native decoders as shared objects (Rust crates) and embedded WASM versions of those decoders in the file. If somebody creates a new encoding and the DBMS does not have a native implementation, it can still read data using the WASM version by passing Arrow buffers. Each decoder targets a single column, allowing a DBMS to use a mix of native and WASM decoders for a single file. AnyBlox takes a different approach, generating a single WASM program to decode the entire file.
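
In spirit, the reader-side dispatch looks something like the sketch below (illustrative only; the real F3 reader is written in Rust and the names here are invented): use a native decoder when the engine ships one for the column's encoding, and fall back to the decoder embedded in the file as WASM otherwise.

    # encodings the engine ships native decoders for (stand-ins for real codecs)
    NATIVE_DECODERS = {
        "plain": lambda buf: buf,
        "rle":   lambda buf: buf,
    }

    def run_wasm_decoder(wasm_module: bytes, raw_bytes: bytes):
        # A real reader would instantiate the WASM module embedded in the file
        # (e.g., with a runtime like wasmtime) and exchange Arrow buffers with it.
        raise NotImplementedError("WASM fallback elided in this sketch")

    def decode_column(column_meta: dict, raw_bytes: bytes):
        native = NATIVE_DECODERS.get(column_meta["encoding"])
        if native is not None:
            return native(raw_bytes)                 # fast path: native code
        return run_wasm_decoder(column_meta["wasm_module"], raw_bytes)  # portable fallback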

I don't know who will win the file format war. The next battle is likely to be over GPU support. SpiralDB is making the right moves, but Parquet's ubiquity will be challenging to overcome. I also didn't even discuss how DuckLake seeks to upend Iceberg...

Of course, when this topic comes up, somebody always posts this xkcd comic on competing standards. I've seen it before. You don't need to email it to me again.

Random Happenings

Databases are big money. Let's go through them all!

Acquisitions:

Lots of movement on the block. Pinecone replaced its CEO in September to prepare for an acquisition, but I have not heard anything else about it. Here are the ones that did happen:

  • DataStax → IBM

    The Cassandra stalwart got picked up by IBM at the beginning of the year for an estimated $3b.

  • Quickwit → DataDog

    The company behind Tantivy, a full-text search engine built as a Lucene replacement, was acquired at the beginning of the year. The good news is that Tantivy development continues unabated.

  • SDF → dbt

    This acquisition was a solid pick-up for dbt as part of their Fusion announcement this year. It allows them to perform more rigorous SQL analysis in their DAGs.

  • Voyage.ai → MongoDB

    Mongo picked up an early-stage AI company to expand its RAG capabilities in its cloud offering. One of my best students joined Voyage one week before the announcement. He thought he was going against the "family" by not signing with a database company, only to end up at one.

  • Neon → Databricks

    Apparently, there was a bidding war for this PostgreSQL company, but Databricks paid a mouthwatering $1b for it. Neon still exists today as a standalone service, but Databricks quickly rebranded it in its ecosystem as Lakebase.

  • CrunchyData → Snowflake

    You know Snowflake could not let Databricks get all the excitement during the summer, so they paid $250m for the 13-year-old PostgreSQL company CrunchyData. Crunchy had picked up top ex-Citus talent in recent years and was expanding its DBaaS offering before Snowflake wrote them a check. Snowflake announced the public preview of its Postgres service in December 2025.

  • Informatica → Salesforce

    The 1990s old-school ETL company Informatica got picked up by Salesforce for $8b. This is after they first went public in 1999, were taken private by PE in 2015, and went public again in 2021.

  • Couchbase → Private Equity

    To be honest, I never understood how Couchbase went public in 2021. I guess they were riding on MongoDB's coattails? Couchbase did interesting work a few years ago by incorporating components from the AsterixDB project at UC Irvine.

  • Tecton → Databricks

    Tecton provides Databricks with additional tooling to build agents. Another one of my former students was at the company and is now at Databricks.

  • Tobiko Data → Fivetran

    This team is behind two useful tools: SQLMesh and SQLglot. The former is the only viable open-source contender to dbt (see below for their pending merger with Fivetran). SQLglot is a handy SQL parser/deparser that supports a heuristic-based query optimizer. The combination of this in Fivetran and SDF with dbt makes for an interesting technology play in this space in the coming years.

  • SingleStore → Private Equity

    The PE firm buying SingleStore (Vector Capital) has prior experience in managing a database company. They previously purchased the XML database company MarkLogic in 2020 and flipped it to Progress in 2023.

  • Codership → MariaDB

    After getting bought by PE in 2024, the MariaDB Corporation went on a buying spree this year. First up is the company behind the Galera Cluster scale-out middleware for MariaDB. See my 2023 overview of the MariaDB dumpster fire.

  • SkySQL → MariaDB

    And then we have the second MariaDB acquisition. Just so everyone is clear, the original commercial company backing MariaDB was called "SkySQL Corporation" in 2010, but it changed its name to "MariaDB Corporation" in 2014. Then in 2020, the MariaDB Corporation released a MariaDB DBaaS called SkySQL. But because they were hemorrhaging cash, the MariaDB Corporation spun SkySQL Inc. out as an independent company in 2023. And now, in 2025, MariaDB Corporation has come full circle by buying back SkySQL Inc. I did not have this move on my database bingo card this year.

  • Crystal DBA → Temporal

    The automated database optimization tool company heads off to Temporal to automatically optimize their databases! I'm happy to hear that Crystal's founder and Berkeley database group alumnus Johann Schleier-Smith is doing well there.

  • HeavyDB → Nvidia

    This system (formerly OmniSci, formerly MapD) was one of the first GPU-accelerated databases, launched in 2013. I couldn't find an official announcement of their closing, aside from an M&A firm listing the successful deal. And then we had a meeting with Nvidia to discuss potential database research collaborations, and some HeavyDB friends showed up.

  • DGraph → Istari Digital

    Dgraph was previously acquired by Hypermode in 2023. It looks like Istari just bought Dgraph and not the rest of Hypermode (or they ditched it). I still haven't met anybody who is actively using Dgraph.

  • DataChat → Mews

    This was one of the first "chat with your database" companies, spun out of the University of Wisconsin by now-CMU-DB professor Jignesh Patel. But they were bought by a European hotel management SaaS. Take that to mean what you think it means.

  • Datometry → Snowflake

    Datometry has been working on the perilous problem of automatically converting legacy SQL dialects (e.g., Teradata) to newer OLAP systems for several years. Snowflake picked them up to expand their migration tooling. See Datometry's 2020 CMU-DB tech talk for more info.

  • LibreChat → ClickHouse

    Like Snowflake buying Datometry, ClickHouse's acquisition here is a good example of improving the developer experience for high-performance commodity OLAP engines.

  • Mooncake → Databricks

    After buying Neon, Databricks bought Mooncake to enable PostgreSQL to read/write to Apache Iceberg data. See their November 2025 CMU-DB talk for more info.

  • Confluent → IBM

    This is the archetype of how to make a company out of a grassroots open-source project. Kafka was originally developed at LinkedIn in 2011. Confluent was then spun out as a separate startup in 2014. They went public seven years later in 2021. Then IBM wrote a big check to take it over. Like with DataStax, it remains to be seen whether IBM will do to Confluent what IBM normally does with acquired companies, or whether they will be able to remain autonomous like Red Hat.

  • Kuzu → ???

    The embedded graph DBMS out of the University of Waterloo was acquired by an unnamed company in 2025. The KuzuDB company then announced it was abandoning the open-source project. The LadybugDB project is an attempt at maintaining a fork of the Kuzu code.

Mergers:

Unexpected news dropped in October 2025 when Fivetran and dbt Labs announced they were merging to form a single company.

The last merger I can think of in the database space was the 2019 merger between Cloudera and Hortonworks. But that deal was just weak keys getting stepped on in a kitchen: two companies that were struggling to find market relevance with Hadoop merged into a single company to try to find it (spoiler: they did not). The MariaDB Corporation merger with Angel Pond Holdings Corporation in 2022 via a SPAC technically counts too, but that deal was so MariaDB could backdoor their way to IPO. And it didn't end well for investors. The Fivetran + dbt merger is different (and better) than these two. They are two complementary technology companies combining to become an ETL juggernaut, preparing for a legit IPO in the near future.

Funding:

Unless I missed them or they weren't announced, there were not as many early-stage funding rounds for database startups. The buzz around vector databases has died down, and VCs are only writing checks for LLM companies.

Name Changes:

A new category in my yearly write-up is database companies changing their names.

  • HarperDB → Harper

    The JSON database company dropped the "DB" suffix from its name to emphasize its positioning as a platform for database-backed applications, similar to Convex and Heroku. I like the Harper people. Their 2021 CMU-DB tech talk presented the worst DBMS idea I have ever heard. Thankfully, they ditched that once they realized how bad it was and switched to LMDB.

  • EdgeDB → Gel

    This was a smart move because the name "Edge" conveys that it is a database for edge devices or services (e.g., Fly.io). But I'm not sure "Gel" conveys the project's higher-level goals. See the 2025 talk on Gel's query language (still called EdgeQL) from CMU alums.

  • Timescale → TigerData

    This is a rare occurrence of a database company renaming itself to distinguish itself from its main database product. It is usually companies renaming themselves to be the name of the database (e.g., "Relational Software, Inc." to "Oracle Systems Corporation", "10gen, Inc." to "MongoDB, Inc."). But it makes sense for the company to try to shed the perception of being a specialized time-series DBMS instead of an improved version of PostgreSQL for general applications, since the former is a much smaller market segment than the latter.

Deaths:

In full disclosure, I was a technical advisor for two of these failed startups. My success rate as an advisor is terrible at this point. I was also an advisor for Splice Machine, but they closed shop in 2021. In my defense, I only talk with these companies about technical ideas, not business strategies. And I did tell Fauna they should add SQL support, but they did not take my advice.

  • Fauna

    An interesting distributed DBMS based on Dan Abadi's research on deterministic concurrency control. They provided strongly consistent transactions right when the NoSQL fad was waning, and Spanner made transactions cool again. But they had a proprietary query language and made big bets on GraphQL.

  • PostgresML

    The idea seemed obvious: enable people to run ML/AI operations inside of their PostgreSQL DBMS. The challenge was to convince people to migrate their existing databases to their hosted platform. They were pushing pgCat as a proxy to mirror database traffic. One of the co-founders joined Anthropic. The other co-founder created a new proxy project called pgDog.

  • Derby

    This is one of the first DBMSs written in Java, dating back to 1997 (originally called "Java DB" or "JBMS"). IBM donated it to the Apache Foundation in the 2000s, and it was renamed as Derby. In October 2025, the project announced that the system would enter "read-only mode" because no one was actively maintaining it anymore.

  • Hydra

    Although there is no official announcement for the DuckDB-inside-Postgres startup, the co-founders and employees have scattered to other companies.

  • MyScaleDB

    This was a fork of ClickHouse that added vector search and full-text indexing using Tantivy. They announced they were closing in May 2025.

  • Voltron Data

    This was supposed to be the supergroup of database companies. Think of it like a Run the Jewels-level of heavy hitters. You had top engineers from Nvidia RAPIDS, the inventor of Apache Arrow and Python Pandas, and the Peruvian GPU wizards from BlazingSQL. Then throw in $110m in VC money from top firms that included the future CEO of Intel (who also sits on the Board of Trustees at Carnegie Mellon University). They built a GPU-accelerated database (Theseus), but failed to launch it in a timely manner.

Lastly, although not a business, I would be remiss not to mention the closing of IBM Research Almaden. IBM built this site in 1986, and it was the database research mecca for decades. I interviewed at Almaden in 2013 and found the scenery to be beautiful. The IBM Research Database Group is not what it used to be. Still, the alum list of this hallowed database ground is impressive: Rakesh Agrawal, Donald Chamberlin, Ronald Fagin, Laura Haas, C. Mohan, Pat Selinger, Moshe Vardi, Jennifer Widom, and Guy Lohman.

Andy's Take:

Somebody claimed that I judge the quality of a database based on how much funding the backing company raises for its development. This is obviously not true. I track these happenings because the database research game is crowded and high-energy. Not only am I "competing" against academics at other universities, but big tech companies and small start-ups are also putting out interesting systems I need to follow. The industry research labs are not what they used to be, except for Microsoft Research, which is still aggressively hiring top people and doing incredible work.

I predicted in 2022 that there would be a large number of database company closings in 2025. Yes, there were more closings this year than in previous years, but not at the scale I expected.

The death of Voltron and the sort-of acquihire of HEAVY seem to continue the trend of the non-viability of GPU-accelerated databases. Kinetica has been milking those government contracts for years, and Sqream still appears to be kicking it. These companies are still niche, and nobody has been able to make a significant dent in the dominance of CPU-powered DBMSs. I can't say who or what, but you will hear some major GPU-accelerated database announcements by vendors in 2026. It also provides further evidence of the commoditization of OLAP engines; modern systems have gotten so fast that the performance difference between them is negligible for low-level operations (scans, joins), so the things that differentiate one system from another are user experience and the quality of the query plans their optimizers generate.

The Couchbase and SingleStore acquisitions by private equity (PE) firms might signal a future trend in the database industry. Of course, PE acquisitions have happened before, but they all seem to be in recent times: (1) MarkLogic in 2020, (2) Cloudera in 2021, and (3) MariaDB in 2023. The only ones I can find before 2020 were SolidDB in 2007 and Informatica in 2015. PE acquisitions might replace the trend of plateaued database companies being bought by holding companies that milk the maintenance fees until eternity (Actian, Rocket). Even Oracle is still making money off RDB/VMS after buying them 30 years ago!

Lastly, props to Nikita Shamgunov. As far as I know, he is the only person to have co-founded two database companies (SingleStore and Neon) that were both acquired in a single year. Like when DMX (RIP) released two #1 albums in a single year (It's Dark and Hell Is Hot, Flesh of My Flesh), I don't think anybody is going to break Nikita's record any time soon.

Peak Male Performance

Talk about a banner year for the database OG Larry Ellison. The man turned 81 and accomplished more in one year than most people do in their lifetime. I will cover it all in chronological order.

Larry started the year ranked third-richest in the world. The idea that he would be worth less than Mark Zuckerberg was keeping him up at night. Some were saying Larry's insomnia was due to a diet change after he bought a famous British pub and was eating more pies. But I assure you that Larry's "veg-aquarian" diet has not changed in 30 years. Then, in April 2025, we got the news that Larry had become the second-richest person in the world. He started sleeping a little better, but it still wasn't good enough. There was also still a lot going on in his life that was stressing him out. For example, Larry finally decided to sell his rare, semi-road-legal McLaren F1 supercar, complete with the original owner's manual in the glovebox.

In July 2025, Larry graced us with his third tweet in 13 years (known as "#3" by Larry aficionados such as myself). This was an update about the Ellison Institute of Technology (EIT) that Larry established near the University of Oxford. With the name EIT and its association with Oxford, it sounds like it would be a pure research, non-profit institution, similar to Stanford's SRI or CMU's SEI. But it turns out to be an umbrella organization for a series of for-profit companies owned by a California-based limited liability company. Of course, a bunch of weirdos replied to #3 with promises of blockchain-powered cryogenic freezing or room-temperature superconductors. Larry told me he ignores those. Then there are people like this guy who get it.

The biggest database news of the year (possibly the century) hit us on Wednesday, September 10th, at approximately 3:00pm EST. After waiting for his turn for decades, Larry Joseph Ellison was finally anointed the richest person in the world. $ORCL shares rose by 40% that morning, and since Larry still owns 40% of the company, his estimated total worth hit $393bn. From this perspective, this not only made him the wealthiest person in the world, but also the richest person in the entire history of humanity. The peak net worths, adjusted for inflation, of John D. Rockefeller and Andrew Carnegie (yes, the 'C' in CMU) were only $340bn and $310bn, respectively.

On top of Larry's ascension to the top of the world, Oracle was also involved in the acquisition of the U.S. entity that controls TikTok, and Larry bankrolled the bid by Paramount (run by his son from his fourth marriage) to take over Warner Bros. The U.S. president even chided Larry to take control of CNN's news division since Larry is the majority shareholder of Paramount.

Andy's Take:

I don't even know where to begin. Of course, when I found out that Larry Ellison had become the richest person in the world, all thanks to databases, I was heartened that something positive had finally happened in our lives. I don't care that Oracle's stock was artificially pumped up by splashy deals to build AI data centers instead of its traditional software business. I don't care that he dropped down the rankings after personally losing $130bn in two months. That's like you and me blowing a paycheck on FortuneCoins. It stings a little, and we had to eat rice and beans for two weeks mixed with expired hot sauce packets we took from Taco Bell, but we'll be alright.

Some people claim that Larry is out of touch with ordinary people. Or that he has lost his way because he is involved in things not directly related to databases. They point to things like his Hawaiian robot farm selling lettuce at $24/pound (€41/kg). Or that 81-year-old men don't have naturally blonde hair.

The truth is that Larry Ellison has conquered the enterprise database world, competitive sailing, and techbro wellness spas. The obvious next step is to take over a cable TV channel watched by thousands of people waiting in airports every day. Every time I talk with Larry, he makes it clear that he does not care one bit what people say or think about him. He knows his fans love him. His (new) wife loves him. And in the end, that's all that matters.

Conclusion

Before we close, I want to give some quick shout outs. First is to PT for keeping their database game tight with Turso in lockdown (see you on the outside). Condolences to JT for losing their job for trapping their KevoDB database sidepiece. My Ph.D. students and I also have a new start-up. I hope to say more on that soon. Word is bond.



Manual: Spaces


Space (whitespace) is a whole group of glyphs, one of the most important and frequently used. Any computer user knows space as the widest key on their keyboard; however, the notion itself is much bigger and comprises multiple important typographic terms and ideas.

Space in general is a blank unprinted area, a counterform that separates letters, words, lines etc. In typography, there are several types of spaces: sinkage (space on a page above a textblock), indent (space before the paragraph), leading (vertical space), word spacing, and letter spacing. In this article, we will primarily focus on word spacing, i.e. the space as a glyph.

European languages did not use word spacing for a long time; it was not until the 7th century that word spacing entered Latin script. In the age of metal type, the space was a material, tangible object — a piece of metal that left no print. In the pre-digital era, most text blocks were justified, which required several spaces of different widths. Those types of spacing were defined by the notion of the em (or point size), which is the height of the piece of metal type used for printing a character. For example, one em in a 12-point typeface is 12 points, whereas its en (half-em) space's width is 6 pt, a third space (of an em) equals 4 pt, and so on.

Diagram of a cast metal sort; c is the point size

Figure 1. Whitespace characters in Gauge. Widths and correlations between spaces differ depending on the typeface

These types of spaces still exist in the digital age, but they are mostly used by advanced typographers. Messengers, text editors, and other programs and applications typically use only the regular space.

Word space

Standard space, word space, space per se, is the symbol typed using the widest key on the keyboard.

In metal type, the size of the standard space varied depending on the typographic tradition; in most cases the space was rather wide.

As a standard word space, metal composition used an en space, half the height of the point size, or em-square (in Cyrillic typography), while Latin space was equal to the third of the em space. Living Typography (2012)

In the early digitised fonts one often sees excessively wide spaces; probably, it was an attempt to imitate en space, or three-per-em space, which were used as the main spacing material in metal type. Such a space width can affect the typesetting rhythm and would seem redundant in modern typography.

Wide spacing is both physiologically unnecessary and makes the whole typeset structure reticulate, aesthetically ruining the page’s layout. If for some reason you can’t stick to en space size in this particular line, it’s better to scale down spacing using three-per-em spaces (that equal to the third of an em), or spaces of 3, or even 2 points. M. I. Schelkunov History, Technique, Art of Printing (1926)

Figure 2. Wide word spacing looks odd to the eye of the modern reader, and it is far too visible in text

Today, the word space width is specified by the typeface's designer, and it is one of the defining decisions in designing a typeface, along with spacing: the texture and rhythm of the typeset depend heavily on the word space width.

Many modern typographers are seeking to subject the space width to certain rules. For example, some type designers claim that the space should be equal to the bounding box of the lowercase letter i. However, this rule can't be universal: specifically, it definitely won't work for typefaces where the letter i is of unconventional design and proportions. In super large point sizes, spacing and word spaces are often intentionally reduced, as in such cases even the bounding box of the i can be too wide.

It used to be a rule of thumb for headline settings to leave a space between words that is just wide enough to fit in a lowercase i. For comfortable reading of long lines, the space between words should be much wider. Erik Spiekermann Stop stealing sheep & find out how type works (1993)

Depending on whether your typeface is serif or sans serif, it may or may not make sense to take the glyph's sidebearings into consideration. It can also differ a lot between styles: wide and light weights have more unprinted area than narrow and heavy weights, and this applies to the space width as well.

There is no question but that wordspaces may not be too large, or that the line must appear to be an even, well-balanced whole. What applies to letterspaces also applies to wordspaces: they too are a function of the counters of the individual letters: the smaller these are, the smaller the wordspaces; the larger the counters, the larger the wordspaces. Jost Hochuli Detail in Typography (2008)

Blank space between words should be just enough to ensure that words are visibly separated from each other: if the spacing is wider, there will be holes between words; if it is smaller, it will be difficult to tell one word from another. You can't measure space with a ruler, as everything depends on the specific design or typeface.

Figure 3. Word spaces as set in Kazimir Text. The space width is good: words are separated from one another, the hierarchy of white space is maintained

Figure 4. If you increase word spacing, word spaces would conflict with leading, which makes it hard to navigate through the text

Figure 5. If you decrease the width of word space, it will affect legibility, as the words will blend together

Using double spaces is a habit inherited from the age of typewriters. It is strongly advisable to check a document for double spaces and replace them with single spaces.
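
That cleanup is trivial to automate; a minimal example in Python:

    import re

    def collapse_spaces(text: str) -> str:
        # collapse runs of two or more ordinary spaces into a single space
        return re.sub(r" {2,}", " ", text)

    print(collapse_spaces("Typewriter habits.  Die  hard."))  # -> "Typewriter habits. Die hard."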

Some of the recommendations learned by the educated typist are still now acquired habits wrongly used in digital documents; for instance, the use of three spaces after a period or two after the comma. There was just one space width available in the typewriter, so words and sentences were separated by the same distance. The double space was used to differentiate sentences and improve the readability of the text. María Ramos Silva Type design for typewriters: Olivetti (2015)

Additional spacing after a period is a questionable method in terms of readability. It can be assumed that in the age of typewriters additional space could better separate sentences from one another in a monowidth typeface, yet a monowidth period plus space already forms a larger gap than any space within the sentence. Typesetting tools have improved significantly since the typewriter, and today nobody will typeset in a monowidth typeface unless it is absolutely necessary. So, currently, the use of double spaces is considered mauvais ton, i.e. bad manners, regardless of typeface.

American lawyer Matthew Butterick wrote a book on typography for lawyers, writers, and anyone who works with text. In the US, it is still very common among the older generation to use double spaces, so Matthew dedicated two entire chapters of his Practical Typography to this issue. Butterick tried to convince his audience with imaginary dialogues:

“If you approve of smaller word spaces in some situations, why do you insist on only one space between sentences, where a larger gap might be useful?” Because you’re already getting a larger gap. A sentence-ending word space typically appears next to a period. A period is mostly white space. So visually, the space at the end of a sentence already appears larger than a single word space. No need to add another. Matthew Butterick Butterick’s Practical Typography (2013)

Non-breaking Space

Non-breaking space is a space character that prevents an automatic line break at its position. For instance, in Russian and a number of other Central and Eastern European languages, non-breaking space serves to stick together a preposition and a word next to it, numbers and units of measurements, name and surname, etc.

Non-breaking space is supported by almost any text editing program, graphic design software, or browser, along with a standard space, so one shouldn’t forget to utilise it according to the typesetting rules of any given language.

In Russian, a non-breaking space should connect a dash with the preceding word (except in direct speech), prepositions with the words that follow them, initials with the surname, parts of abbreviations (such as i.e.), the numero sign with its number, and numbers with their units of measurement.

In English it is considered good manners to stick together not prepositions, but pronouns and articles with the following word. However, this rule is often neglected, especially when it comes to newspapers and magazines.
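
Rules like these can be partially automated. The sketch below (in Python) glues a few short words to the word that follows them and binds numbers to their units; the word list and patterns are deliberately tiny and illustrative, not a complete set of typesetting rules for either language.

    import re

    NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
    SHORT_WORDS = "в|на|и|с|к|о|у|a|an|the"  # tiny illustrative list

    def bind_spaces(text: str) -> str:
        # glue short prepositions/articles to the following word
        text = re.sub(rf"\b({SHORT_WORDS}) ", r"\1" + NBSP, text, flags=re.IGNORECASE)
        # crudely bind a number to whatever follows it (e.g., "12 pt", "5 km")
        text = re.sub(r"(\d) (?=[^\d\s])", r"\1" + NBSP, text)
        return text

    print(repr(bind_spaces("The report is set in 12 pt on a 16 pt baseline.")))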

Professional typesetting software has spaces of non-standard widths. In InDesign, all additional spaces — em space, en space, thin space, etc. — are non-breaking.

Additional spaces

Standard space is used everywhere; it is supported by any word, text, or code processing app. Non-breaking space is supported almost anywhere as well. However, computer typesetting still possesses a number of spaces dating back to metal type, allowing for finer adjustment of white space if necessary.

If a font supports additional spaces, those can be fetched via the glyphs palette or the clipboard. Most graphic software does not support these spaces; for example, Adobe Illustrator 2020 includes only four additional spaces: em space, en space, thin space, and hair space.

And there is a reason for that: neither Illustrator nor Photoshop was designed for advanced typesetting and laying out books. However, in InDesign you can easily set any kind of space, and a skilled typographer will use them.
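
For reference, the spaces discussed in this section live at fixed Unicode code points, so they can be inspected or inserted programmatically in any environment that handles Unicode; for example, in Python:

    SPACES = {
        "space":                 "\u0020",
        "no-break space":        "\u00a0",
        "en space":              "\u2002",
        "em space":              "\u2003",
        "three-per-em space":    "\u2004",
        "four-per-em space":     "\u2005",   # the quarter space
        "six-per-em space":      "\u2006",
        "figure space":          "\u2007",
        "punctuation space":     "\u2008",
        "thin space":            "\u2009",
        "hair space":            "\u200a",
        "narrow no-break space": "\u202f",
    }

    for name, ch in SPACES.items():
        print(f"U+{ord(ch):04X}  {name}")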

Em Space

A space whose width is equal to the point size (the em square). In early serif faces, the metal face of the capital M tended to be square — probably the source of the English name. Metal type often used the em space as a paragraph indent.

En Space

Half of the width of an em. Russian-language metal type composition considered it the main type of space, even though in word spacing, especially if the text is aligned to the left or right, it is excessively wide.

Three-per-em Space, Third Space

One third of an em space. Historically considered the main space in Western European typography.

The first obligation of a good typesetter is to achieve a compact line image, something best accomplished by using three-to-em or three-space word spacing. In former times even roman was set much tighter than we do it today; the specimen sheet that contains the original of Garamond’s roman of 1592, printed in 14-point, shows a word spacing in all lines of 2 points only, which is one-seventh of an em! This means that we cannot call three-to-em word spacing particularly tight. Jan Tschichold The Form Of The Book (1975)

Quarter Space

One fourth of an em space. Some authors believe quarter space to be the primary word space.

For a normal text face in a normal text size, a typical value for the word space is a quarter of an em, which can be written M/4. (A quarter of an em is typically about the same as, or slightly more than, the set-width of the letter t.) Robert Bringhurst The Elements of Typographic Style (1992)

Thin Space

⅕ of an em space. The thin space commonly equals about half the standard one, which is why it is used where a standard word space would be too wide. For example, a thin space is often used for spacing a dash, and for spacing initials, both from each other and from the surname:

Figure 6. Standard space in Spectral is too wide to be used for spacing initials and dashes

Figure 7. Thin spaces look neater, better connecting initials with a surname and the two parts of a sentence with each other

French typographic tradition prescribes the use of either thin or hair spaces to space any two-part symbols: exclamation mark, question mark, semicolon, etc.

Regardless of the language, such glyphs as question mark and exclamation mark typically are very visible in lowercase, but they can get lost in an all-caps typeset — in this case, one should finely space them.

Sixth Space

The sixth space is used when the thin space is too large.

Hair Space

The narrowest of spaces. In metal type it was equal to 1/10 of an em space; in the digital age it is mostly 1/24 of an em. It might be useful if a certain typeface's punctuation marks have too-tight sidebearings, but a thin space would be too wide. For example, you can use a hair space to space dashes instead of a thin one — everything depends on the sidebearings and the design of the particular typeface.

You should keep in mind that after you change the font, the selected space glyphs will remain, but their widths can change — and this will affect the texture.

Isn’t it ridiculous when a punctuation mark, relating to the entire preceding phrase, is tied to one last word of the said phrase? And, vice versa, how unfortunately it looks when there is a large gap between this mark and the previous word. As a matter of fact, it is about time type foundry workers started thinking about it and cast the punctuation mark with an extra sidebearing on its left. However, typefounders are not always, or rather rarely, that forethoughtful, and also they are used to cast all letters without generous sidebearings. During punching of matrices, the beauty of spacing punctuation marks is also barely remembered. Therefore, it is your burden and responsibility to fix this and even more it is the one of compositors. These latter dislike 1-pt spaces, however it is this very thin space that can save the typeset beauty in these situations. That is why, with punctuation marks , ;. … : ! ? you should insist on putting 1-pt (hair) space before those symbols — but only when those don’t have an extra sidebearing on their left. If you are in charge of purchasing a typeface for the printing establishment, regard this issue when ordering typefaces, make the foundry give consideration to the beauty of their work and this particular detail. M. I. Schelkunov History, Technique, Art of Printing (1926)

Spacing in justified texts

Full justification — that is, alignment of the text to both of its margins — is still commonly used in books and magazines. When the text is justified, the width of word spaces is not constant; it changes to distribute the words across the entire width of the line. In this situation, the uniformity of spacing can be even more important than the actual width of these spaces: uniformly large spaces across the entire page are better than large spaces in only one line. That is why, no matter how well optimised the typeface's word space width is, it will not be enough for typesetting a justified text. While in metal type all spaces were set manually, and a typesetter knew what space they should add for even typesetting, nowadays it is a computer that defines the length of spaces in justified texts. The algorithm divides the remaining space into equal parts and adds them to the regular spaces. In doing so, the algorithm ignores letters, syntax, and punctuation, which is why, when typesetting justified texts, one should always double-check and adjust spacing manually.

In InDesign, it is possible to set the minimum and maximum word space width for fully justified typesetting: the width of the standard space is used as the 100% baseline, the maximum is normally about 120%, and the minimum about 80%.

If the text is justified, a reasonable minimum word space is a fifth of an em (M/5), and M/4 is a good average to aim for. A reasonable maximum in justified text is M/2. If it can be held to M/3, so much the better. But for loosely fitted faces, or text set in a small size, M/3 is often a better average to aim for, and a better minimum is M/4. In a line of widely letterspaced capitals, a word space of M/2 or more may be required. Robert Bringhurst The Elements of Typographic Style (1992)
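
As a quick worked example of those ranges, here is what Bringhurst's em fractions and typical InDesign-style percentages come to in points for a 12 pt setting (the 3 pt design space is an assumption for illustration; real values depend on the typeface):

    POINT_SIZE = 12.0   # 1 em = 12 pt at this size

    bringhurst = {
        "minimum (M/5)": POINT_SIZE / 5,   # 2.4 pt
        "average (M/4)": POINT_SIZE / 4,   # 3.0 pt
        "maximum (M/2)": POINT_SIZE / 2,   # 6.0 pt
    }

    design_space = 3.0  # suppose the typeface's own word space is 3 pt
    indesign = {
        "minimum (80%)":  design_space * 0.8,   # 2.4 pt
        "desired (100%)": design_space * 1.0,   # 3.0 pt
        "maximum (120%)": design_space * 1.2,   # 3.6 pt
    }

    for label, width in {**bringhurst, **indesign}.items():
        print(f"{label:15s} {width:.1f} pt")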

Robert Bringhurst recommends choosing appropriate spaces based on the em. However, the space is a relative value, so in justified texts you should consider not the width of some abstract em, but rather the width of the space in the particular font.

The optimal word space width in justified texts is not a fixed value; it changes depending on the typeface, point size, line width, line spacing, and many other factors. That is why in InDesign you can't set maximum and minimum values once and for all cases — you will have to choose the best possible options manually.

In justified texts, the standard word space width becomes a fluctuating value. Fixed-width spaces — that is, all the additional spaces with constant width — can help better control the setting.

The more even are the gaps between words, the better <…>. In no case shall you allow a considerable disparity in space widths, while an insignificant difference won’t ruin the beauty of typesetting. Pyotr Kolomnin A Concise Account of Typography (1899)

Figure Space

Figure space, or numeric space, is used for typesetting tables and sheets. If a typeface is fitted with tabular figures, its figure space will be equal to the width of tabular figures. Figure space is a non-breaking one.

Figure 8. Normally, the figure space is significantly wider than the standard space; it is helpful when you need to align a large set of multi-digit numbers

Punctuation Space

In most cases, the width of this space is equal to the glyph width of a period or a colon. It may be useful for setting numbers in tables where digits are separated by a spacing element instead of a period or colon.

Narrow No-break Space

A thin space that prevents an automatic line break. The name of this symbol in Unicode causes additional confusion: Narrow in this case is the same thing as Thin, and Narrow Space has the same width as Thin Space does.

In some applications, such as InDesign, the regular thin space is non-breaking by default and is simply called Thin Space. In other environments it is a separate character; the Web, for example, uses Narrow No-break Space.

Spaces in layout

The distribution of white space in text setting is a highly important factor, responsible for a neat design and a clear structure of the content. Many designers keep in mind the correlation between point size, line width, and margins, but some tend to forget that word spacing is an equally important factor in these relations.

A body text font, designed for smaller sizes, requires tighter spacing and word spaces if used to set a large headline. Point size matters more in determining spacing, and unprinted white area in general, than whether the typeface is a text or a display one.

It is also necessary to consider spacing when you’re dealing with particular elements of the text. For instance, small-caps or all-caps fragments quite often should be additionally spaced. Manual spacing is sometimes necessary in bold or italic styles, or even if no additional styles are applied at all.

Small-caps spacing in Charter is too tight by default; more white space is needed

In William, the small caps are well taken care of: the generous spacing doesn’t require additional adjustment

A text set in a quality typeface sometimes needs manual adjustment: the standard word space in Guyot is clearly not enough for the ‘of i’ combination

White spaces in software

Most typically, non-professional software and web services offer only the standard and the non-breaking space. You can usually insert the additional characters via the clipboard anywhere Unicode is supported. That said, you have to check every time: at the time of writing, for example, Facebook allows inserting additional space characters in its input field but automatically replaces them when posting.

On the Web, the additional spaces are available as HTML character references: they clutter the source code a bit, but they let you control the placement of each non-standard space. Note that different browsers may render the spacing differently, and not so long ago some of them even ignored the additional spaces, replacing them with regular ones. Check that the additional spaces display correctly wherever you use them.

Two industry standards for text formatting and typesetting, InDesign and QuarkXPress, support all kinds of spaces. Today, type designers usually include at least thin and hair spaces. Their width may vary from one typeface to another, but the typographer at least has more control over the word spacing.

In InDesign, an additional space that is not included in the typeface will still be rendered, but its width is defined by the software with no regard to the typeface itself. For example, a hair space at 24 pt will be 1 pt — both in a display face with tight spacing and in a text face with loose spacing.

Spaces calculated this way are not always suitable for your task. Depending on the typeface, the additional space width suggested by InDesign can be insufficient or excessive. And if you export text with such spaces from InDesign to Figma, their width will most likely change — every application may have its own algorithm for calculating these values.

Be vigilant and trust your eye: it is not mathematical values that matter, but a convincing, reasonable relationship between the black and the white.

These dashes are spaced with hair spaces provided by the typeface

The typefaces above have no hair space, therefore its width is set automatically

With the x-height and spacing of Arno Pro and RIA Text, InDesign’s hair space is good enough, whereas in IBM Plex we should perhaps use a thin space instead of a hair space

Whitespace characters are among the most important typographic elements. Alongside sidebearings, they define the rhythm of the text and organise blocks of information. Disregard for white space can ruin the relations between its different kinds: line spacing and word spacing, word spacing and column gap. In such a case the reader cannot easily track the line and has to put in additional effort. Unless this is your intended goal, you should always consider how the different sorts of white space work with each other.

Summary table

Non-breaking space
macOS: Alt + Space
Windows: Alt + 0160
Unicode: U+00A0
HTML: &nbsp;
InDesign: Type → Insert White Space → Nonbreaking Space, or Alt + Cmd + X
If you need a space of non-changing width in a justified layout: Type → Insert White Space → Nonbreaking Space (Fixed Width)

Thin space
Unicode: U+2009
HTML: &ThinSpace;
InDesign: Type → Insert White Space → Thin Space, or Shift + Alt + Cmd + M

Thin non-breaking space (for the Web)
Unicode: U+202F
HTML: &#8239;

Em space
Unicode: U+2003
HTML: &emsp;
InDesign: Type → Insert White Space → Em Space

En space
Unicode: U+2002
HTML: &ensp;
InDesign: Type → Insert White Space → En Space

Third space
Unicode: U+2004
HTML: &emsp13;
InDesign: Type → Insert White Space → Third Space

Quarter space
Unicode: U+2005
HTML: &emsp14;
InDesign: Type → Insert White Space → Quarter Space

Sixth space
Unicode: U+2006
HTML: &#8198;
InDesign: Type → Insert White Space → Sixth Space

Hair space
Unicode: U+200A
HTML: &hairsp;
InDesign: Type → Insert White Space → Hair Space

Figure space
Unicode: U+2007
HTML: &numsp;
InDesign: Type → Insert White Space → Figure Space

Punctuation space
Unicode: U+2008
HTML: &puncsp;
InDesign: Type → Insert White Space → Punctuation Space

References

In English

Kirill Belyayev, Whitespaces and zero width characters with buttons for copying to clipboard, short mnemonics and usage comments
Robert Bringhurst, The Elements of Typographic Style
Matthew Butterick, Butterick’s Practical Typography
Jost Hochuli, Detail in Typography
Yves Peters, Adventures in Space (fontshop.com)
María Ramos Silva, Type design for typewriters: Olivetti
Erik Spiekermann, Stop stealing sheep & find out how type works
Jan Tschichold, The Form Of The Book
Marcin Wichary, Space Yourself (smashingmagazine.com)

In Russian

Pyotr Kolomnin, A Concise Account of Typography
Alexandra Korolkova, Living Typography
M. I. Schelkunov, History, Technique, Art of Printing
Alexei Yozhikov, (Nearly) Everything You Need To Know About Whitespace (habr.com)


The Greatest

And more job satisfaction.

The Optimist

And more optimism.

Building a Simple Search Engine That Actually Works

Why Build Your Own?

Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.

Sometimes you just want something that:

  • Works with your existing database
  • Doesn't require external services
  • Is easy to understand and debug
  • Actually finds relevant results

That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.


The Core Idea

The concept is simple: tokenize everything, store it, then match tokens when searching.

Here's how it works:

  1. Indexing: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
  2. Searching: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
  3. Scoring: We use the stored weights to calculate relevance scores

The magic is in the tokenization and weighting. Let me show you what I mean.


Building Block 1: The Database Schema

We need two simple tables: index_tokens and index_entries.

index_tokens

This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.

// index_tokens table structure
id | name    | weight
---|---------|-------
1  | parser  | 20     // From WordTokenizer
2  | parser  | 5      // From PrefixTokenizer
3  | parser  | 1      // From NGramsTokenizer
4  | parser  | 10     // From SingularTokenizer

Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.

The unique constraint is on (name, weight), so the same token name can exist multiple times with different weights.

index_entries

This table links tokens to documents with field-specific weights.

// index_entries table structure
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 2000
2  | 2        | 1             | 1        | 42          | 500

The weight here is the final calculated weight: field_weight × tokenizer_weight × ceil(sqrt(token_length)). This encodes everything we need for scoring. We will talk about scoring later in the post.

We add indexes on:

  • (document_type, document_id) - for fast document lookups
  • token_id - for fast token lookups
  • (document_type, field_id) - for field-specific queries
  • weight - for filtering by weight

Why this structure? Simple, efficient, and leverages what databases do best.
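
For reference, here is a minimal sketch of what the two tables might look like. The post doesn't show the actual DDL, so this is an assumption on my part: it presumes MySQL and a Doctrine DBAL connection, and the column types, lengths, and index names are illustrative.

// Hypothetical DDL, executed through a Doctrine DBAL connection.
$connection->executeStatement(<<<'SQL'
CREATE TABLE index_tokens (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(191) NOT NULL,
    weight INT NOT NULL,
    UNIQUE KEY uniq_name_weight (name, weight)
)
SQL);

$connection->executeStatement(<<<'SQL'
CREATE TABLE index_entries (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    token_id      INT UNSIGNED NOT NULL,
    document_type INT NOT NULL,
    field_id      INT NOT NULL,
    document_id   INT NOT NULL,
    weight        INT NOT NULL,
    KEY idx_entries_doc    (document_type, document_id),
    KEY idx_entries_token  (token_id),
    KEY idx_entries_field  (document_type, field_id),
    KEY idx_entries_weight (weight)
)
SQL);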


Building Block 2: Tokenization

What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like ["parser"], ["par", "pars", "parse", "parser"], or ["par", "ars", "rse", "ser"] depending on which tokenizer we use.

Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.

All tokenizers implement a simple interface:

interface TokenizerInterface
{
    public function tokenize(string $text): array;  // Returns array of Token objects
    public function getWeight(): int;               // Returns tokenizer weight
}

Simple contract, easy to extend.
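
The Token value object that tokenizers return isn't shown in the post; a minimal sketch could look like this (the promoted readonly properties are my assumption, mirroring the SearchResult class that appears later):

// Minimal sketch; the post's actual Token class may differ.
class Token
{
    public function __construct(
        public readonly string $value,  // the token text, e.g. "parser"
        public readonly int $weight     // the weight of the tokenizer that produced it
    ) {}
}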

Word Tokenizer

This one is straightforward—it splits text into individual words. "parser" becomes just ["parser"]. Simple, but powerful for exact matches.

First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:

class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Normalize: lowercase, remove special chars
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);

Next, we split into words and filter out short ones:

        // Split into words, filter short ones
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);

Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.

Finally, we return unique words as Token objects:

        // Return as Token objects with weight
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}

Weight: 20 (high priority for exact matches)
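
A quick usage sketch (the zero-argument constructor is an assumption, since the post omits the tokenizer's constructor):

// Hypothetical usage of the word tokenizer.
$tokens = (new WordTokenizer())->tokenize('The Parser parses text!');
// Produces Token objects for: "the", "parser", "parses", "text"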

Prefix Tokenizer

This generates word prefixes. "parser" becomes ["pars", "parse", "parser"] (with min length 4). This helps with partial matches and autocomplete-like behavior.

First, we extract words (same normalization as WordTokenizer):

class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}
    
    public function tokenize(string $text): array
    {
        // Normalize same as WordTokenizer
        $words = $this->extractWords($text);

Then, for each word, we generate prefixes from the minimum length to the full word:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Generate prefixes from min length to full word
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // Use associative array for uniqueness
            }
        }

Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.

Finally, we convert the keys to Token objects:

        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 5 (medium priority)

Why min length? Avoid too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.

N-Grams Tokenizer

This creates character sequences of a fixed length (I use 3). "parser" becomes ["par", "ars", "rse", "ser"]. This catches typos and partial word matches.

First, we extract words:

class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}
    
    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);

Then, for each word, we slide a window of fixed length across it:

        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);
            // Sliding window of fixed length
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }

The sliding window: for "parser" with length 3, we get:

  • Position 0: "par"
  • Position 1: "ars"
  • Position 2: "rse"
  • Position 3: "ser"

Why does this work? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".

Finally, we convert to Token objects:

        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}

Weight: 1 (low priority, but catches edge cases)

Why 3? Balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.

Normalization

All tokenizers do the same normalization:

  • Lowercase everything
  • Remove special characters (keep only alphanumerical)
  • Normalize whitespace (multiple spaces to single space)

This ensures consistent matching regardless of input format.
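
The extractWords() helper used by PrefixTokenizer and NGramsTokenizer isn't shown either; a minimal sketch that simply mirrors the WordTokenizer normalization might look like this:

// Sketch of the helper the post assumes but doesn't show.
private function extractWords(string $text): array
{
    // Same normalization as WordTokenizer: lowercase, strip special chars, collapse whitespace.
    $text = mb_strtolower(trim($text));
    $text = preg_replace('/[^a-z0-9]/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);

    // Split into words and drop single-character ones.
    $words = explode(' ', $text);

    return array_values(array_filter($words, fn($w) => mb_strlen($w) >= 2));
}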


Building Block 3: The Weight System

We have three levels of weights working together:

  1. Field weights: Title vs content vs keywords
  2. Tokenizer weights: Word vs prefix vs n-gram (stored in index_tokens)
  3. Document weights: Stored in index_entries (calculated: field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Final Weight Calculation

When indexing, we calculate the final weight like this:

$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));

For example:

  • Title field: weight 10
  • Word tokenizer: weight 20
  • Token "parser": length 6
  • Final weight: 10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600

Why use ceil(sqrt())? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use ceil() to round up to the nearest integer, keeping weights as whole numbers.

Tuning Weights

You can adjust weights for your use case:

  • Increase field weights for titles if titles are most important
  • Increase tokenizer weights for exact matches if you want to prioritize exact matches
  • Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less

You can see exactly how weights are calculated and adjust them as needed.
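
As a sketch of where those knobs live, a hypothetical helper (not part of the post's code) could isolate the length function so it is easy to swap:

// Hypothetical helper: field and tokenizer weights come in as arguments,
// and the token-length factor is the part you would swap out when tuning.
function calculateFinalWeight(int $fieldWeight, int $tokenizerWeight, string $token): int
{
    $lengthFactor = ceil(sqrt(mb_strlen($token))); // try log() or a linear factor instead
    return (int) ($fieldWeight * $tokenizerWeight * $lengthFactor);
}

// calculateFinalWeight(10, 20, 'parser') === 600, matching the example above.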


Building Block 4: The Indexing Service

The indexing service takes a document and stores all its tokens in the database.

The Interface

Documents that can be indexed implement IndexableDocumentInterface:

interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}

To make a document searchable, you implement these three methods:

class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }
    
    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }
    
    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);
        
        // Add keywords if present
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }
        
        return $fields;
    }
}

Three methods to implement:

  • getDocumentType(): returns the document type enum
  • getDocumentId(): returns the document ID
  • getIndexableFields(): builds fields with weights using fluent API

You can index documents (a minimal wiring sketch follows this list):

  • On create/update (via event listeners)
  • Via commands: app:index-document, app:reindex-documents
  • Via cron (for batch reindexing)
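
Here is that wiring sketch. It is hypothetical glue code, not from the post: $entityManager and $searchIndexingService stand in for whatever your framework's container provides, and in practice the call would live in an event listener or in the app:index-document command.

// Hypothetical wiring: index a post right after it has been persisted.
$entityManager->persist($post);
$entityManager->flush();

// Keep the search index in sync with the freshly saved document.
$searchIndexingService->indexDocument($post);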

How It Works

Here's the indexing process, step by step.

First, we get the document information:

class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. Get document info
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();

The document provides its fields and weights via the IndexableFields builder.

Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:

        // 2. Remove existing index for this document
        $this->removeDocumentIndex($documentType, $documentId);
        
        // 3. Prepare batch insert data
        $insertData = [];

Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.

Now, we process each field. For each field, we run all tokenizers:

        // 4. Process each field
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }
            
            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;
            
            // 5. Run all tokenizers on this field
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);

For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:

                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;
                    
                    // 6. Find or create token in index_tokens
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);
                    
                    // 7. Calculate final weight
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));
                    
                    // 8. Add to batch insert
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }

Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.

Finally, we batch insert everything:

        // 9. Batch insert for performance
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }
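
The batchInsertSearchDocuments() method isn't shown in the post. A minimal sketch, assuming Doctrine DBAL and MySQL-style multi-row inserts (for very large documents you would likely chunk the rows), might be:

    // Sketch; the post's actual implementation is not shown.
    private function batchInsertSearchDocuments(array $insertData): void
    {
        // One multi-row INSERT instead of one statement per token.
        $placeholders = [];
        $params = [];

        foreach ($insertData as $row) {
            $placeholders[] = '(?, ?, ?, ?, ?)';
            $params[] = $row['token_id'];
            $params[] = $row['document_type'];
            $params[] = $row['field_id'];
            $params[] = $row['document_id'];
            $params[] = $row['weight'];
        }

        $sql = 'INSERT INTO index_entries (token_id, document_type, field_id, document_id, weight) VALUES '
            . implode(', ', $placeholders);

        $this->connection->executeStatement($sql, $params);
    }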

The findOrCreateToken method is straightforward:

    private function findOrCreateToken(string $name, int $weight): int
    {
        // Try to find existing token with same name and weight
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();
        
        if ($result) {
            return (int) $result['id'];
        }
        
        // Create new token
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);
        
        return (int) $this->connection->lastInsertId();
    }
}

Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.

The key points:

  • We remove old index first (handles updates)
  • We batch insert for performance (one query instead of many)
  • We find or create tokens (avoids duplicates)
  • We calculate final weight on the fly

Building Block 5: The Search Service

The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.

How It Works

Here's the search process, step by step.

First, we tokenize the query using all tokenizers:

class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. Tokenize query using all tokenizers
        $queryTokens = $this->tokenizeQuery($query);
        
        if (empty($queryTokens)) {
            return [];
        }

If the query produces no tokens (e.g., only special characters), we return empty results.

Why Tokenize the Query Using the Same Tokenizers?

Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.

Example:

  • Indexing with PrefixTokenizer creates tokens: "pars", "parse", "parser"
  • Searching with only WordTokenizer creates token: "parser"
  • We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
  • Result: Incomplete matches, missing relevant documents!

The solution: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.

This is why the SearchService and SearchIndexingService both receive the same set of tokenizers.
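
The tokenizeQuery() helper itself isn't shown in the post; a minimal sketch that just runs every tokenizer over the query and merges the results might be:

    // Sketch of the helper used above; the real method may differ.
    private function tokenizeQuery(string $query): array
    {
        $tokens = [];

        // Same tokenizer set as indexing, so query tokens line up with stored tokens.
        foreach ($this->tokenizers as $tokenizer) {
            $tokens = array_merge($tokens, $tokenizer->tokenize($query));
        }

        return $tokens;
    }

Duplicate token values are removed in the next step of search(), so the helper doesn't need to deduplicate.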

Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:

        // 2. Extract unique token values
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));

Why extract values? We search by token name, not by weight. We need the unique token names to search for.

Then, we sort tokens by length (longest first). This prioritizes specific matches:

        // 3. Sort tokens (longest first - prioritize specific matches)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));

Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.

We also limit the token count to prevent DoS attacks with huge queries:

        // 4. Limit token count (prevent DoS with huge queries)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }

Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).

Now, we execute the optimized SQL query. The executeSearch() method builds the SQL query and executes it:

        // 5. Execute optimized SQL query, passing the token count and a
        //    minimum token weight of 10 (see the fallback note below)
        $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

Inside executeSearch(), we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:

private function executeSearch(DocumentType $documentType, array $tokenValues, int $tokenCount, ?int $limit, int $minTokenWeight): array
{
    // Build parameter placeholders for token values
    $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));
    
    // Build the SQL query (shown in full in "The SQL Query" section below)
    $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";
    
    // Build parameters array
    $params = [
        $documentType->value,  // document_type
        ...$tokenValues,       // token values for IN clause
        $documentType->value,  // for subquery
        ...$tokenValues,       // token values for subquery
        $minTokenWeight,      // minimum token weight
        // ... more parameters
    ];
    
    // Execute query with parameter binding
    $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();
    
    // Filter out results with low normalized scores (below threshold)
    $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);
    
    // Convert to SearchResult objects
    return array_map(
        fn($result) => new SearchResult(
            documentId: (int) $result['document_id'],
            score: (float) $result['score']
        ),
        $results
    );
}

The SQL query does the heavy lifting: finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.

The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).

We'll see the full query in the next section.

The main search() method then returns the results:

        // 6. Return results
        return $results;
    }
}

The Scoring Algorithm

The scoring algorithm balances multiple factors. Let's break it down step by step.

The base score is the sum of all matched token weights:

SELECT 
    sd.document_id,
    SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
GROUP BY sd.document_id

  • sd.weight: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))

Why not multiply by st.weight? The tokenizer weight is already included in sd.weight during indexing. The st.weight from index_tokens is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).

This gives us the raw score. But we need more than that.

We add a token diversity boost. Documents matching more unique tokens score higher:

(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score

Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.

We also add an average weight quality boost. Documents with higher quality matches score higher:

(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score

Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.

We apply a document length penalty. Prevents long documents from dominating:

base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))

Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.

Finally, we normalize by dividing by the maximum score:

score / GREATEST(1.0, max_score) as normalized_score

This gives us a 0-1 range, making scores comparable across different queries. (The max_score value used here isn't defined in the snippets shown; presumably it comes from an additional subquery or window function over the same matching set.)

The full formula looks like this:

SELECT 
    sd.document_id,
    (
        SUM(sd.weight) *                                  -- Base score
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * -- Token diversity boost
        (1.0 + LOG(1.0 + AVG(sd.weight))) /              -- Average weight quality boost
        (1.0 + LOG(1.0 + doc_token_count.token_count))   -- Document length penalty
    ) / GREATEST(1.0, max_score) as score                -- Normalization
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE 
    sd.document_type = ?
    AND st.name IN (?, ?, ?)  -- Query tokens
    AND sd.document_id IN (
        SELECT DISTINCT document_id
        FROM index_entries sd2
        INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
        WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ?  -- Ensure at least one token with meaningful weight
    )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?

Why the subquery with st2.weight >= ?? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.

Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.

If there are no results with a minimum token weight of 10, we retry with a minimum of 1 (a fallback for edge cases).
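
In code, that fallback could be as simple as the following sketch (the thresholds 10 and 1 come from the tokenizer weights above; the exact constants in the real implementation may differ):

// Sketch of the fallback described above.
// Try meaningful matches first (tokens with weight >= 10)...
$results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

// ...and only fall back to weight 1 so that n-gram-only matches can surface.
if (empty($results)) {
    $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 1);
}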

Converting IDs to Documents

The search service returns SearchResult objects with document IDs and scores:

class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}

But we need actual documents, not just IDs. We convert them using repositories:

// Perform search
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// Get document IDs from search results (preserving order)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// Get documents by IDs (preserving order from search results)
$documents = $this->documentRepository->findByIds($documentIds);

Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.

The repository method handles the conversion:

public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }
    
    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)')  // Preserve order from IDs array
        ->getQuery()
        ->getResult();
}

The FIELD() function preserves the order from the IDs array, so documents appear in the same order as the search results. (Note that FIELD() is MySQL-specific; in Doctrine DQL it typically requires registering a custom DQL function.)
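
If FIELD() isn't available in your setup, a simple alternative is to reorder in PHP after fetching; this sketch assumes the documents expose a getId() method:

// Alternative to FIELD(): reorder the fetched documents in PHP to match
// the relevance order in $documentIds (getId() is an assumption).
$byId = [];
foreach ($documents as $document) {
    $byId[$document->getId()] = $document;
}

$ordered = [];
foreach ($documentIds as $id) {
    if (isset($byId[$id])) {
        $ordered[] = $byId[$id];
    }
}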


The Result: What You Get

What you get is a search engine that:

  • Finds relevant results quickly (leverages database indexes)
  • Handles typos (n-grams catch partial matches)
  • Handles partial words (prefix tokenizer)
  • Prioritizes exact matches (word tokenizer has highest weight)
  • Works with existing database (no external services)
  • Easy to understand and debug (everything is transparent)
  • Full control over behavior (adjust weights, add tokenizers, modify scoring)

Extending the System

Want to add a new tokenizer? Implement TokenizerInterface:

class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Your stemming logic here
        // Return array of Token objects
    }
    
    public function getWeight(): int
    {
        return 15; // Your weight
    }
}

Register it in your services configuration, and it's automatically used for both indexing and searching.

Want to add a new document type? Implement IndexableDocumentInterface, with the same three methods as in the Post example above (only getIndexableFields() is shown here):

class Comment implements IndexableDocumentInterface
{
    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}

Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.


Conclusion

So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.

The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.

You own it, you control it, you can debug it. And that's worth a lot.
