
A 2-ply minimax chess engine in 84,688 regular expressions


Can chess, with regular expressions? Yes. Can chess, with regular expressions.

Over the holidays I decided it's been too long since I did something with entirely no purpose. So without further ado, I present to you ... Regex Chess: a sequence of 84,688 regular expressions that, when executed in order, will play a (valid; not entirely terrible) move given a chess board as input. Here, I'll show you.

[Interactive demo in the original post: play against the engine on an 8x8 board. Move notation: [src][dest] (e.g. e2e4), or 'q' to quit. Castling Rights: KQkq, En Passant: -. The currently executing regular expression is displayed as the engine runs.]

Specifically, this is the entirety of the program that is playing a move against you (no really, I'm not kidding, it really is this short):

let regex_list = [/* a very long list of regular expressions */]
let board = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1";
for (regex of regex_list) {
    board = board.replace(regex.pattern, regex.target)
}
display(board)

By the end of this post you'll (hopefully) understand why this sequence of regular expressions is possible, and also what the specific regular expressions do.

Now some snobby people when they see this are going to say something like "yoU SaiD yoU'Re GoIng to uSe rEGUlar EXPresSIONs buT thESE ArE nOT ReGULaR !!" I do not care.

(If you're someone who subscribed to this blog in the last six months or so, and have grown accustomed to me writing about "serious" and "important" things, please treat this as your fair warning that this is MY WEBSITE and I MAKE THE RULES HERE and so today you're learning about RegexChess whether you like it or not.)

As always, code for this project is available on GitHub.


Getting Started: A Regular Expression CPU

So how are we going to get regular expressions to play chess? Well, by making a regular expression computer, of course! More specifically, we're going to design a Branch-Free, Conditional-Execution, Single-Instruction Multiple-Data instruction set. And then make a sequence of regular expressions that interpret these instructions. (Kind of like a GPU instruction set. And a bit of ARM. But a lot slower.) And then from there we can program our new computer to play chess. So let's get started.

(Some people may say I have an unhealthy obsession with building weird computers, c.f. my game of life computer or my printf computer. Those people are wrong, and just have an unhealthy obsession with the banal and ordinary.)


Computer Design

Let me get started by explaining how I'm going to organize the data that the computer is going to operate over. Because we're working with regular expressions, the current state of the computer is going to be represented as a single string containing both the program "stack", and all variables in the following format:

%%
#stack:
top item on stack
second item on stack
....
#variable1: value 1
#variable2: value 2
...
#variablek: value k

Each instruction will either manipulate the values on the stack, or will read or write a given variable. Let's look at some basic instructions.

Basic Stack Operations

The Push Instruction

Here's the implementation of the push command that adds a value to the top of the stack:

def push(const):
    return [(r"(%%\n#stack:\n)",
             r"\g<1>"+const+r"\n")]

You should read the return type of these functions as a list of tuples. Each tuple represents a regex transformation to apply, where the left element is the pattern to match, and the right element is the replacement.

As a brief regular expression refresher: each tuple in this list has two parts, the regular expression and the replacement. A regular expression will match a string if it can find a substring of whatever it's being applied against (the state, in our case). Most characters in a regular expression match themselves, but parentheses create a "match group" that can be referenced later.

The second argument is the replacement string. Again, most characters mean "Replace with this character", but special sequences like \g<1> are back-references that refer to previously captured groups. In this case, \g<1> references the first captured group (anything matched within the first set of parentheses)---which in this case is the "%%\n#stack:\n" header.

So, as a result of this operation executing on the stack, we find the occurrence of "%%\n#stack:\n" in the state, and insert the constant value below that (to the top of the stack).

Let's see this in practice. Say we start with an empty stack:

%%
#stack:

If we execute push("hello"), the regular expression will:

  • Match the pattern %%\n#stack:\n at the start of our state
  • Capture this header into group 1 (the parentheses in the pattern create this capture group)
  • Replace it with the captured group (\g<1>) followed by our constant "hello" and a newline

This gives us:

%%
#stack:
hello

If we then execute push("world"), the same process repeats and we get:

%%
#stack:
world
hello

The regular expression always matches at the top of the stack area, so new items get pushed on top while preserving the existing stack contents below them.
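To make these walkthroughs concrete, here's a tiny Python harness (my addition, not part of the original program) that executes an instruction list with re.sub, mirroring the JavaScript driver loop from the top of the post:

import re

def run(instructions, state):
    # Apply each (pattern, replacement) pair in order, exactly like the
    # JavaScript loop shown earlier.
    for pattern, replacement in instructions:
        state = re.sub(pattern, replacement, state)
    return state

state = "%%\n#stack:\n"
state = run(push("hello"), state)
state = run(push("world"), state)
print(state)
# %%
# #stack:
# world
# hello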

The Pop Instruction

The pop instruction removes the top element from the stack:

def pop():
    return [(r"(%%\n#stack:\n)([^\n]*)\n",
             r"\1")]

Here we start to see a few of the special operators that make regular expressions powerful. The [^\n] block means "match any character that isn't a newline" and the * means "match zero or more of these." So taken together we're looking for a line that starts with "%%\n#stack:\n", and then on the next line, zero or more characters that aren't a newline (so, the entire line). The replacement is just the first line, which has the effect of removing the second line, popping the top of the stack.

Let's see how this works in practice. Say we start with this stack:

%%
#stack:
world
hello

When we execute pop(), the regular expression will:

  • Match the pattern beginning with %%\n#stack:\n (captured in group 1)
  • Match any characters until the next newline (captured in group 2 - the "world")
  • Replace everything matched with just group 1 (the header), effectively removing the top element

This gives us:

%%
#stack:
hello

Each pop operation removes exactly one element from the top of the stack, preserving any remaining elements below it.
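Continuing the Python sketch from above:

state = "%%\n#stack:\nworld\nhello\n"
state = run(pop(), state)
print(repr(state))   # '%%\n#stack:\nhello\n'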

Variable <-> Stack Instructions

Variable Lookup

To load the contents of a variable onto the top of the stack:

def lookup(variable):
    # Find the variable's value and push it onto the stack
    return [(r"(%%\n#stack:)([^%]*\n#"+variable+": )([^\n]*)\n",
             r"\1\n\3\2\3\n")]

This regular expression is a bit more complex than our previous ones. Let's break down what it does:

  • The [^%]* matches basically any character (% only appears at the start of the program) and so lets us find any variable anywhere in the program.
  • The [^\n]* matches the variable's value by capturing everything until the end of the line
  • The replacement creates a copy of the value and places it on top of the stack

Let's see how this works in practice. Say we start with this state:

%%
#stack:
#foo: hello
#bar: world
#baz: 42

If we execute lookup("bar"), the regular expression will:

  • Match the stack header in group 1
  • Match everything up to and including "#bar: " in group 2
  • Match "world" in group 3
  • Use these groups to reconstruct the state with the value copied to the top of the stack

And so running the replacement will result in the following state:

%%
#stack:
world
#foo: hello
#bar: world
#baz: 42

The lookup operation preserved the original variable and its value while also placing a copy of the value on top of the stack. This allows us to read variable values without modifying them.
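You can check this with the Python sketch:

state = "%%\n#stack:\n#foo: hello\n#bar: world\n#baz: 42\n"
state = run(lookup("bar"), state)
print(repr(state))
# '%%\n#stack:\nworld\n#foo: hello\n#bar: world\n#baz: 42\n'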

Variable Assignment

Assigning to a variable presents an interesting challenge: we don't know if the variable already exists. We need to handle both cases: updating an existing variable or creating a new one.

Let me show you the implementation and then I'll walk you through it case by case.

def assign_pop(varname):
    return [
        (r"(%%)\n#stack:\n([^\n]*)\n" +
           "([^%]*#" + varname + r": )[^\n]*",
        r"\1`\n#stack:\n\3\2"),
        (r"(%%)([^`]\n?#stack:\n)([^\n%]*)\n([^%]*)",
        r"\1`\2\4#" + varname + r": \3\n"),
        (r"%%`",
        r"%%")
    ]

To begin let's assume the variable already exists. That is, the stack starts off looking like this, and assume we're calling assign_pop("bar"):

%%
#stack:
world
#foo: hello
#bar: something
#othervar: othervalue

When we run this regular expression list, we create the following capture groups:

%%
#stack:
world
#foo: hello
#bar: something
#othervar: othervalue

After the replacement operation, we get this output:

%%`
#stack:
#foo: hello
#bar: world
#othervar: othervalue

Now we proceed on to the next instruction, and we don't match it because the second regex fails if there's a back-tick after the program start %%. So nothing happens. And then finally, the third regex cleans things up for us.

Handling Non-Existent Variables: Let's consider what happens if the variable doesn't already exist. Again, assume we're calling assign_pop("bar"):

%%
#stack:
world
#foo: hello
#othervar: othervalue

The first regex tries to match but fails because there is no "#bar" anywhere. So it doesn't do anything. But now the second regex tries to match and succeeds. It creates the following capture groups:

%%
#stack:
world
#foo: hello
#othervar: othervalue

From here, we perform the rewrite and get the following output:

%%
#stack:
#foo: hello
#othervar: othervalue
#bar: world

And then the third regex is applied and does nothing.
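Continuing the Python sketch, here's the non-existent-variable case end to end (all three regexes, including the cleanup):

state = "%%\n#stack:\nworld\n#foo: hello\n#othervar: othervalue\n"
state = run(assign_pop("bar"), state)
print(repr(state))
# '%%\n#stack:\n#foo: hello\n#othervar: othervalue\n#bar: world\n'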

There are lots of instructions that use this trick to make sure we don't apply rewrites in an order we don't want. For example, as an exercise for yourself, try to understand how the "is equal" instruction works:

def eq():
    return [
        (r"(%%\n#stack:\n)([^\n]*)\n\2\n",
         r"\1`True\n"),
        (r"(%%\n#stack:\n)([^`][^\n]*)\n([^\n]*)\n",
         r"\1False\n"),
        # Cleanup: remove the back-tick marker.
        (r"(%%\n#stack:\n)`",
         r"\1")
    ]


(Branch-Free) Conditionals

Programming languages, in order to be interesting, usually need to have some kind of control flow. It's very hard to write a program without ever using an if statement. So let's now show how we're going to do this. (And I hope you did your homework, because we're going to use the same conditional execution trick again!) Here's the implementation of a conditional instruction:

def cond(tag):
    return [(r"%(%\n#stack:\nTrue)",
             r"%\1`"),
            (r"%(\n#stack:\nFalse)",
             tag+r"\1`"),
            (r"\n(True|False)`\n",
             "\n")]

Let's walk through how this is going to work, starting with the case where the top of the stack is False.

%%
#stack:
False
#variable: value

Initially, the first regex will fail to match, because the top element on the stack isn't True. So we go to the next regex, and see if it applies. This one does match, and makes the corresponding match groups.

%%
#stack:
False
#variable: value

After we apply the replacement, we get the following stack.

%tag
#stack:
False`
#variable: value

(And finally, using the same cleanup trick, we'll remove the used marker.)

Now see what happened here? The program no longer begins with %%. This means that EVERY instruction will fail to match, because they always make sure that the program begins with %%. So nothing else will happen.... until we reactivate it later with the following simple instruction:

def reactivate(tag):
    return [(r"%"+tag+r"\n([^%]*)",
             r"%%\n\1")]

Let's now return to the True case for the conditional. This is the easy case: we basically don't do anything at all. We mark the True with a back-tick using the first regex, and then delete this line on the third. Easy.

Notice that this means our code is actually branch-free, because every instruction is a conditional instruction. (Kind of like ARM's predicated execution, where most instructions can be conditionally executed based on status flags rather than using explicit branch instructions.)
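Here's the False path in action with the Python sketch (the tag name "skip" is arbitrary):

state = "%%\n#stack:\nFalse\n#variable: value\n"
state = run(cond("skip"), state)
print(repr(state))       # '%skip\n#stack:\n#variable: value\n'

state = run(push("ignored"), state)   # no leading %%: nothing matches
state = run(reactivate("skip"), state)
print(repr(state))       # '%%\n#stack:\n#variable: value\n'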


Loops (are impossible)

Because our program just consists of a sequence of regular expressions, you can't loop at all! That, technically, means our machine can't perform Turing-complete computation. But we can do any bounded computation by just unrolling any loops we may have. And fortunately computing the next move in a chess position is a bounded computation, so we can do just that.
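Unrolling is as simple as emitting the loop body multiple times. A sketch (my illustration, not the original code):

def unroll(body, n):
    # A loop with a constant bound becomes n copies of its body.
    return [pair for _ in range(n) for pair in body]

pop_three = unroll(pop(), 3)   # pops the top three stack entries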

Single-Instruction Multiple-Data

And now for my absolute favorite part of the language we've developed. By the magic of regular expressions (and the fact that they perform substitution globally over the entire string), we can run multiple threads simultaneously!

That is, if we just write our state string as:

%%
#stack:
int0000101010
int0001011100
%%
#stack:
int0000001101
int0110000000

When we call binary_add(), both additions happen simultaneously! After execution:

%%
#stack:
int0010001110
%%
#stack:
int0110001101

The reason this happens is because the regular expression matches work globally. When we match the "begin thread" operator (%%) twice, we get to perform operations on both threads simultaneously.
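You can see the same effect with the Python sketch: a single push() lands on every active thread at once, because re.sub substitutes globally:

state = "%%\n#stack:\na\n%%\n#stack:\nb\n"
state = run(push("x"), state)
print(repr(state))
# '%%\n#stack:\nx\na\n%%\n#stack:\nx\nb\n'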

So how do we actually make use of this feature? Let's look at some instructions that help us create and manage threads.


The Fork Instructions

Here's a simple fork instruction that splits every currently running thread into two, with the second one starting off inactive with a given tag:

def fork_inactive(tag):
    return [(r"%%\n([^%]*)",
             r"%%\n\1" + "%"+tag+r"\n\1")
            ]
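With the Python sketch:

state = "%%\n#stack:\nhello\n"
state = run(fork_inactive("waiting"), state)
print(repr(state))
# '%%\n#stack:\nhello\n%waiting\n#stack:\nhello\n'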

We can also, for example, fork() on a boolean, giving one thread the True case and another the False case. (This is something like McCarthy's amb operator.)

def fork_bool(variable):
    return [(r"%%\n([^%]*)",
             r"%%\n\1#"+variable+r": True\n%%\n\1#"+variable+r": False\n")
            ]

Let's see what happens when we apply multiple forks. Starting with a simple state:

%%
#stack:
somevalue
#x: 10

After calling fork_bool("condition"), we get:

%%
#stack:
somevalue
#x: 10
#condition: True
%%
#stack:
somevalue
#x: 10
#condition: False

If we then call fork_bool("c2"), each existing thread splits into two:

%%
#stack:
somevalue
#x: 10
#condition: True
#c2: True
%%
#stack:
somevalue
#x: 10
#condition: True
#c2: False
%%
#stack:
somevalue
#x: 10
#condition: False
#c2: True
%%
#stack:
somevalue
#x: 10
#condition: False
#c2: False

Now we have four simultaneous execution paths, exploring every possible combination of our boolean conditions at the same time. This is exceptionally useful for chess, when we might frequently want to consider multiple possible board states at the same time, and (for example) score them to see which is best. Instead of having to loop over every possible board state, we can just pretend we were doing it once but have them all happen at the same time.


Compiling to our little language

Now that we have our CPU emulator, we can build a compiler to target our new assembly language.

"But wait I'm not reading this post for a lesson in compilers!" you say?? Fair point. Also I didn't go into this project trying to build a compiler, so instead what I have is more of a macro-assembler. It turns python-ish programs like this:

def fib():
    a = 1
    b = 2
    for _ in range(10):
        next = a + b
        a = b
        b = next

Into a sequence of instructions like this:

push(1)
assign_pop('a')
push(2)
assign_pop('b')
lookup('a')
lookup('b')
binary_add()
assign_pop('next')
lookup('b')
assign_pop('a')
lookup('next')
assign_pop('b')
[... loop body repeated 9 more times ...]


Compiling Through Symbolic Execution

Rather than writing a traditional compiler with parsing and code generation phases, I took an unusual approach: symbolic execution. The key insight is that we can "execute" the Python code in a special way that records what operations would happen rather than actually performing them.

Here's how it works: the variables object our traced program manipulates isn't actually a dictionary---it's a special object that records every operation performed on it, creating what we call a "trace" of the execution (there's a sketch of one after the list below). When you write:

a = b + 1

The tracer object records four operations:

  1. A lookup of variable 'b'
  2. A push of the constant 1
  3. A binary_add operation
  4. An assign_pop to variable 'a'
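A minimal sketch of such a tracer (the names here are illustrative, not the actual implementation):

class Tracer:
    def __init__(self):
        self.ops = []
    def __getitem__(self, name):           # reading variables['b']
        self.ops.append(("lookup", name))
        return self
    def __add__(self, const):              # ... + 1
        self.ops.append(("push", const))
        self.ops.append(("binary_add",))
        return self
    def __setitem__(self, name, _value):   # variables['a'] = ...
        self.ops.append(("assign_pop", name))

variables = Tracer()
variables['a'] = variables['b'] + 1
print(variables.ops)
# [('lookup', 'b'), ('push', 1), ('binary_add',), ('assign_pop', 'a')]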

Handling Control Flow

The trickiest part of this approach is handling branching control flow---if statements, specifically. (What about loops? Those get unrolled at compile time, which is why loop bounds can only be constants.) We need to make sure we capture all possible execution paths. Here's how we do it:

When we hit a conditional, we create two separate paths in our trace---one for when the condition is true, and one for when it's false. Each path records its own sequence of operations. Later, we merge these paths back together.

For example, this Python code:

if x > 0:
    y = 1
else:
    y = 2

Generates something like this trace structure:

lookup('x')          # Get x's value
push(0)              # Push 0
greater_than()       # Compare
cond('tag1')         # Branch on result
# True path:
push(1)
assign_pop('y')
pause('tag2')
reactivate('tag1')
# False path:
push(2)
assign_pop('y')
reactivate('tag2')

The magic of the compiler lies in how it handles control flow through branching. Let me explain this in a little bit more detail.

When our compiler first starts processing code, it maintains a single linear path of instructions in the CallTree. Each instruction gets appended one after another as we go. This list of instructions looks entirely linear. When we reach a conditional statement, though, things get interesting.

Consider what happens when we hit a conditional statement like the x > 0 conditional above. When this happens, I detect it, and create a branch in the call tree representing the two paths simultaneously. Then, I just pretend this conditional was true, and fill out the true case of the tree. When we reach the end of the program we've done some of our compilation, but not all of it.

Now comes the clever part. When compiling, we don't just trace the program once. We trace many times. And each time we trace, we arrange for the conditionals to go through whichever branch has been taken least often. In this way, the second time around, we record the instructions for the false branch.

Finally, once the conditional is over, we detect this and merge the two branches back together. This branching and merging mechanism is more than just a clever trick---it's essential to how our regex-based CPU actually works. When we convert this tree into instructions, each branch gets translated into a conditional regex operation (using our cond instruction) that can selectively activate and deactivate different parts of our program state. The merge points become reactivate instructions that ensure we continue execution on the correct path.


Writing a Chess Engine

Okay, so we're finally to the part of this post where we can actually start writing a chess engine. (And at the part where some of the emulator design decisions will make sense.)

For the most part, this is entirely straightforward and mirrors how you would write a chess engine in any other programming language. But this branching thing with SIMD is what gives us the power to make things go fast.

Let's consider the following (simplified) program that calculates all the valid pawn moves.

def pawn_moves(initial_board):
    # Step 1: Find all pawns and create a list of their positions
    pawnpos_lst = find_all_pawns(initial_board)

    # Step 2: Create parallel states for each pawn (up to 8)
    MAX_PAWNS = 8
    for iteration in range(MAX_PAWNS):
        if not pawnpos_lst.is_empty():
            pawn_pos = pawnpos_lst.remove_first()
            fork_inactive("waiting")

    # Step 3: Switch to processing the parallel states
    pause("main")
    reactivate("waiting")

    # Step 4: Generate moves for all pawns simultaneously
    candidate_moves = []
    if initial_board[pawn_pos + (0, 1)].is_empty():
        candidate_moves.append(pawn_pos + (0, 1))
        if pawn_pos.y == 2:
            if initial_board[pawn_pos + (0, 2)].is_empty():
                candidate_moves.append(pawn_pos + (0, 2))
    if initial_board[pawn_pos + (1, 1)].has_opponent():
        candidate_moves.append(pawn_pos + (1, 1))
    if initial_board[pawn_pos + (-1, 1)].has_opponent():
        candidate_moves.append(pawn_pos + (-1, 1))

    # Step 5: Switch back and merge results
    pause("inactive")
    reactivate("main")
    candidate_moves = merge_variables_from_threads("inactive")

Step 1: Finding the Pawns

The find_all_pawns() function scans through our board representation, looking for white pawns (represented as 'P' in the FEN string). It returns a list of the position of each of these pawns. As an example, if we run our program on the following position with three pawns on d2, e2, and f2, this creates a semicolon-separated list in the pawnpos_lst variable as follows:

%%
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos_lst: d2;e2;f2;

Step 2: State Creation

Now comes the clever part. The fork_inactive instruction, as described above, duplicates our entire program state. Each time it runs, it creates an exact copy of the currently running thread, but marks the new copy with %waiting instead of %%. (Recall this means instructions won't apply to this thread.) At the same time, it takes one position from our pawnpos_lst and assigns it to a new variable pawnpos in the copied state.

When our loop runs three times, each fork_inactive operation splits off a new parallel universe where we'll process a different pawn. The regex operation that does this copying preserves all existing variables but adds the new pawnpos variable with the specific position for that copy to process.

%%
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos_lst:
%waiting
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: d2
%waiting
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: e2
%waiting
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: f2

Step 3: Activation Switch

Recall that the pause and reactivate operations work by manipulating the %% markers that indicate which states are active. The pause("main") operation changes our original %% to %main, effectively deactivating it. Then reactivate("waiting") finds all states marked with %waiting and changes them to %%, making them the new active states.

Step 4: Parallel Move Generation

Here's where the SIMD magic happens. Each check for a possible move---forward one square, forward two squares, or diagonal captures---executes across all active states simultaneously. When we check if the square ahead is empty, we're actually checking d3, e3, and f3 all in one operation. For each valid move, we add this to the candidate_moves list.

(I've significantly simplified the program for visual purposes here. In reality I don't work directly over the FEN strings but expand them to 64 independent variables for each of the 64 squares and read and write to these squares directly. This makes processing the board state much easier. But for the purpose of this blog post it's easier to show visually as if it worked on FEN strings alone.)

%main
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos_lst:
%%
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: d2
#candidate_moves: d3;d4
%%
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: e2
#candidate_moves: e3;e4
%%
#stack:
#board: 4k3/8/8/8/8/8/3PPP2/4K3
#pawnpos: f2
#candidate_moves: f3;f4

Step 5: Merging Results

The final merge operation combines all the candidate_moves lists from our parallel states. It first switches activation back to the main state using pause and reactivate. Then merge_variables_from_threads (again, a pseudo-op I've made up for visual clarity; in practice this requires something like 10 different instructions on my real machine) matches all the candidate_moves lists across our inactive states and concatenates them together.

What this means is that while the code we wrote looks like it processes one pawn at a time, our regex engine's ability to process multiple states means we're actually handling all pawns simultaneously. Every operation, from checking empty squares to building move lists, happens in parallel across all our active states.

And this is how every piece operation works. Because this post is already getting quite long, I won't walk through each piece one by one, but if you're interested definitely go look at the chess-engine.py for the details.


Playing a Turn

Now let's walk through the overall game loop that handles everything.

def play_turn():
    # Step 1: Read the human entered move from the input
    board_before_move, src, dst = from_pretty_utf8_to_fen()

    # Step 2: Check if their move is valid
    after_move = board_before_move.make_move(src, dst)
    next_boards = compute_legal_boards(board_before_move)
    next_board = fork_on_list(next_boards)
    if after_move != next_board:
        destroy_active_thread()

    # Step 3: Generate the computer's reply
    candidate_boards = compute_and_score_legal_boards(after_move)
    candidate_board = fork_on_list(candidate_boards)
    keep_best_scoring_board(score)
    from_fen_to_pretty_utf8(candidate_board)

Say we're at the start of a game, and the human enters "e2e4" (the king's pawn opening). Here's how our code processes this:

Step 1: Reading the Move

Initially, the function from_pretty_utf8_to_fen() converts our pretty-printed board with Unicode chess pieces back into FEN notation. It also extracts the source and destination squares from the input:

%%
#stack:
#board: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
#src: e2
#dst: e4

Step 2: Move Validation

Now we need to check if this is a valid move. Rather than writing entirely new code that explicitly checks if a move is legal, we use our parallel processing power again. The process works in three stages:

First, make_move applies the human's move to create a new board state:

%%
#stack:
#board: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR
#after_move: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR

Then compute_legal_boards generates all possible legal moves from the starting position, creating a list like:

%%
#stack:
#next_boards: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR; rnbqkbnr/pppppppp/8/8/8/3P4/PPP1PPPP/RNBQKBNR; rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR; ...

Finally, fork_on_list creates parallel states for each legal board position:

%%
#stack:
#board: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR
%%
#stack:
#board: rnbqkbnr/pppppppp/8/8/8/3P4/PPP1PPPP/RNBQKBNR
%%
#stack:
#board: rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR

The destroy_active_thread() call removes any thread where the board doesn't match after_move. In our case, only the e2-e4 position survives, confirming it's a legal move. (If no legal move is made, I have a special regex that will replace the entire output with just the hard-coded text "Illegal Move.")

Step 3: Computer's Reply

Now we repeat a similar process to find the best reply. First, compute_and_score_legal_boards generates all possible black responses. This function does a little bit of magic: when it returns the next possible boards, it attaches to each board the score of that board after white's next best move. I'll explain how this works below, but for now it suffices to know that this function returns the possible positions just as compute_legal_boards does, but also the score of each position. (This is where the MiniMax happens. The score here is how good each position is from the player's perspective, after they have made their best reply move.)
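In ordinary Python, the 2-ply minimax idea looks something like this (a rough sketch; the real engine does this with forked regex threads, and legal_boards / evaluate are stand-ins for its move generator and scorer):

def score_after_best_reply(board, legal_boards, evaluate):
    # White will answer with the move that is worst for black, so black
    # scores a candidate board by white's best (for white) reply.
    return min(evaluate(b) for b in legal_boards(board, to_move="white"))

def best_black_move(board, legal_boards, evaluate):
    candidates = legal_boards(board, to_move="black")
    return max(candidates,
               key=lambda b: score_after_best_reply(b, legal_boards, evaluate))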


Complex Django filters with Subquery


Node’s new built-in support for TypeScript


Starting with v23.6.0, Node.js supports TypeScript without any flags. This blog post explains how it works and what to look out for.


Things to argue about over the holidays instead of politics III

  1. How much should a couple talk if they are having dinner in a restaurant, after being together for one month/year/decade?

  2. If it’s safer to face backwards in vehicles, then what is it that’s shared by infants in cars and soldiers on military planes but no one else?

  3. Among rock bands with > 250 million albums sold, 5/6 are from the UK vs. 1/6 from the US. (Or 6/8 vs. 2/8 if you count Elton John and Michael Jackson). Why?

  4. If you swim in public pools, is that because you reject the research suggesting that most contain 30-80 liters of urine, or because you don’t mind?

  5. Is declining fertility destined to be reversed through growth of high-fertility subcultures, or could there be competing low-fertility cultures that peel people away so effectively that population declines forever?

  6. Does postapocalyptic fiction represent a yearning for the life and death of pre-history? If so, why is there so little fiction based on ordinary pre-history life?

  7. What is the best Brassica? I collect here the most common for readers negligent in their Brassica studies:

    species    examples
    nigra      black mustard
    oleracea   kale, cabbage, collard greens, broccoli, cauliflower, Chinese broccoli, Brussels sprouts, turnip cabbage
    rapa       bok choy, choy sum, napa cabbage, turnip, canola (sometimes)
    carinata   Ethiopian rape, Ethiopian mustard
    napus      rutabaga, rapeseed, turnip (sometimes), canola (often)
    juncea     mustard greens, Chinese/Indian/Korean mustard, canola (sometimes)
  8. Does it matter to you, emotionally, that only nigra, oleracea, and rapa are diploid species, and the others are lame tetraploids that just add together the ancestral genomes (carinata=nigra+oleracea, napus=rapa+oleracea, juncea=nigra+rapa) as first proposed in 1935 in Woo Jang-choon’s Triangle of U theory?

  9. Which color of food should there be more of?

  10. Choose two yes/no opinion questions. Around 50% of the population should say yes to each. What do you choose to get the smallest possible intersection?

  11. Do you think you could cultivate a taste in types of music you don’t currently like if you tried? If yes, why don’t you?

  12. What percentage of young people’s increased propensity to explore new music is driven by social taste competition?

  13. Say you were born in China/America/India (assuming you weren’t) but you are still “you”. What’s the biggest difference?

  14. What are the odds that humans are just too dumb to understand the universe and our place in it?

  15. What are the odds that consciousness isn’t unified and that there isn’t just a single “you” in your head experiencing the world?

  16. If you could cure all heart disease or all cancer, which would you pick?

  17. Take the conditions in the current Diagnostic and Statistical Manual, e.g.

    • Antisocial personality disorder
    • Attention-deficit/hyperactivity disorder
    • Autism spectrum disorder
    • Avoidant personality disorder
    • Bipolar II disorder
    • Borderline personality disorder
    • Dissociative amnesia
    • Major depressive disorder
    • Paranoid personality disorder
    • Schizophrenia
    • Tourette’s disorder
    • etc.

    What fraction of these concepts will still exist in 100 years?

  18. Does time actually go faster as you get older, or do we just remember it differently?

  19. What foods are best eaten with chopsticks but don’t come from food traditions that use chopsticks? What about with Western cutlery? What about with your hands?

  20. I’ve written to the “corresponding author” of dozens of published research papers, always with polite and straightforward questions. Something like 80% do not respond, at all. Even when they promised the journal to make data available upon request, most do not respond. How much of a jerk would I be to publish a list of those papers?

  21. To what degree is the anomalous self-confidence of American economists explained simply by the existence of Milton Friedman?

  22. Say you can have your eyes upgraded to either (a) have a fourth group of cone cells peaking around 370 nm in the UV band, like birds, or (b) be sensitive to polarization, like cephalopods. Your visual cortex is also upgraded to process the information. Which do you choose?

  23. How different would the next 20 years be if things created entirely by humans (without AI) got an unfakable gold star?

  24. How much do normies really miss out on by using a Bourdieu-007 final last (1) ACTUALLY FINAL 2b edited.txt style file naming scheme instead of learning a source control system?

  25. If you could make yourself enjoy something less, what would it be? Would that make you enjoy life more, overall?

  26. How much does it matter how open people are to persuasion? If people were much more/less open to being convinced by arguments and changing their minds, how much better/worse would the world be? Assume that people are only convinced by “good” arguments.

  27. Say you flip a standard US penny 20 times, getting 15 heads. What’s your best estimate for the true bias of the coin, if you think there’s a 50% chance Bayesian statistics are correct?

  28. Per minute, is there really any philosophy text that offers more insight than Existential Troopers?

  29. Who would win in a wrestling match, Jesus or Shirdi Sai Baba?

  30. In an alternative universe where yeast broke down sugar molecules into fentanyl rather than alcohol, would we all get together and celebrate the new year by consuming fentanyl?

If you’d like to answer these questions instead of using them to start arguments as intended, you can do that here.

(previously) (previously)




Inside the Dune 2 score with Osmose and Hans Zimmer


A sequel is only worth its weight in spice if it can best the original. For Dune 2, Hans Zimmer and his team went further with their electronic instrumentation to immerse us even deeper in otherworldly sounds. Now we get some insights into how they did it, and why the Expressive E Osmose was a starring player.



Full Text Search in Milliseconds with Rails and PostgreSQL


Imagine the following scenario: You have a database full of job titles and descriptions, and you’re trying to find the best match. Typically you’d start by using an ILIKE expression, but this requires the search phrase to be an exact match. Then you might use trigrams, allowing spelling mistakes and inexact matches based on word similarity, but this makes it difficult to search using multiple words. What you really want to use is Full Text Search, providing the benefits of ILIKE and trigrams, with the added ability to easily search through large documents using natural language.

To summarize, here is a quick overview of popular built-in Postgres search options:

Postgres Feature    Typical Use Case                        Can be indexed?   Performance
LIKE/ILIKE          Wildcard-style search for small data    Sometimes         Unpredictable
pg_trgm             Similarity search for names, etc.       Yes (GIN/GIST)    Good
Full Text Search    Natural language search                 Yes (GIN/GIST)    Good

In this article, we are going to learn about the inner workings of Full Text Search in Postgres and how to easily integrate Full Text Search into your Rails application using a fantastic gem named pg_search. We will learn how to search multiple columns at once, how to give one column precedence over another, and how to optimize our Full Text Search implementation, taking a single query from 130ms to 7ms.

The full source code used in this article can be found here. Instructions on how to run this application locally and how to load the sample data referenced within this article can be found in the README.

If you are interested in efficient Full Text Search in Postgres with Django, you can read our article about it.

The Foundations of Full Text Search

Let's break down the basics of Full Text Search, defining and explaining some of the most common terms you'll run into. Taking the text “looking for the right words”, we can see how Postgres stores this data internally, using the to_tsvector function:

SELECT to_tsvector('english', 'looking for the right words');
-- 'look':1 'right':4 'word':5

In the above SQL we have some text; often referred to as a document when talking about Full Text Search. A document must be parsed and converted into a special data type called a tsvector, which we did using the function to_tsvector.

The tsvector data type is comprised of lexemes. Lexemes are normalized key words which were contained in the document that will be used when searching through it. In this case we used the english language dictionary to normalize the words, breaking them down to their root. This means that words became word, and looking became look, with very common words such as for and the being removed completely, to avoid false positives.

SELECT to_tsvector('english', 'looking for the right words') @@ to_tsquery('english', 'words');
-- TRUE

The @@ operator allows us to check if a query (data type tsquery) exists within a document (data type tsvector). Much like tsvector, tsquery is also normalized prior to searching the document for matches.

SELECT
  ts_rank(
    to_tsvector('looking for the right words'),
    to_tsquery('english', 'words')
   );
-- 0.06079271

The ts_rank function takes a tsvector and a tsquery, returning a number that can be used when sorting the matching records, allowing us to sort the results from highest to lowest ranking.

Now that you have seen a few examples, let’s have a look at one last one before getting to Rails. Following, you can see an example of a query which searches through the jobs table where we are storing the title and description of each job. Here we are searching for the words ruby and rails, grabbing the 3 highest ranking results.

SELECT
  id,
  title,
  ts_rank(
    to_tsvector('english', title) || to_tsvector('english', description),
    to_tsquery('english', 'ruby & rails')
  ) AS rank
FROM jobs
WHERE
  to_tsvector('english', title) || to_tsvector('english', description) @@
  to_tsquery('english', 'ruby & rails')
ORDER BY rank DESC
LIMIT 3

The highest ranking result is a job with the title "Ruby on Rails Developer"... perfect! The full results of this query are:

[
  {
    "id": 1,
    "title": "Ruby on Rails Developer",
    "rank": 0.40266925
  },
  {
    "id": 109,
    "title": "Senior Ruby Developer - Remote",
    "rank": 0.26552397
  },
  {
    "id": 151,
    "title": "Team-Lead Developer",
    "rank": 0.14533159
  }
]

This query is actually concatenating (using ||) two tsvector fields together. This allows us to search both the title and the description at the same time. Later, we'll see how to give additional weight (precedence) to the title column.

Implementing Postgres Full Text Search in Rails

With a basic understanding of Full Text Search under our belts, it's time to take our knowledge over to Rails. We will be using the pg_search Gem, which can be used in two ways:

  1. Multi Search: Search across multiple models and return a single array of results. Imagine having three models: Product, Brand, and Review. Using Multi Search we could search across all of them at the same time, seeing a single set of search results. This would be perfect for adding federated search functionality to your app.

  2. Search Scope: Search within a single model, but with greater flexibility.

We will be focusing on the Search Scope approach in this article, as it lets us dive into the configuration options available when working with Full Text Search in Rails. Let's add the Gem to our Gemfile and get started:

# Gemfile
gem 'pg_search', '~> 2.3', '>= 2.3.2'

With that done, we can include a module in our Job model, and define our first searchable field:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_title, against: :title
end

This adds a class level method to Job, allowing us to find jobs with the following line, which automatically returns them ranked from best match to worst.

Job.search_title('Ruby on Rails')

If we were to append to_sql to the above Ruby statement, we can see the SQL that is being generated. I have to warn you, it’s a bit messy, but that is because it handles not only searching, but also putting the results in the correct order using the ts_rank function.

SELECT
  "jobs".*
FROM
  "jobs"
  INNER JOIN (
    SELECT
      "jobs"."id" AS pg_search_id,
      (ts_rank((to_tsvector('simple', coalesce("jobs"."title"::text, ''))), (to_tsquery('simple', ''' ' || 'Ruby' || ' ''') && to_tsquery('simple', ''' ' || 'on' || ' ''') && to_tsquery('simple', ''' ' || 'Rails' || ' ''')), 0)) AS rank
    FROM
      "jobs"
    WHERE ((to_tsvector('simple', coalesce("jobs"."title"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'Ruby' || ' ''') && to_tsquery('simple', ''' ' || 'on' || ' ''') && to_tsquery('simple', ''' ' || 'Rails' || ' ''')))) AS pg_search_5d9a17cb70b9733aadc073 ON "jobs"."id" = pg_search_5d9a17cb70b9733aadc073.pg_search_id
ORDER BY
  pg_search_5d9a17cb70b9733aadc073.rank DESC,
  "jobs"."id" ASC

Configuring pg_search

There are a number of ways you can configure pg_search: From support for prefixes and negation, to specifying which language dictionary to use when normalizing the document, as well as adding multiple, weighted columns.

By default pg_search uses the simple dictionary, which does zero normalization, but if we wanted to normalize our document using the english dictionary, searching across both the title and description, it would look like:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: %i[title description],
                  using: { tsearch: { dictionary: 'english' } }
end

We can perform a search in the same way we did before: Job.search_job("Ruby on Rails"). If we wanted to give higher precedence to the title column, we can add weighting scores to each of the columns, with possible values of: A, B, C, D.

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: { title: 'A', description: 'B' },
                  using: { tsearch: { dictionary: 'english' } }
end

When you start combining columns, weighting them, and choosing which dictionary provides the best results, it really comes down to trial and error. Play around with it, try some queries and see if the results you get back match with what you are expecting!


Optimizing Full Text Search Queries in Rails

We have a problem! The query that is produced by Job.search_job("Ruby on Rails") takes an astounding 130ms. That may not seem like such a large number, but it is astounding because there are only 145 records in my database. Imagine if there were thousands! The majority of time is spent in the to_tsvector function. We can verify this by running this streamlined query below, which takes almost as much time to execute as the full query which actually finds the matching jobs:

SELECT to_tsvector('english', description) FROM jobs;
-- ~130ms

SELECT description FROM jobs;
-- ~15ms

This tells me that the slowness is in re-parsing and normalizing the document into a tsvector data type every single time the query is executed. The folks at thoughtbot have a great article about Full Text Search optimizations, where they add a pre-calculated tsvector column, keeping it up-to-date with triggers. This is great because it allows us to avoid re-parsing our document for every query and also lets us index this column!

There is a similar but slightly different approach I want to cover today which I learned by reading through the Postgres documentation. It also involves adding a pre-calculated tsvector column, but is done using a stored generated column. This means we don't need any triggers! It should be noted that this approach is only available in Postgres 12 and above. If you are using version 11 or earlier, the approach in the thoughtbot article is probably still the best one.

As we are venturing into the territory of more custom Postgres functionality, not easily supported by the Rails schema file in Ruby, we'll want to switch the schema format from :ruby to :sql. This line can be added to the application.rb file:

config.active_record.schema_format = :sql

Now, let's generate a migration to add a new column to the jobs table which will be automatically generated based on the setweight and to_tsvector functions:

class AddSearchableColumnToJobs < ActiveRecord::Migration[6.0]
  def up
    execute <<-SQL
      ALTER TABLE jobs
      ADD COLUMN searchable tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(description,'')), 'B')
      ) STORED;
    SQL
  end

  def down
    remove_column :jobs, :searchable
  end
end

Note that, as of the writing of this article, Postgres always requires a generated column to be a “Stored” column. That means it actually occupies space in your table and gets written on each INSERT/UPDATE. This also means that when you add a generated column to a table, it will require a rewrite of the table to actually set the values for all existing rows. This may block other operations on your database.

With our tsvector column added (which is giving precedence to the title over the description, is using the english dictionary, and is coalescing null values into empty strings), we're ready to add an index to it. Either GIN or GiST indexes can be used to speed up full text searches, but Postgres recommends GIN as the preferred index due to GiST searches being lossy, which may produce false matches. We'll add it concurrently to avoid locking issues when adding an index to large tables.

class AddIndexToSearchableJobs < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :jobs, :searchable, using: :gin, algorithm: :concurrently
  end
end

The last thing we need to do is to tell pg_search to use our tsvector searchable column, rather than re-parsing the title and description fields each time. This is done by adding the tsvector_column option to tsearch:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: { title: 'A', description: 'B' },
                  using: {
                    tsearch: {
                      dictionary: 'english', tsvector_column: 'searchable'
                    }
                  }
end

With this optimization done, we have gone from around 130ms to 7ms per query... not bad at all!


Conclusion

Let’s have a look at a real-life data set. We can prove the precision of our approach by looking at my database: Out of 145 jobs pulled from the GitHub and Hacker News job APIs, searching for "Ruby on Rails" returns the following results:

[
  "Ruby on Rails Developer",
  "Senior Ruby Developer - Remote",
  "Gobble (YC W14) – Senior Full Stack Software Engineers – Toronto, On",
  "DevOps (Remote - Europe)",
  "CareRev (YC S16) Is Hiring a Senior Back End Engineer in Los Angeles",
  "Software Engineer, Full Stack (Rails, React)",
  "Software Engineer",
  "Technology Solutions Developer"
]

To summarize:

We have shown how to use Postgres' Full Text Search within Rails and how to customize it, both in terms of functionality and in terms of performance. We ended up with a performant and flexible solution right inside the database we were already using.

Many use cases for Full Text Search can be implemented directly inside Postgres, avoiding the need to install and maintain additional services such as Elasticsearch.

If you find this article useful and want to share it with your peers you can tweet about it here.


About the author

Leigh Halliday is a guest author for the pganalyze blog. He is a developer based out of Canada who works at FlipGive as a full-stack developer. He writes about Ruby and React on his blog and publishes React tutorials on YouTube.

