Building a WoW server in Elixir

Thistle Tea is my new World of Warcraft private server project. You can log in, create a character, run around, and cast spells to kill mobs, with everything synchronized between players as expected for an MMO. It was floating around in my head to build this for a while, since I have an incurable nostalgia for early WoW. I first mentioned this on May 13th and didn’t expect to get any further than login, character creation, and spawning into the map. Here’s a recount of the first month of development.

Prep Work

Before coding, I did some research and came up with a plan.

  • Code this in Elixir, using the actor model.
  • MaNGOS has done the hard work on collecting all the data, so use it.
  • Use Thousand Island as the socket server.
  • Treat this project as an excuse to learn more Elixir.
  • Reference existing projects and documentation rather than working from scratch.
  • Speed along the happy path rather than try and handle every error.

Day 1 - June 2nd

There are two main parts to a World of Warcraft server: the authentication server and the game server. Up first was authentication, since you can’t do anything without logging in.

To learn more about the requests and responses, I built a quick MITM proxy between the client and a MaNGOS server to log all packets. It wasn’t as useful as expected, since not much was consistent, but it did help me internalize how the requests and responses worked.

The first byte of an authentication packet is the opcode, which indicates which message it is, and the rest is a payload with the relevant data. I was able to extract the fields from the payload by pattern matching on the binary data.
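As a rough illustration of that pattern matching, here is a minimal sketch. The module name, the opcode values (taken from commonly circulated protocol notes), and the length-prefixed username layout are all assumptions for the example, not the project's actual code or the real payload format.

```elixir
defmodule AuthPacketSketch do
  # Opcode values are assumptions for illustration only.
  @cmd_auth_logon_challenge 0x00
  @cmd_auth_logon_proof 0x01
  @cmd_realm_list 0x10

  # The first byte selects the message; everything after it is the payload.
  def parse(<<@cmd_auth_logon_challenge, payload::binary>>), do: {:logon_challenge, payload}
  def parse(<<@cmd_auth_logon_proof, payload::binary>>), do: {:logon_proof, payload}
  def parse(<<@cmd_realm_list, payload::binary>>), do: {:realm_list, payload}
  def parse(<<opcode, _rest::binary>>), do: {:unknown, opcode}

  # Hypothetical payload layout: one length byte followed by the username.
  def username(<<len, name::binary-size(len), _rest::binary>>), do: {:ok, name}
  def username(_payload), do: :error
end
```

Each function clause doubles as a parser for one message, so an unexpected opcode simply falls through to the last clause.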

The auth flow can be simplified as:

  • Client sends a CMD_AUTH_LOGON_CHALLENGE packet
  • Server sends back some data for the client to use in crypto calculations
  • Client sends a CMD_AUTH_LOGON_PROOF packet with the client proof
  • If the client proof matches what’s expected, server sends over the server_proof
  • Client is now authenticated

It uses SRP6, which I hadn’t heard of before this. Seems like the idea is to avoid transmitting an unencrypted password and instead both the client and server independently calculate a proof that only matches if they both know the correct password. If the proof matches, then authentication is successful.

So basically, what I needed to do was:

  • Listen for data over the socket
  • Once data is received, parse which message it is out of the header section
  • Handle each one differently
  • Send back the appropriate packet

This whole part is well documented, but I still ran into some issues with the cryptography. Luckily, I found a blog post and an accompanying Elixir implementation, so I was able to substitute my broken cryptography with working cryptography. Without that, I would’ve been stuck at this part for a very long time (maybe forever). Wasn’t able to get login working on day 1, but I was close.

Day 2 - June 3rd

I spent some time cleaning up the code and found a logic error where I reversed some crypto bytes that weren’t supposed to be reversed. Fixing that made auth work, finally getting a success with hardcoded credentials.

Next up was getting the realm list to work, by handling CMD_REALM_LIST and returning which game server to connect to.

This got me out of the tedious auth bits and I could get to building the game server.

Day 3 - June 4th

The goal for today was to get spawned into the world. But first more tedious auth bits.

The game server auth flow can be simplified as:

  • When client first connects, server sends SMSG_AUTH_CHALLENGE, with a random seed
  • Client sends back CMSG_AUTH_SESSION, with another client proof
  • If client proof matches server proof, server sends a successful SMSG_AUTH_RESPONSE

This negotiates how to encrypt/decrypt future packet headers. Luckily Shadowburn also had crypto code for this, so I was able to use it here. The server proof requires a value previously calculated by the authentication server, so I used an Agent to store that session value. It worked, but I later refactored it to use ETS for a simpler interface.

After that, it’s something like:

  • Client sends message to server
  • Server decrypts header, which contains message size (2 bytes) and opcode (4 bytes)
  • Server handles message and sends back 0 or more messages to client
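Once the header is decrypted, splitting it apart is another binary pattern match. This is a sketch under a few assumptions: the size is big-endian, the opcode is little-endian, and the size counts the opcode plus the body. The module name and the opcode number used in the usage note are made up for the example.

```elixir
defmodule HeaderSketch do
  # Expects a packet whose 6-byte header is already decrypted:
  # 2 bytes of size, then 4 bytes of opcode.
  def parse(<<size::big-16, opcode::little-32, rest::binary>>) when size >= 4 do
    # Assumption: size covers the 4 opcode bytes plus the body.
    body_size = size - 4

    case rest do
      <<body::binary-size(body_size), remaining::binary>> ->
        {:ok, opcode, body, remaining}

      _ ->
        :incomplete
    end
  end

  def parse(_data), do: :incomplete
end
```

For example, `HeaderSketch.parse(<<8::big-16, 0x1DC::little-32, 1, 2, 3, 4>>)` returns the opcode, the four body bytes, and an empty remainder.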

First was handling CMSG_CHAR_CREATE and CMSG_CHAR_ENUM, so I could create and list characters. I originally used an Agent for storage here as well, which made things quick to prototype.

Then I got side-tracked for a bit trying to get equipment to show up, since I had all the equipment display ids hardcoded to 0. I looked through the MaNGOS database and hardcoded a few proper display ids before moving on.

After that was handling CMSG_PLAYER_LOGIN. I found an example minimal SMSG_UPDATE_OBJECT spawn packet, which was supposed to spawn me in Northshire Abbey.

That’s probably the most important packet, since it handles both:

  • Spawning things into the world, like players, mobs, objects, etc.
  • Updating their values, like health, position, level, etc.

It has a lot of different forms, can batch multiple object updates into a single packet, and has a compressed variant.

Whoops, had the coordinates a bit off.

After fixing that, I was in the human starting area as expected. No player model yet, though.

I thought movement was broken, but it turns out all keybinds were being unset on every login, so the movement keys just weren’t bound. Manually navigating to the keybinding configuration and resetting them to default allowed me to move around.

Next up was adding more to that spawn packet to use the player race and proper starting area. The starting areas were grabbed from a MaNGOS database that I converted over to SQLite and wired up with Ecto.

Last for the night was to get logout working.

The implementation was something like:

  • After receiving a CMSG_LOGOUT_REQUEST, use Process.send_after/3 to queue a :login_complete message that would send SMSG_LOGOUT_COMPLETE to the client in 20 seconds
  • Store a reference to that timer in state
  • If a CMSG_LOGOUT_CANCEL is received, cancel the timer and remove it from state
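The steps above can be sketched as a small GenServer. This is not the project's code: the module, the message names, and the `notify` pid (standing in for the client socket) are hypothetical, and the delay is configurable only so the sketch is easy to try out.

```elixir
defmodule LogoutSketch do
  use GenServer

  # `notify` is a hypothetical pid standing in for the client connection.
  def start_link(notify, delay_ms \\ 20_000) do
    GenServer.start_link(__MODULE__, %{notify: notify, delay_ms: delay_ms, logout_timer: nil})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  # Logout request: queue a delayed message to ourselves and keep the timer ref.
  def handle_cast(:logout_request, %{logout_timer: nil} = state) do
    timer = Process.send_after(self(), :logout_complete, state.delay_ms)
    {:noreply, %{state | logout_timer: timer}}
  end

  # A logout is already pending, so ignore the duplicate request.
  def handle_cast(:logout_request, state), do: {:noreply, state}

  # Nothing to cancel.
  def handle_cast(:logout_cancel, %{logout_timer: nil} = state), do: {:noreply, state}

  # Logout cancel: stop the timer and drop it from state.
  def handle_cast(:logout_cancel, state) do
    Process.cancel_timer(state.logout_timer)
    {:noreply, %{state | logout_timer: nil}}
  end

  @impl true
  # A timer message that raced with a cancel is stale, so ignore it.
  def handle_info(:logout_complete, %{logout_timer: nil} = state), do: {:noreply, state}

  def handle_info(:logout_complete, state) do
    # Stand-in for sending SMSG_LOGOUT_COMPLETE to the client.
    send(state.notify, :smsg_logout_complete)
    {:noreply, %{state | logout_timer: nil}}
  end
end
```

Keeping the timer reference in state is what makes the cancel cheap: there is no separate scheduler to coordinate with, just a message that either arrives or doesn't.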

This was the first piece that really took advantage of Elixir’s message passing.

The white chat box was weird, but it was nice being able to log in.

Day 4 - June 5th

First up was reorganizing the code, since my game.ex GenServer was getting too large.

My strategy for that was:

  • Split out related messages into separate files
    • auth.ex, character.ex, ping.ex, etc.
    • wrapped in the __using__ macro
  • Include those back into game.ex with use

It worked, but it messed with line numbers in error messages and made things harder to debug.

After that, I wanted to generate that spawn packet properly rather than hardcoding. The largest piece of this was figuring out the update mask for the update fields.

There are a ton of fields for the different types of objects SMSG_UPDATE_OBJECT handles. Before the raw object fields in the payload, there’s a bit mask with bits set at offsets that correspond to the fields being sent. Without that, the client wouldn’t know what to do with the values.

So, I needed to write a function that would generate this bit mask from the fields I pass in. Luckily it’s all well documented, but it still took a while to get to a working implementation.
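A minimal sketch of such a function, assuming each field is one 32-bit word, the mask is sent as a block count byte followed by 32-bit little-endian blocks, and the values follow in ascending offset order. The module name and the offsets in the usage note are illustrative, not the real field table.

```elixir
defmodule UpdateMaskSketch do
  import Bitwise

  # `fields` maps a field offset to a 32-bit integer value.
  # Multi-word fields (like a guid) are passed as one entry per word.
  def build(fields) when map_size(fields) > 0 do
    offsets = fields |> Map.keys() |> Enum.sort()

    # One mask block covers 32 field offsets.
    block_count = div(List.last(offsets), 32) + 1
    mask_bits = block_count * 32

    # Set the bit at each offset that is present.
    mask_int = Enum.reduce(offsets, 0, fn offset, acc -> acc ||| (1 <<< offset) end)

    # Values are written in ascending offset order, matching the mask.
    values = for offset <- offsets, into: <<>>, do: <<Map.fetch!(fields, offset)::little-32>>

    <<block_count::8, mask_int::little-size(mask_bits), values::binary>>
  end
end
```

With hypothetical offsets 0 and 1 for the two guid words and 22 for health, `UpdateMaskSketch.build(%{0 => 4, 1 => 0, 22 => 100})` produces one mask block with bits 0, 1, and 22 set, followed by the three values.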

Day 5 - June 6th

Referencing MaNGOS, I added some more messages that the server sends to the client after a CMSG_PLAYER_LOGIN. One of these, SMSG_ACCOUNT_DATA_TIMES, fixed the white chat box and keybinds being reset.

I also added SMSG_COMPRESSED_UPDATE_OBJECT, which compresses the SMSG_UPDATE_OBJECT packet with :zlib.compress/1. This was more straightforward than expected, and I made things use the compressed variant if it’s actually smaller. I’m expecting this to have even more benefits when I get to batching object updates, but right now I’m only updating objects one by one.
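The "only if it's actually smaller" check might look something like this. The sketch assumes the compressed variant is prefixed with the uncompressed size as a 32-bit little-endian integer; the module name and return tuples are made up for the example.

```elixir
defmodule CompressSketch do
  # Returns which variant to send, along with its payload.
  def encode(payload) when is_binary(payload) do
    # Assumption: the compressed payload starts with the uncompressed size.
    compressed = <<byte_size(payload)::little-32, :zlib.compress(payload)::binary>>

    if byte_size(compressed) < byte_size(payload) do
      {:compressed, compressed}
    else
      {:plain, payload}
    end
  end
end
```

Small packets usually stay uncompressed, since zlib's own framing plus the size prefix outweigh any savings.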

Movement would come up soon, so I started adding the handlers for those packets.

Day 6 - June 7th

In the update packet, I still had the object guid hardcoded. This is because it wants a packed guid and I needed to write some functions to handle that. Rather than the entire guid, a packed guid is a byte mask followed by all non-zero bytes. The byte mask has bits set that correspond to where the following bytes go in the unpacked guid. This is for optimizing packet size, since a guid is always 8 bytes but a packed guid can be as small as 2 bytes.
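Here is a sketch of packing and unpacking based on that description, assuming the guid bytes are taken in little-endian order. The module and function names are hypothetical.

```elixir
defmodule PackedGuidSketch do
  import Bitwise

  # Packs a 64-bit guid into a mask byte followed by its non-zero bytes.
  def pack(guid) when is_integer(guid) and guid >= 0 do
    {mask, bytes} =
      <<guid::little-64>>
      |> :binary.bin_to_list()
      |> Enum.with_index()
      |> Enum.reduce({0, []}, fn
        # Zero bytes are skipped and leave their mask bit unset.
        {0, _index}, acc -> acc
        {byte, index}, {mask, bytes} -> {mask ||| (1 <<< index), [byte | bytes]}
      end)

    <<mask, :binary.list_to_bin(Enum.reverse(bytes))::binary>>
  end

  # Reads a packed guid off the front of a binary, returning {guid, rest}.
  def unpack(<<mask, rest::binary>>) do
    Enum.reduce(0..7, {0, rest}, fn index, {guid, data} ->
      if (mask &&& (1 <<< index)) != 0 do
        <<byte, tail::binary>> = data
        {guid ||| (byte <<< (index * 8)), tail}
      else
        {guid, data}
      end
    end)
  end
end
```

With this, `PackedGuidSketch.pack(4)` gives `<<1, 4>>`: a mask with only the lowest bit set, then the single non-zero byte.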

This took a while, because the client was crashing when I changed the packed guid from <<1, 4>> to anything else. After trying different things and wasting a lot of time, I realized that the guid was in two places in the packet and they needed to match. A quick fix later and things were working as expected.

Day 7 - June 8th

It was about time to start implementing the actual MMO features, starting with seeing other players. To test, I hardcoded another update packet after the player’s with a different guid, to try and spawn something other than the player.

Then I used a Registry to keep track of logged in players and their spawn packets. After entering the world, I would use Registry.dispatch/3 to:

  • spawn all logged in players for that player
  • spawn that player for all other players
  • both using SMSG_UPDATE_OBJECT

After that, I added a similar dispatch when handling movement packets to broadcast movement to all other players. This is where the choice of Elixir really started to shine, and I quickly had players able to see each other move around the screen.
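A minimal sketch of that kind of broadcast, assuming every player connection process registers itself under one shared key with its guid as the value. The registry name, the key, and the `{:send_packet, packet}` message are hypothetical.

```elixir
defmodule BroadcastSketch do
  # Hypothetical registry name and key.
  @registry BroadcastSketch.Players
  @key :players

  # A :duplicate registry lets many processes register under the same key.
  def start, do: Registry.start_link(keys: :duplicate, name: @registry)

  # Registers the calling process (the player's connection) with its guid.
  def join(guid), do: Registry.register(@registry, @key, guid)

  # Sends a packet to every registered player except the sender.
  def broadcast(from_guid, packet) do
    Registry.dispatch(@registry, @key, fn entries ->
      for {pid, guid} <- entries, guid != from_guid do
        send(pid, {:send_packet, packet})
      end
    end)
  end
end
```

Each connection process would then handle `{:send_packet, packet}` by writing the packet to its own socket, so no process ever touches another player's connection directly.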

I tested this approach with multiple windows open and it was very cool to see everything synchronized.

I added a handler for CMSG_NAME_QUERY to get names to stop showing up as Unknown and also despawned players with SMSG_DESTROY_OBJECT when logging out.

This is where I started noticing a bug: occasionally I wouldn’t be able to decrypt a packet successfully, which would lead to all future attempts for that connection failing too, since there’s a counter as part of the decryption function. I couldn’t figure out how to resolve it yet, though, or how to reproduce it reliably.

Day 8 - June 9th

To get chat working, I handled CMSG_MESSAGECHAT and broadcast SMSG_MESSAGECHAT to players, using Registry.dispatch/3 here too. I only focused on /say here, and it currently goes to all players rather than just nearby ones. Something to fix later.

Related to that weird decryption bug, I handled the case where the server received more than one packet at once. This might’ve helped a bit, but didn’t completely resolve the issue.

Day 9 - June 10th

I still had authentication with a hardcoded username, password, and salt, so it was about time to fix that. Rather than go with PostgreSQL or SQLite for the database, I decided to go with Mnesia, since one of my goals was to learn more about Elixir and its ecosystem. I briefly tried plain :mnesia, but decided to use Memento for a cleaner interface.

So, I added models for Account and Character and refactored everything to use them. The character object is kept in process state and only persisted to the database on logout or disconnect. Saving on a CMSG_PING or just periodically could be a good idea too, eventually. Right now data isn’t persisted to disk, since I’m still iterating on the data model, but that should be straightforward to toggle later.

Day 10 - June 11th

Today was standardizing the logging, handling a bit more of chat, and handling an unencrypted CMSG_PING. I was thinking that could be part of the intermittent decryption issues too, but looking back I don’t think I’ve ever had my client send that unencrypted anyways.

Day 11 - June 12th

I wanted equipment working so players weren’t naked all the time, so I started on that. Using the MaNGOS item_template table, I wired things up to set random equipment on character creation. Then I added that to the response to CMSG_CHAR_ENUM so they would show up in the login screen.

Up next was getting it showing in game.

Day 12 - June 13th

It took a bit to figure out the proper offsets for each piece of equipment in the update mask, but I eventually got it working.

Since equipment is part of the update object packet, it just worked for other players, which was nice.

Day 13 - June 14th

I had player movement synchronizing between players properly so I wanted to get sitting working too.

Whoops. Weird things happen when field offsets or sizes are incorrect when building that update mask.

After that, I wanted to play around a bit by randomizing equipment on every jump. Here I learned that you need to send all fields in the update object packet, like health, or they get reset. I was trying to just send the equipment changes but I’d die on every jump.

After making sure to send all fields, it was working as expected.

Day 14 - June 15th

Took a break.

Day 15 - June 16th

Today was refactoring and improvements. I reworked things into proper modules, since it was getting hard to debug when all the line numbers were wrong. Now game.ex calls the appropriate module’s handle_packet/3 function, rather than combining everything with use.

I also reworked things so players were spawned with their current position instead of the initial position saved in the registry. This included some changes to make building an update packet more straightforward.

Day 16 - June 17th

Today was just playing around and no code changes.

Not sure why the model is messed up here, but it seems like it’s something with my computer rather than anything server related.

Day 17 - June 18th

The world was feeling a bit empty, so I wanted to spawn in mobs. First was hardcoding an update packet that should spawn a mob and having it trigger on /say.

After that, I used the creature table of the MaNGOS database to get proper mobs spawning. I used a GenServer for this so every mob would be a process and keep track of their own state. Communication between mobs and players would happen through message passing. First I hardcoded a few select ids in the starting area to load, and after that worked I loaded them all.

Rather than spawn all ~57k mobs for the player, I wired things up to only spawn mobs within a certain range. This looked like:

  • Store mob pids in a Registry, along with their x, y, z position
  • Create a within_range/2 function that takes in two x, y, z tuples
  • On player login, dispatch on that MobRegistry, using within_range/2 to only build spawn packets for mobs within range
  • On player movement, do the same
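The range check itself can be as simple as comparing squared distances. In this sketch the range value and the `nearby/2` helper are assumptions; the post doesn't say what range was used or how the check was written.

```elixir
defmodule RangeSketch do
  # Hypothetical visibility range in world units.
  @range 250.0

  # True when two {x, y, z} positions are within @range of each other.
  # Comparing squared distances avoids a square root per mob.
  def within_range({x1, y1, z1}, {x2, y2, z2}) do
    dx = x1 - x2
    dy = y1 - y2
    dz = z1 - z2
    dx * dx + dy * dy + dz * dz <= @range * @range
  end

  # Keeps only the {guid, position} entries near the player.
  def nearby(player_position, mobs) do
    Enum.filter(mobs, fn {_guid, position} -> within_range(player_position, position) end)
  end
end
```

In a dispatch callback, the same check would decide which registry entries get a spawn packet built for them.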

It worked really well and I could run around and see the mobs.

Next up was optimization and despawning mobs that were now out of range.

Day 18 - June 19th

For optimization, I didn’t want to send duplicate spawn packets for mobs that were already spawned. I also wanted to despawn mobs that were out of range. To do this, I used ETS to track which guids were spawned for a player.

In the dispatch, the logic was:

  • if in_range and not spawned, spawn
  • if not in_range and spawned, despawn
  • otherwise, ignore
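That decision table maps neatly onto a `cond`. This sketch assumes one ETS table keyed by `{player_guid, mob_guid}`; the table layout and names are hypothetical, and the caller would send the actual spawn or despawn packet based on the return value.

```elixir
defmodule SpawnTrackerSketch do
  # Creates an unnamed ETS table that records which guids each player has spawned.
  def new, do: :ets.new(:spawned_guids, [:set, :public])

  # Returns :spawn, :despawn, or :ignore and updates the table to match.
  def update(table, player_guid, mob_guid, in_range?) do
    key = {player_guid, mob_guid}
    spawned? = :ets.member(table, key)

    cond do
      in_range? and not spawned? ->
        :ets.insert(table, {key})
        :spawn

      not in_range? and spawned? ->
        :ets.delete(table, key)
        :despawn

      true ->
        :ignore
    end
  end
end
```

Calling `update/4` twice in a row with the same arguments returns `:ignore` the second time, which is what prevents duplicate spawn packets.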

Despawning was done through the same SMSG_DESTROY_OBJECT packet used for despawning a player after logging out.

After getting that working, I ran around the world and explored for a bit.

I noticed something wrong when exploring Westfall. Bugs were spawning in the air and then falling down to the ground. Turns out I wasn’t separating mobs by map, so Westfall had mobs from Silithus mixed in. To fix, I reworked both the mob and player registries to use the map as the key.

Having mobs standing in place was a bit boring and I wanted them to move around. Turns out this is pretty complicated and I’ll actually have to use the map files to generate paths that don’t float or clip through the ground. There are a few projects for this, all a bit difficult to include in an Elixir project. I’m thinking RPC could work, but not sure if it’ll be performant enough yet.

The standard update object packet can be used for mob movement here, since it has a movement block, but there might be some more specialized packets to look into later too.

Without using the map data, I couldn’t get the server movement to line up with what happened in the client. So, I settled with getting mobs to spin at random speeds.

That was a bit silly and used a lot of CPU, so I tweaked it to just randomly change orientation instead.

Day 19 - June 20th

Here I got mob names working by implementing CMSG_CREATURE_QUERY. This crashed the client when querying mobs that didn’t have a model, so I removed them from being loaded. I also started loading in mob movement data and optimized the query a bit to speed up startup.

I finally got some people to help me test the networking later that day. It didn’t start very well.

Turns out I hadn’t tested this locally since adding mobs and the player/mob spawn/despawns were conflicting with each other due to guid collisions. Players were being constantly spawned in and out.

I did some emergency patching to make it so players are never despawned, even out of range. I also turned off /say spawning boars since that was getting annoying. That worked for now.

There were still some major issues. My helper had 450 ms latency and would crash when running to areas with a lot of mobs. I couldn’t reproduce, though, with my 60 ms latency.

Day 20 - June 21

To reproduce the issue from the previous night, I connected to my local server from my laptop on the same network. On my laptop, I used tc to simulate a ton of latency and wired things up so equipment would change on any movement instead of just jump. This sent a lot of packets when spinning and I was finally able to reproduce.

Turns out the crashing issues were from not receiving a complete packet, but still trying to decrypt and handle it. I was handling if the server got more than one packet, but not if the server got a partial packet.

Referencing Shadowburn’s implementation, the fix for this was to let the packet data accumulate until there’s enough to handle. This finally resolved the weird decryption issue I ran into on day 7.
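The accumulate-until-complete idea can be sketched like this. It deliberately leaves out the header encryption, which in the real protocol is stateful, so a header must only be decrypted once. It also assumes a 2-byte big-endian size that counts the 4-byte opcode plus the body. The module name and the opcode numbers in the usage note are made up.

```elixir
defmodule PacketBufferSketch do
  # Appends new socket data to the buffer and pulls out every complete packet.
  # Returns {packets, leftover}, where leftover is kept for the next call.
  def feed(buffer, data), do: extract(buffer <> data, [])

  # A full packet is available: take it and keep looking for more.
  defp extract(<<size::big-16, rest::binary>>, acc)
       when size >= 4 and byte_size(rest) >= size do
    body_size = size - 4
    <<opcode::little-32, body::binary-size(body_size), tail::binary>> = rest
    extract(tail, [{opcode, body} | acc])
  end

  # Not enough data yet (or nothing left): stop and keep the remainder.
  defp extract(buffer, acc), do: {Enum.reverse(acc), buffer}
end
```

Feeding the first three bytes of a packet returns no packets and a three-byte leftover; feeding the rest, plus a second packet in the same chunk, returns both packets at once. That covers the partial-packet and multiple-packet cases with the same code path.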

For the guid collision issue, I added a large offset to creature guids so they won’t conflict with player guids.

Day 21 - June 22

Took a break.

Day 22 - June 23

Worked on CMSG_ITEM_NAME_QUERY a bit, but there’s still something wrong here. It could be that it’s trying to calculate damage using some values I’m not passing to the client yet.

Decided spells would be next, so I started on that. First was sending spells over with SMSG_INITIAL_SPELLS on login, using the initial spells in MaNGOS, so I’d have something in the spellbook. Everything was instant cast though, for some reason.

Turns out I needed to set unit_mod_cast_speed in the player update packet for cast times to show up properly in the client.

I started by handling CMSG_CAST_SPELL, which would send a successful SMSG_CAST_RESULT after the spell cast time, so other spells could be cast. I also handled CMSG_CANCEL_CAST, to cancel that timer. This implementation looked a bit like the logout logic.

The starting animation for casting a spell would play, but no cast bar or anything further.

Days 23 to 26 - June 24 to 27

Took a longer break.

Day 27 - June 28

I was able to get a cast bar showing up by sending SMSG_SPELL_START after receiving the cast spell packet.

The projectile effect took a bit longer to figure out. I needed to send a SMSG_SPELL_GO after the cast was complete, with the proper target guids.

Day 28 - June 29

I got self-cast spells working by setting the target guid to the player’s guid.

Day 29 - June 30

Another break.

Day 30 - July 1

Since I had spells somewhat working, next I had to clean up the implementation. I dispatched the SMSG_SPELL_START and SMSG_SPELL_GO packets to nearby players and fixed spell cancelling.

Day 31 - July 2

I added levels to mobs, chosen randomly between their minimum and maximum level instead of the previously hardcoded 1. Then I made spells do some hardcoded damage, so mobs could die. Mobs would still randomly change orientation when dead, so I added a check to only move if alive.

That seemed like a good stopping point and was one month since I started writing code for the project.

Future Plans

I’ll slowly work on this, adding more functionality as I go. My goal isn’t a 1:1 Vanilla server, but more something that fits well with Elixir’s capabilities, so I don’t plan on accepting limitations for the sake of accuracy or similar. I’d like to see how many players this approach can handle and how it compares in performance to other implementations eventually too.

Some things on the list:

  • proper mob + player stats
  • proper damage calculations
  • pvp
  • quests
  • mob movement + combat ai
  • loot + inventory management
  • more spells + effects
  • tons of refactoring
  • benchmarking
  • gameplay loop, in general

So still plenty more work to do. :)

Thanks to all the projects I’ve referenced for this, most of which I’ve tried to link here.

I wouldn’t have gotten very far without them and their awesome documentation.

You Can Do It

I believe in you! Here’s more motivation.

Hacking sales as an introvert

Opinions for Writing Good CSS

Don’t use complex expressions in if conditions

Volodymyr Gubarkov

Stand With Ukraine

August 2024

Let’s consider a piece of code below. It belongs to a notification sub-system of some hypothetical application. The code determines if the notification should be sent to any particular user or not.

if ((((reservationId && notification.reservationId == reservationId)
    || (facilityId && notification.facilityId in facilityId)
    || (hotelIds && hotelIds.contains(notification.hotelId)) && (hotelUser && notification.type.toAllHotelUsers || reservationId && notification.type.toAllReservations))
    || (isAdmin && hotelIds.contains(notification.hotelId))
    && (userId != notification.authorId || notification.authorId == null))) 
{
    send(notification)
}

The code above is absolutely incomprehensible. Let’s make it better:

boolean reservationMatches = reservationId && notification.reservationId == reservationId
boolean facilityMatches = facilityId && notification.facilityId in facilityId
boolean hotelMatches = hotelIds && hotelIds.contains(notification.hotelId)
boolean addressedToAll = hotelUser && notification.type.toAllHotelUsers || reservationId && notification.type.toAllReservations
boolean shouldSendByHotel = hotelMatches && (addressedToAll || isAdmin)
boolean senderIsNotReceiver = userId != notification.authorId || notification.authorId == null
boolean notificationMatchesUser = senderIsNotReceiver && (reservationMatches || facilityMatches || shouldSendByHotel)

if (notificationMatchesUser) {
    send(notification)
}

What did we do? We split the complex expression to sub-expressions by giving them meaningful names.

This way the code is much easier to maintain and reason about.

With a complex if condition, it’s hard to reason about whether the condition is correct (and exhaustive) in the sense of complying with the business requirements.

In the first piece of code above we see that it sends a notification under some conditions. But what those conditions are, and whether they satisfy the business needs, is hard to tell.

In the refactored code we clearly see that the notification is only sent when it matches the user (business requirement). And “matches the user” means sender is not receiver and either the reservation (of user) matches or the facility (of user) matches or sending is favored by the hotel (of user). And so on.

Every time you assign a name to something you have a chance to think if the name describes that “something” correctly. So by just doing this rewrite you can identify the bug.

Additionally, the refactored code is much easier to debug. When the if condition appears to be incorrect, you just put a breakpoint, and you immediately see the actual values of all sub-expressions. Therefore, you easily see which sub-expression gives an incorrect result.

💡 TIP

The rule of thumb would be that ideally you should not have || or && in your if conditions.

It may be OK, though, for trivial cases.

Any of the following is equally good:

if (notificationMatchesUser(notification, reservationId, facilityId, hotelIds, userId)) {
    send(notification)
}

if (notification.matches(reservationId, facilityId, hotelIds, userId)) {
    send(notification)
}

If you noticed a typo or have other feedback, please email me at xonixx@gmail.com

Elasticsearch is Open Source, Again

[D.N.A] Elasticsearch and Kibana can be called Open Source again. It is hard to express how happy this statement makes me. Literally jumping up and down with excitement here. All of us at Elastic are. Open source is in my DNA. It is in Elastic DNA. Being able to call Elasticsearch Open Source again is pure joy.

[LOVE.] The tl;dr is that we will be adding AGPL as another license option next to ELv2 and SSPL in the coming weeks. We never stopped believing and behaving like an open source community after we changed the license. But being able to use the term Open Source, by using AGPL, an OSI approved license, removes any questions, or fud, people might have.

[Not Like Us] We never stopped believing in Open Source at Elastic. I never stopped believing in Open Source. I’m going on 25 years and counting as a true believer. So why the change 3 years ago? We had issues with AWS and the market confusion their offering was causing. So after trying all the other options we could think of, we changed the license, knowing it would result in a fork of Elasticsearch with a different name and a different trajectory. It’s a long story.

[Like That] The good news is that while it was painful, it worked. 3 years later, Amazon is fully invested in their fork, the market confusion has been (mostly) resolved, and our partnership with AWS is stronger than ever. We were even named AWS partner of the year. I had always hoped that enough time would pass that we could feel safe to get back to being an Open Source project - and it finally has.

[All The Stars] We want to make the life of our users as simple as possible. We have people that really like ELv2 (a BSD inspired license). We have people that have SSPL approved (through MongoDB using it). Which is why we are simply adding another option, and not removing anything. If you already use and enjoy Elasticsearch, please carry on, nothing changed. For others, you now have the option to choose AGPL as well.

[LOYALTY.] We chose AGPL, vs another license, because we hope our work with OSI will help to have more options in the Open Source licensing world. And it seems like another OSI approved license will rhyme with SSPL and/or AGPL. Heck, maybe AGPL is enough for infrastructure software like us with how things have progressed since we had to change the license (for example, Grafana who moved to it from Apache2). We are committed to figuring it out.

[euphoria] I am so happy to be able to call Elasticsearch Open Source again.

[Alright] With any change, there can be confusion, and, of course, there can be trolls. (Aren’t there always trolls?) Let’s have some fun and try to answer some of these. Here are some I can imagine, but let’s keep adding to this.

  • “Changing the license was a mistake, and Elastic now backtracks from it”. We removed a lot of market confusion when we changed our license 3 years ago. And because of our actions, a lot has changed. It’s an entirely different landscape now. We aren’t living in the past. We want to build a better future for our users. It’s because we took action then, that we are in a position to take action now.
  • “AGPL is not true open source, license X is”: AGPL is an OSI approved license, and it's a widely adopted one. For example, MongoDB used to be AGPL and Grafana is AGPL. It shows that AGPL doesn’t affect usage or popularity. We chose AGPL because we believe it’s the best way to start to pave a path, with OSI, towards more Open Source in the world, not less.
  • “Elastic changes the license because they are not doing well” - I will start by saying that I am as excited today as ever about the future of Elastic. I am tremendously proud of our products and our team's execution. We shipped Stateless Elasticsearch, ES|QL, and tons of vector database/hybrid search improvements for GenAI use cases. We are leaning heavily into OTel in logging and Observability. And our SIEM product in Security keeps adding amazing features and it's one of the fastest growing in the market. Users' response has been humbling. The stock market will have its ups and downs. What I can assure you, is that we are always thinking long term, and this change is part of it.

If we see more, we will add them above to hopefully reduce confusion.

[HUMBLE.] It’s so exciting to build for the future. Elasticsearch is back to being Open Source. Yay! What a wonderful thing to say. What a wonderful day.

Forever :elasticheart: Open Source
Shay
