
Most Common PIN Codes — Information is Beautiful


Chip and PIN card, phone passcode, hotel safe – how predictable is your chosen PIN?

3.4 million data points visualized from several data breaches.

Created by the late great Nick Berry of Data Genetics (redesigned and used with permission). He wrote a great data story around this which is also worth a read.

Published in our book, Knowledge is Beautiful.


The Most Talented Person In The World


Snappy UI Optimization with useDeferredValue


useDeferredValue is one of the most underrated React hooks. It allows us to dramatically improve the performance of our applications in certain contexts. I recently used it to solve a gnarly performance problem on this blog, and in this tutorial, I'll show you how! ⚡




Bad Weather


I’m worried about the tabbing behaviour, rather than the syntax and name of CSS masonry


Back in 2022 I made this site: Be the browser’s mentor, not its micromanager. There are some key principles on there, presented as a nice little collection of tiles.

[Image: Dark grey tiles with white text arranged in a standard grid in Arc]

The trick during the talk — that I made this site for — was that the grid is actually progressively enhanced with masonry where browsers support it, but no one in the audience would have known that had I not told them. It’s the magic of progressive enhancement: everyone gets a fantastic experience, so they don’t even consider if they are getting the “best” experience. They already are because everything works for them.

Anyway, this is how it looks in Safari Technology Preview. It’s subtle, but it’s a nice enhancement.

[Image: Dark grey tiles with white text arranged in a masonry grid in Safari, with a slightly modified source order]

The way the layout works is there’s a flexible layout composition, aptly named .grid:

.grid {
  display: grid;
  grid-template-columns: repeat(
    var(--grid-placement, auto-fill),
    minmax(var(--grid-min-item-size, 16rem), 1fr)
  );
  gap: var(--gutter, var(--space-s-l));
}

Using a CUBE exception, I added a masonry enhancement:

.grid[data-rows='masonry'] {
  grid-template-rows: masonry;
  align-items: start;
}

The nice thing about this exception is that yes, it slaps the masonry grid-template-rows value in, but it also aligns items to the start, so where masonry isn’t available, items at least only size vertically to fit their content.

The reason I chose this pattern was that I knew there would be no tabbing issues: it’s just headings and paragraphs with one link. I created a little demo to show you the problem with tabbing in the current iteration of masonry, which is available in Firefox and Safari Technology Preview:

See the Pen Tabbing issues with Masonry by Andy Bell (@andy-set-studio) on CodePen.

For those of you without those browsers, here’s what it looks like.

Watch the video.

The tabbing order is wild — especially in Firefox. That’s sorta expected though because masonry layouts pack items into available space to get that stonework-like effect — hence the name masonry.

This is a real problem though, because with one line of CSS you can create a pretty serious accessibility issue. I dunno how it would get fixed, so maybe the best thing is for me to warn you not to use masonry if there are focusable elements in play.

Which opinion on syntax do I have?

WebKit asked for opinions and Google answered, so here’s my opinion. I honestly don’t mind either of their approaches. I was thinking a while ago that masonry feels like a flexbox kind of deal, because a masonry layout is by nature flexible, which to me screams flexbox. I am not smart enough for CSS specs though, so I'll take whatever I'm given as long as it works.

Masonry is already available as a grid value though. What do sites that already use that experimental value do? Sure it’s part of a one-liner, but Google’s suggestion certainly isn’t. It’s a whole layout system in itself, which is a hell of a refactor! On the flip-side, I’m sure we’d rather have agreed standards than potentially half-baked ideas.

To be honest, I think masonry as a design pattern is pretty darn antiquated. I liked the example that Jen used in the WebKit post — using masonry to tidy up one of those mega menus — but in reality, unless you’re building Pinterest or Unsplash-like photographic UIs, you can probably do better without masonry anyway, which raises the question: are there better things for the browsers to be focusing on?


GPUs Go Brrr


AI uses an awful lot of compute.

In the last few years we’ve focused a great deal of our work on making AI use less compute (e.g. Based, Monarch Mixer, H3, Hyena, S4, among others) and run more efficiently on the compute that we have (e.g. FlashAttention, FlashAttention-2, FlashFFTConv). Lately, reflecting on this work has prompted us to take a step back and ask two questions:

  • What does the hardware actually want?
  • And how can we give that to it?

This post is a mixture of practice and philosophy. On the practical side, we’re going to talk about what we’ve learned about making GPUs go brr -- and release an embedded DSL, ThunderKittens, that we’ve built to help us write some particularly speedy kernels (which we are also releasing). On the philosophical side, we’ll briefly talk about how what we’ve learned has changed the way we think about AI compute.

What's in an H100?

For this post, we’re going to focus on the NVIDIA H100 for two reasons. First, it represents an awful lot of new compute going online. Second, we think the trends it implies are going to continue in future generations, and probably from other manufacturers, too. But bear in mind (and we will repeat in case you forget) that most of this post applies in some form to other GPUs, too.

Advance apologies for restating the data sheet, but the details of the hardware are important for the discussion to come. An H100 SXM GPU contains, for our purposes:

  • 80 GB of HBM3 with 3 TB/s of bandwidth. (A bit less bandwidth in practice.)
  • 50 MB of L2 cache with 12 TB/s of bandwidth, split across the GPU into two 25MB sections connected by a crossbar. (The crossbar sucks.)
  • 132 streaming multiprocessors (SMs), where each has:
    • up to 227 KB of shared memory within a 256 KB L1 cache. (Together, these have about 33 TB/s of bandwidth.)
    • a tensor memory accelerator (TMA) -- a new chunk of hardware in Hopper that can do asynchronous address generation and fetch memory. It also does other things like facilitate the on-chip memory network (distributed shared memory) but we’re not going to focus on this much, today.
    • 4 quadrants, where each quadrant has:
      • A warp scheduler
      • 512 vector registers (each containing 32 4-byte words)
      • A tensor core for matrix multiplies
      • A bunch of built-in instructions, like sums and multiplies, that operate in parallel on these vector registers.

There’s a lot of other stuff, too (memory controllers, instruction caches, etc.) but we don’t care about any of that right now.

All of the compute happens in the SMs. Most of it happens in the registers.

Great, how do I make it go brr?

Keep the tensor core fed. That’s it.

Wait, really?

Yes. That’s the game.

An H100 GPU has 989 TFLOPs of half-precision matrix multiply compute, and ~60 TFLOPs of “everything else”. So, every cycle the tensor core is in use, you’re getting at least 94% utilization of the hardware. And every cycle the tensor core is not in use, you’re getting no more than 6% utilization of the hardware. Put another way:

% utilization of the H100 = % of cycles the tensor cores are active, +/- 6%.
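Spelling out the arithmetic behind those two numbers: 989 / (989 + 60) ≈ 94% and 60 / (989 + 60) ≈ 6%, so the fraction of cycles in which the tensor cores are busy pins down overall utilization to within about six percentage points either way.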

Now it turns out that keeping the tensor core fed is easier said than done. We’ve discovered a number of quirks to the hardware that are important to keeping the matrix multiplies rolling. Much of this also applies to non-H100 GPUs, but the H100 is particularly tricky to keep fed so we focus on it here. (The RTX 4090, by comparison, is very easy to work with as illustrated in figure 2.)

  • WGMMA instructions are necessary but also really irritating to use.
  • Shared memory is not actually that fast and also requires great care.
  • Address generation is expensive.
  • Occupancy remains helpful, and registers are generally the key resource.

Let’s go through each of these in order.

WGMMA Instructions

The H100 has a new set of instructions called “warp group matrix multiply accumulate” (wgmma.mma_async in PTX, or HGMMA/IGMMA/QGMMA/BGMMA in SASS). To understand what makes them special, we need to look briefly at how you used to have to use tensor cores. The tensor core instructions available on previous GPUs were wmma.mma.sync and mma.sync instructions. With these instructions a warp of 32 threads on a single quadrant of an SM would synchronously feed their chunk of the data into the tensor core and await the result. Only then could they move on.
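For reference, here is roughly what that older, fully synchronous pattern looks like through CUDA’s public WMMA intrinsics (this snippet is ours, added for illustration, not code from the post): the whole warp loads its fragments, issues the multiply, and can do nothing else until the result is back in its registers.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile, synchronously at every step.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

  wmma::fill_fragment(c, 0.0f);
  wmma::load_matrix_sync(a, A, 16);   // warp-wide and blocking
  wmma::load_matrix_sync(b, B, 16);
  wmma::mma_sync(c, a, b, c);         // feed the tensor core, then wait for the result
  wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}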

Not so with wgmma.mma_async instructions. Here, 128 consecutive threads -- split across all quadrants of the SM -- collaboratively synchronize, and asynchronously launch a matrix multiply directly from shared memory (and optionally also registers.) These warps can then go do other things with their registers while the matrix multiply happens, and await the result whenever they want.

In our microbenchmarks, we found that these instructions are necessary to extract the full compute of the H100. Without them, the GPU seems to top out around 63% of its peak utilization; we suspect this is because the tensor cores want a deep hardware pipeline to keep them fed, even from local resources.

Unfortunately, the memory layouts for these instructions are quite complicated. The unswizzled shared memory layouts suffer from very poor coalescing, and so they require substantial additional bandwidth from L2. The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out. They’re also brittle, in that they appear to only work for specific matrix shapes and do not play well with other parts of the wgmma.mma_async instructions. For example, the hardware can transpose sub-matrices on its way to the tensor cores -- but only if the layout is not swizzled.

We’ve also found that unswizzled wgmma layouts suffer from both poor memory coalescing and bank conflicts. On kernels such as flash attention, TMA and the L2 cache are both fast enough to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided, and then controlling layouts very carefully becomes critical.

Despite these pains, these instructions really are necessary to make full use of the H100. Without them, you’ve already lost 37% of the potential performance of the GPU!

Shared memory

Shared memory appears to have a single-access latency of around 30 cycles (this matches our observations, too). That doesn’t sound like much, but in that time the SM’s tensor cores could have done almost two full 32x32 square matrix multiplies.
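To see where “almost two” comes from (the ~1.8 GHz clock here is our assumption; the other numbers are from above): 989 TFLOPs spread over 132 SMs at roughly 1.8 GHz is about 4,096 FLOPs per SM per cycle; a 32x32x32 matrix multiply costs 2 x 32^3 = 65,536 FLOPs; and 30 cycles x 4,096 FLOPs ≈ 123,000 FLOPs, or about 1.9 such multiplies.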

In previous work (like Flash Attention), we’ve focused more on the HBM-SRAM bottleneck. And indeed: this really used to be the bottleneck! But as HBM has gotten faster and the tensor cores continue to grow out of proportion with the rest of the chip, even relatively small latencies like those from shared memory have also become important to either remove or hide.

Shared memory can be tricky to work with because it is “banked” into 32 separate stores of memory. If one is not careful, this can lead to something called “bank conflicts”, where the same memory bank is being asked to simultaneously provide multiple different pieces of memory. This leads to requests being serialized, and in our experience this can disproportionately slow down a kernel -- and the register layouts required by wgmma and mma instructions would naively suffer from these bank conflicts. The solution is to rearrange shared memory with various “swizzling” patterns so as to avoid these conflicts, but it is an important detail to get right.
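As a minimal, self-contained illustration of the swizzling idea (generic CUDA, ours, and much simpler than the layouts wgmma actually requires): in a shared-memory transpose, XOR-ing the column index with the row index turns what would otherwise be a 32-way bank conflict into conflict-free accesses.

// Launch with a 32x32 thread block. The naive version would read tile[x][y],
// where y is constant within a warp, hitting one bank 32 times and serializing.
__global__ void transpose32_swizzled(const float *in, float *out) {
  __shared__ float tile[32][32];
  int x = threadIdx.x, y = threadIdx.y;

  // Swizzled store: element (y, x) lands in column x ^ y of row y.
  tile[y][x ^ y] = in[y * 32 + x];
  __syncthreads();

  // Swizzled read of the transposed element (x, y): undo the same XOR.
  // As x varies across the warp, the addresses now cover 32 distinct banks.
  out[y * 32 + x] = tile[x][y ^ x];
}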

More generally, we have found it very valuable to avoid movement between registers and shared memory when possible, and otherwise to use the built-in hardware (wgmma and TMA instructions) to do data movement asynchronously when possible. Synchronous movement using the actual warps is a worst-case fallback with the greatest generality.

Address Generation

One interesting quirk of the H100 is that the tensor cores and memory are both fast enough that merely producing the memory addresses to fetch takes a substantial fraction of the resources of the chip. (This is even more the case when complicated interleaved or swizzling patterns are added in.)

NVIDIA appears to understand this, as they have bestowed on us the Tensor Memory Accelerator (or TMA, as it likes to be called). TMA allows you to specify a multi-dimensional tensor layout in global and shared memory, tell it to asynchronously fetch a subtile of that tensor, and trip a barrier when it’s done. This saves all of the address generation costs, and additionally makes it much easier to construct pipelines.

We have found TMA to be, like wgmma.mma_async, completely indispensable in achieving the full potential of the H100. (Probably more so than wgmma, in our experience.) It saves register resources and instruction dispatches, and also has useful features such as the ability to perform reductions onto global memory asynchronously -- this is particularly useful in complex backwards kernels. As with wgmma, the main quirk of it is that its swizzling modes are a bit difficult to decipher without some reverse engineering, but we had substantially less pain on this point.

Occupancy

For those newer to CUDA, occupancy refers to the number of co-scheduled threads on the exact same execution hardware. Each cycle, the warp scheduler on that quadrant of the SM will try to issue an instruction to a warp of threads that are ready for an instruction. NVIDIA uses this model because it can enable the hardware to be more easily kept full. For example, while one warp of threads is waiting for a matrix multiply, another can receive an instruction to use the fast exponential hardware.
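As a rough, illustrative calculation (the hardware limits here are from NVIDIA’s published H100 specs, not from this post): an SM has 65,536 32-bit registers, so a kernel whose threads each need 128 registers can keep at most 65,536 / 128 = 512 threads (16 warps) resident out of a possible 2,048 -- about 25% occupancy, or only 4 warps per quadrant’s scheduler to switch between.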

In some ways, the H100 is less reliant on occupancy than previous generations of the hardware. The asynchronous features of the chip mean that even a single instruction stream can keep many parts of the hardware busy -- fetching memory, running matrix multiplies, doing shared memory reductions, and still simultaneously running math on the registers.

But occupancy is very good at hiding both sins and sync’s. A perfectly designed pipeline might run reasonably fast even without any additional occupancy, but our observations suggest that NVIDIA really has designed their GPUs with occupancy in mind. And there are enough synchronizations -- and enough ways to make mistakes -- that finding ways to increase occupancy has, in our experience, usually yielded good returns at increasing the realized utilization of the hardware.

Finally, while occupancy is merely useful on the H100, we have found it to be more important on the A100, and more important still on the RTX 4090, likely because those GPUs rely more heavily on synchronous instruction dispatches than the H100 does.

ThunderKittens

Based on the above, we asked ourselves how we might make it easier to write the kinds of kernels we care about while still extracting the full capabilities of the hardware. Motivated by a continuing proliferation of new architectures within the lab (and the fact that Flash Attention is like 1200 lines of code), we ended up designing a DSL embedded within CUDA -- at first for our own internal use.

But then we decided it was useful enough that, with love in our hearts, we cleaned it up and have released it for you. ThunderKittens is that embedded DSL. It is named ThunderKittens because we think kittens are cute, and also we think it is funny to make you type kittens:: in your code.

It is meant to be as simple as possible, and contains four templated types:

  • Register tiles -- 2D tensors on the register file.
  • Register vectors -- 1D tensors on the register file.
  • Shared tiles -- 2D tensors in shared memory.
  • Shared vectors -- 1D tensors in shared memory.

Tiles are parameterized by a height, width, and layout. Register vectors are parameterized by a length and a layout, and shared vectors just by a length. (They don’t generally suffer from bank conflicts.)

We also give operations to manipulate them, either at the warp level or at the level of a collaborative group of warps. Examples include:

  • Initializers -- zero out a shared vector, for example.
  • Unary ops, like exp
  • Binary ops, like mul
  • Row / column ops, like a row_sum

Since ThunderKittens is embedded within CUDA (in contrast to libraries like Triton, which we also love very much and rely on heavily), the abstractions fail gracefully. If it’s missing something, just extend it to do what you want!
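To make the tile-and-vector idea concrete, here is a tiny toy in plain CUDA (ours, and emphatically not the ThunderKittens API): a 16x16 fp32 tile that lives in the register file of one warp, with an elementwise exp and a row_sum over it, in the spirit of the primitives listed above.

#include <math.h>

// Toy only -- not ThunderKittens. A 16x16 fp32 "register tile" owned by one warp:
// each of the 32 lanes holds half a row (8 floats) in its own registers.
struct ToyRegTile16x16 {
  float data[8];  // this lane's slice: row = lane / 2, cols = (lane % 2) * 8 .. +7
};

__device__ void tile_exp(ToyRegTile16x16 &t) {
  // Elementwise unary op: each lane only touches its own slice.
  for (int i = 0; i < 8; ++i) t.data[i] = expf(t.data[i]);
}

__device__ float tile_row_sum(const ToyRegTile16x16 &t) {
  // Row op: reduce the local 8 elements, then combine with the other lane
  // that shares this row (lane ^ 1) via a warp shuffle.
  float s = 0.0f;
  for (int i = 0; i < 8; ++i) s += t.data[i];
  s += __shfl_xor_sync(0xffffffffu, s, 1);
  return s;  // both lanes of the row now hold the full row sum
}

__global__ void demo(const float *in, float *row_sums) {
  int lane = threadIdx.x;  // launch as <<<1, 32>>>
  int row = lane / 2, col0 = (lane % 2) * 8;

  ToyRegTile16x16 tile;
  for (int i = 0; i < 8; ++i) tile.data[i] = in[row * 16 + col0 + i];

  tile_exp(tile);
  float rs = tile_row_sum(tile);
  if (lane % 2 == 0) row_sums[row] = rs;  // one writer per row
}

The real library of course backs its tiles with the tensor-core-friendly layouts discussed earlier; the point of the toy is only that treating a warp-owned tile as a first-class type keeps indexing and reductions out of the kernel logic.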

To show an example of these primitives in action, consider Tri’s lovely flash attention -- a beautiful algorithm, but complicated to implement in practice, even on top of NVIDIA’s wonderful Cutlass library.

Here's a simple forward flash attention kernel for an RTX 4090, written in ThunderKittens.

Altogether, this is about 60 lines of CUDA sitting at 75% hardware utilization -- and while it is fairly dense, most of the complexity is in the algorithm, rather than in swizzling patterns or register layouts. And what of all of the complexity of TMA, WGMMA, swizzling modes, and descriptors? Here’s a FlashAttention-2 forward pass for the H100, written with ThunderKittens.

So how does it do?

This kernel is just 100 lines, and it actually outperforms FlashAttention-2 on the H100 by about 30%. ThunderKittens takes care of wrapping up the layouts and instructions, and gives you a mini-pytorch to play with on the GPU.

We are also releasing kernels for Based linear attention and other forthcoming architectures. Our Based linear attention kernel runs at 215 TFLOPs (or more than 300 TFLOPs when the recompute inherent in the algorithm is considered). And while linear attention is of course theoretically more efficient, linear attention kernels have historically been dramatically less efficient on real hardware. So we feel this could open up a broad range of high-throughput applications -- more to come on this point later.

If this seems up your alley, feel free to play with it!

Tiles Seem Like a Good Idea

In our view, what has made ThunderKittens work well for us is that it does not try to do everything. CUDA is indeed far more expressive than ThunderKittens. ThunderKittens is small and dumb and simple.

But ThunderKittens has good abstractions -- small tiles -- that match where both AI and hardware are going. ThunderKittens doesn’t support any dimension less than 16. But in our view, this doesn’t really matter, since the hardware doesn’t particularly want to go smaller, either. And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?

From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this -- after all this time, it’s still just matrix multiplies, reductions, and reshapes. And we think the hardware wants this, too -- small matrix multiplies are just begging for hardware support beyond just the systolic mma.

In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big as can fit onto an SM. How dense should the compute be? No less so than what the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match.

Tiles Seem Pretty General

Coming soon -- ThunderKittens on AMD hardware!
