TurboQuant: Redefining AI efficiency with extreme compression

(research.google)

121 points | by ray__ 3 hours ago

7 comments

  • amitport 44 minutes ago
    This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
  • benob 1 hour ago
    This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.
    • BenoitP 29 minutes ago
      It is AI generated. Or was written by someone a bit far from the technical advances IMHO. The Johnson-Lindenstrauss Lemma is a very specific and powerful concept, when in the article the QLJ explanation is vacuous. A knowledgeable human would not have left the reader wanting for how that relates to the Lemma.
    • spencerflem 1 hour ago
      I think it is though-

      “ TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.”

      • integralid 22 minutes ago
        I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed.
      • benob 1 hour ago
        Maybe they quantized a bit too much the model parameters...
  • bluequbit 1 hour ago
    I did not understand what polarQuant is.

    Is is something like pattern based compression where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

    • Maxious 1 hour ago
      • spencerflem 1 hour ago
        I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle aren’t all the center grid cords unused?
        • vincnetas 52 minutes ago
          i think grid can be a surface of the unit sphere
    • mrugge 1 hour ago
      1. Efficient recursive transform of kv embeddings into polar coordinates 2. Quantize resulting angles without the need for explicit normalization. This saves memory via key insight: angles follow a distribution and have analytical form.
      • quotemstr 1 hour ago
        Reminds me vaguely of Burrows-Wheeler transformations in bzip2.
  • moktonar 59 minutes ago
    Aren’t polar coordinates still n-1 + 1 for radius for n-dim vector? If so I understand that angles can be quantized better but when radius r is big the error is large for highly quantized angles right? What am I missing?
    • amitport 57 minutes ago
      r is a single value per vector. You don't have to quantize it, you can keep it and quantize the billion+ other coordinates of the vector.
      • mungoman2 19 minutes ago
        What they're saying is that the error for a vector increases with r, which is true.

        Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means larger absolute error in the reconstructed vector.

        • amitport 5 minutes ago
          The important part is that th normalized error does not increase with the dimension of the vector (which does happen when using biased quantizers)

          Big vectors have proportionally bigger error is expected, nothing can be done by the quantizer about that (this is proven for optimal vector quantization)

  • maurelius2 45 minutes ago
    I'm somewhat at a loss here other than understanding the fundamentals. Can someone tell me how the compression impact performance?
    • valine 4 minutes ago
      So let’s start with a really simple decoder transformer with a single layer and single attention head, and train it to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scaler attention score out). That scaler attention score first goes through softmax, and then becomes the weight you use to compute a weighted average of your values, new value goes through the mlp, mlp output is projected into the logits from which you sample your next token (that’s the general idea at least skipped a few steps).

      The last query in the sequence will be new for every new token you predict, but the set of prior keys and values stay the same, ie keys and values are reusable. The key value cache gets bigger and bigger for each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in vram, and you’d like to keep the size down by not storing the raw uncompressed tensors. To make this work well your compression needs two things: it needs to be fast so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other, either the speed to decompress is too slow and your token/s takes a hit, or you lose important precision and the model output quality suffers. The claim in the paper is that they’ve made progress on both.

    • dryarzeg 14 minutes ago
      If in short, for many inference tasks the bottleneck is memory bandwidth. Suppose you have a machine with a memory bandwidth of 256 GB/s, and let's say you want to do inference for 4B model (model with 4 billion parameters). If you will load the model in BF16 format (16 bits), each forward pass (i.e. each token generated) will require roughly ~8 GB of memory bandwidth. So, 256/8 = 32 t/s, and that's the generation speed you will be strictly capped at even if your processing power is measured in exaFLOPS. But let's say now that you have decided to instead quantize the model and then run the quantized version. Suppose you have made a Q4_K_M version (4 bits + some weights will take more). Now each of your forward passes will take roughly 2-3 GB (rough approximations, reality is different) of memory bandwith (actually, it will be around 2 GB), and even in the worst case 256/3 = 85.3, while 256/2 = 128 t/s. Quants can reduce quality of the model and lower it's performance, but in most modern quantization methods those losses are usually negligible (although, of course, they're still present). So, as you can see, it can be concluded that quantization "widens" (it's not removing it fully) memory bottleneck while still preserving (not always though) acceptable quality.

      (Sorry for my terrible English, it's not my native language)

  • mohsen1 30 minutes ago
    [dead]
  • hikaru_ai 1 hour ago
    [dead]