ademeure 12 hours ago [-]
This is very cool!
I've been working on something somewhat similar over the last few weeks, but trying to be much more general and arguably over-engineered! I like the scope of this project, keeping it limited to Triton and specific kinds of kernels makes it quite simple and efficient.
I'm confused by the progress graph though; it looks like it's benchmarking a 4096x4096x4096 fp16 matmul rather than a full repo, and it claims a 1.31x improvement vs cuBLAS... while running at 187 TFLOPS which is 18.9% of peak utilization? cuBLAS definitely gets much closer to peak than that - most likely it's limited by CPU overhead or something else? Benchmarking is hard!
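For reference, the utilization arithmetic can be sanity-checked with a few lines. This is a rough sketch of how those percentages relate; the reading of the implied peak back out of "18.9% of peak" is my own, and the resulting ~989 TFLOPS happens to be in the range of fp16 dense tensor-core peak on H100-class hardware (an assumption, since the thread doesn't name the GPU):

```python
# Back-of-the-envelope check of the quoted benchmark numbers.
M = N = K = 4096
flops = 2 * M * N * K                  # one multiply-add counts as 2 FLOPs
reported_tflops = 187.0
peak_tflops = reported_tflops / 0.189  # peak implied by "18.9% of peak"
print(f"work per GEMM: {flops / 1e9:.1f} GFLOP")   # ~137.4 GFLOP
print(f"implied peak:  {peak_tflops:.0f} TFLOPS")  # ~989 TFLOPS
```

At ~137 GFLOP per GEMM and 187 TFLOPS, each launch takes well under a millisecond, which is why CPU-side launch overhead is a plausible culprit in a naive timing loop.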
Either way I'm excited to see other people working on this, I think it's an extremely promising area over the next 6 months.
veselin 13 hours ago [-]
I guess we'd see a lot more benefit if we could get this to work on something like llama.cpp - it really has a lot of kernels for different quantizations, a lot of home users, and high hardware diversity, so it's a likely place for the highest bang for the buck.
I guess they could be contributors there.
LuxBennu 10 hours ago [-]
This is the right call. llama.cpp has dozens of hand-tuned CUDA kernels across Q4_K_M, Q5_K_S, Q8_0 and other quant formats, each targeting different hardware profiles. An autoresearch approach that could optimize these per-GPU would be huge — right now performance varies wildly between, say, an RTX 3090 and a 5070 Ti on the same quant format because the kernels are tuned for specific architectures. The hardware diversity in the llama.cpp user base is exactly where automated kernel search has the most to gain.
Jhsto 8 hours ago [-]
If I'd like to benchmark a new language / compiler backend for LLM inference, what would be some good projects to try? If I started from tinygpt, what would make sense as the next step?
Something seems off.
For the 4kx4kx4k fp16 GEMM, cutlass is like 3x faster than this.
sspehr 14 hours ago [-]
Have you benchmarked this against autoscheduling, like with TVM's Ansor?
NitpickLawyer 14 hours ago [-]
... and so it begins.
For a bit of context, goog already did something like this two generations of models ago, as announced in this blog post[1] from May '25:
> AlphaEvolve is accelerating AI performance and research velocity. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.
We are now seeing the same thing "at home", for any model. And with how RL heavy the new training runs have become, inference speedups will directly translate in faster training as well.
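The quoted numbers also imply how big a slice of training time that kernel was. A quick Amdahl's-law check (my assumption: "sped up by 23%" means the kernel's runtime dropped by 23%; if it instead means 1.23x throughput, the fraction comes out slightly higher):

```python
# Amdahl's-law sanity check on the AlphaEvolve blog-post numbers.
kernel_time_saved = 0.23  # fraction of the kernel's own runtime saved
overall_saved = 0.01      # reported reduction in total training time
kernel_fraction = overall_saved / kernel_time_saved
print(f"kernel is ~{kernel_fraction:.1%} of total training time")  # ~4.3%
```

So a single kernel at roughly 4-5% of total training time was worth optimizing end-to-end, which is the economics driving this whole area.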
[1] - https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...