KernelBench-Verified Leaderboard (FP32)

Best-of-5 across all samples  ·  Baseline: PyTorch reference on H200  ·  All models evaluated in May 2026

Yunxiang Zhang¹, Ping Yu², Jianyu Wang¹, Max (Xiangjun) Fan¹, Julian Reed³, Azalia Mirhoseini³, Will Su¹

¹Meta  ·  ²FAIR at Meta SuperIntelligence Lab  ·  ³Stanford University

Correspondence: Yunxiang Zhang and Will Su at yunxiangzhang@meta.com, willsu@meta.com

Speedup and Memory Efficiency Metrics

Correct Speedup ↑ = geomean of baseline/kernel runtime over correct problems only
Correctness % ↑ = fraction of problems solved correctly (best@k)
Fast@1 ↑ = fraction of problems where the best correct speedup > 1× (default sort column)
Correct Mem Eff ↑ = geomean of baseline/kernel memory over correct problems only (higher = uses less memory)
Mem Efficient % ↑ = fraction of problems where the best correct kernel uses less memory than the baseline
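The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual evaluation code: the `ProblemResult` record and the `summarize` helper are hypothetical names, and the sketch assumes that Fast@1 and Mem Efficient % are divided by the total number of problems (not only the correct ones) and that every ratio is strictly positive.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Sequence


@dataclass
class ProblemResult:
    """Hypothetical per-problem record for the best correct sample.

    Both fields are None when no sample for the problem was correct.
    """
    speedup: Optional[float]  # baseline runtime / kernel runtime
    mem_eff: Optional[float]  # baseline memory / kernel memory


def geomean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values (NaN for an empty input)."""
    vals = list(values)
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def summarize(results: Sequence[ProblemResult]) -> Dict[str, float]:
    """Aggregate one model's results on one level into the five columns."""
    n = len(results)
    correct = [r for r in results if r.speedup is not None]
    return {
        # Geomeans are taken over correct problems only.
        "correct_speedup": geomean(r.speedup for r in correct),
        "correct_mem_eff": geomean(r.mem_eff for r in correct),
        # Fractions use all problems as the denominator (an assumption).
        "correctness": len(correct) / n,
        "fast_at_1": sum(r.speedup > 1.0 for r in correct) / n,
        "mem_efficient": sum(r.mem_eff > 1.0 for r in correct) / n,
    }


if __name__ == "__main__":
    demo = [
        ProblemResult(2.0, 1.0),
        ProblemResult(0.5, 2.0),
        ProblemResult(None, None),  # no correct sample
        ProblemResult(4.0, 0.5),
    ]
    print(summarize(demo))
```

Because the geomean is restricted to correct problems, a model can post a high Correct Speedup while solving few problems, which is why the tables report Correctness % and Fast@1 alongside it.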

Level 1

Column groups: Speedup (baseline/kernel) = Correct Speedup, Correctness %, Fast@1  ·  Memory Efficiency (baseline/kernel ↑) = Correct Mem Eff, Mem Efficient %

Model | Correct Speedup ↑ | Correctness % ↑ | Fast@1 ↑ | Correct Mem Eff ↑ | Mem Efficient % ↑
GPT-5.5 (medium) | 0.90 | 99.0% | 52.0% | 1.24 | 82.0%
Gemini 3.1 Pro (high) | 0.84 | 83.0% | 47.0% | 1.13 | 67.0%
Claude Sonnet 4.6 (high) | 0.58 | 86.0% | 36.0% | 1.16 | 70.0%
Claude Opus 4.7 (high) | 0.46 | 95.0% | 35.0% | 1.25 | 79.0%
Claude Opus 4.8 (high) | 0.54 | 77.0% | 35.0% | 1.17 | 63.0%
Gemini 3 Flash (high) | 0.28 | 97.0% | 35.0% | 1.29 | 83.0%
Kimi K2.6 | 0.32 | 96.0% | 34.0% | 1.28 | 81.0%

Level 2

Column groups: Speedup (baseline/kernel) = Correct Speedup, Correctness %, Fast@1  ·  Memory Efficiency (baseline/kernel ↑) = Correct Mem Eff, Mem Efficient %

Model | Correct Speedup ↑ | Correctness % ↑ | Fast@1 ↑ | Correct Mem Eff ↑ | Mem Efficient % ↑
GPT-5.5 (medium) | 0.85 | 99.0% | 64.9% | 1.14 | 63.9%
Gemini 3.1 Pro (high) | 0.73 | 96.9% | 62.9% | 0.96 | 63.9%
Claude Opus 4.8 (high) | 0.62 | 95.9% | 58.8% | 0.99 | 60.8%
Gemini 3 Flash (high) | 0.65 | 99.0% | 58.8% | 0.96 | 59.8%
Kimi K2.6 | 0.62 | 96.9% | 54.6% | 0.94 | 61.9%
Claude Opus 4.7 (high) | 0.60 | 94.8% | 52.6% | 1.03 | 61.9%
Claude Sonnet 4.6 (high) | 0.57 | 94.8% | 48.5% | 0.91 | 55.7%

Level 3

Column groups: Speedup (baseline/kernel) = Correct Speedup, Correctness %, Fast@1  ·  Memory Efficiency (baseline/kernel ↑) = Correct Mem Eff, Mem Efficient %

Model | Correct Speedup ↑ | Correctness % ↑ | Fast@1 ↑ | Correct Mem Eff ↑ | Mem Efficient % ↑
Gemini 3.1 Pro (high) | 0.86 | 94.0% | 38.0% | 0.80 | 26.0%
GPT-5.5 (medium) | 0.92 | 96.0% | 32.0% | 0.88 | 36.0%
Claude Sonnet 4.6 (high) | 0.77 | 88.0% | 30.0% | 0.74 | 20.0%
Claude Opus 4.7 (high) | 0.80 | 92.0% | 28.0% | 0.81 | 24.0%
Gemini 3 Flash (high) | 0.76 | 90.0% | 26.0% | 0.79 | 30.0%
Claude Opus 4.8 (high) | 0.63 | 92.0% | 24.0% | 0.78 | 28.0%
Kimi K2.6 | 0.59 | 78.0% | 24.0% | 0.79 | 24.0%

Memory–Speedup Tradeoff (per level)

Each dot = one model. X = Correct Speedup: geomean of (baseline / kernel runtime) over correct problems only (higher = faster). Y = Memory Efficiency: geomean of (baseline mem / kernel mem) over correct problems only (higher = the model's kernels use less GPU memory than the baseline). The upper-right corner is best: fast and memory-efficient. Dashed lines mark the 1× reference (no change vs. baseline).

Legend: Claude Opus 4.7 (high), Claude Opus 4.8 (high), Claude Sonnet 4.6 (high), Gemini 3 Flash (high), Gemini 3.1 Pro (high), GPT-5.5 (medium), Kimi K2.6

Level 1

[Scatter plot, one dot per model. X axis: Correct Speedup (×, higher = better). Y axis: Memory Efficiency (baseline/kernel mem, higher = better). The plotted values are the Level 1 Correct Speedup and Correct Mem Eff figures reported in the Level 1 leaderboard table.]

Level 2

[Scatter plot, one dot per model. X axis: Correct Speedup (×, higher = better). Y axis: Memory Efficiency (baseline/kernel mem, higher = better). The plotted values are the Level 2 Correct Speedup and Correct Mem Eff figures reported in the Level 2 leaderboard table.]

Level 3

[Scatter plot, one dot per model. X axis: Correct Speedup (×, higher = better). Y axis: Memory Efficiency (baseline/kernel mem, higher = better). The plotted values are the Level 3 Correct Speedup and Correct Mem Eff figures reported in the Level 3 leaderboard table.]

Memory–Speedup Tradeoff (per model)

Each dot = one problem (correct-only). X = best correct speedup (best@k). Y = memory efficiency (baseline mem / kernel mem) of that same fastest-correct sample. Higher is better on both axes. Dashed lines mark the 1× reference.
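The per-problem point described above (the fastest correct sample, with memory efficiency read from that same sample) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark's harness: the `Sample` record, its field names, and `best_correct_point` are hypothetical, and runtime and memory are assumed to be already-measured positive numbers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Sample:
    """Hypothetical record for one generated kernel among the k samples."""
    correct: bool
    runtime_ms: float
    peak_mem_bytes: float


def best_correct_point(
    samples: Sequence[Sample],
    baseline_runtime_ms: float,
    baseline_mem_bytes: float,
) -> Optional[Tuple[float, float]]:
    """Return (speedup, mem_eff) for the fastest correct sample, or None.

    Memory efficiency is taken from the same sample that wins on runtime,
    not from the most memory-efficient sample.
    """
    correct = [s for s in samples if s.correct]
    if not correct:
        return None  # problem contributes no dot to the plot
    fastest = min(correct, key=lambda s: s.runtime_ms)
    speedup = baseline_runtime_ms / fastest.runtime_ms
    mem_eff = baseline_mem_bytes / fastest.peak_mem_bytes
    return speedup, mem_eff


if __name__ == "__main__":
    samples = [
        Sample(correct=True, runtime_ms=2.0, peak_mem_bytes=100.0),
        Sample(correct=False, runtime_ms=0.5, peak_mem_bytes=50.0),  # ignored
        Sample(correct=True, runtime_ms=1.0, peak_mem_bytes=400.0),
    ]
    print(best_correct_point(samples, 2.0, 200.0))
```

Selecting on runtime first means a point can sit to the right of the 1× speedup line and still fall below the 1× memory line, as several entries in the per-model listings do.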

Legend: Level 1, Level 2, Level 3

Claude Opus 4.7 (high)

[Scatter plot, one dot per problem. X axis: Speedup (×). Y axis: Mem Efficiency (×). Per-problem values follow.]

Problem | Speedup | Mem eff
L1 #1: Square matrix multiplication | 0.19× | 0.78×
L1 #2: Standard matrix multiplication | 0.22× | 0.73×
L1 #3: Batched matrix multiplication | 0.09× | 1.00×
L1 #4: Matrix vector multiplication | 1.21× | 1.00×
L1 #5: Matrix scalar multiplication | 0.99× | 1.00×
L1 #6: Matmul with large K dimension | 0.00× | 1.00×
L1 #7: Matmul with small K dimension | 0.27× | 1.00×
L1 #8: Matmul with irregular shapes | 0.25× | 1.00×
L1 #9: Tall skinny matrix multiplication | 0.38× | 1.00×
L1 #10: 3D tensor matrix multiplication | 0.12× | 1.00×
L1 #11: 4D tensor matrix multiplication | 0.17× | 1.00×
L1 #13: Matmul for symmetric matrices | 0.26× | 0.78×
L1 #14: Matmul for upper triangular matrices | 0.39× | 1.29×
L1 #15: Matmul for lower triangular matrices | 0.27× | 1.29×
L1 #16: Matmul with transposed A | 0.17× | 0.73×
L1 #17: Matmul with transposed B | 0.39× | 0.73×
L1 #18: Matmul with transposed both | 0.07× | 1.00×
L1 #19: ReLU | 0.95× | 1.00×
L1 #20: LeakyReLU | 1.00× | 1.00×
L1 #21: Sigmoid | 1.00× | 1.00×
L1 #22: Tanh | 1.01× | 1.00×
L1 #23: Softmax | 0.90× | 1.00×
L1 #24: LogSoftmax | 0.96× | 1.00×
L1 #25: Swish | 2.47× | 1.50×
L1 #26: GELU | 0.99× | 1.00×
L1 #27: SELU | 1.00× | 1.00×
L1 #28: HardSigmoid | 1.00× | 1.00×
L1 #29: Softplus | 1.19× | 1.00×
L1 #30: Softsign | 3.47× | 1.50×
L1 #31: ELU | 1.00× | 1.00×
L1 #32: HardTanh | 1.00× | 1.00×
L1 #33: BatchNorm | 1.62× | 1.00×
L1 #34: InstanceNorm | 1.50× | 1.00×
L1 #35: GroupNorm | 1.30× | 1.00×
L1 #36: RMSNorm | 2.11× | 1.01×
L1 #37: FrobeniusNorm | 1.73× | 1.00×
L1 #38: L1Norm | 1.42× | 1.00×
L1 #41: Max Pooling 1D | 1.71× | 2.00×
L1 #42: Max Pooling 2D | 2.07× | 2.01×
L1 #43: Max Pooling 3D | 1.46× | 1.21×
L1 #44: Average Pooling 1D | 3.63× | 1.01×
L1 #46: Average Pooling 3D | 1.63× | 1.00×
L1 #47: Sum reduction over a dimension | 0.91× | 1.00×
L1 #48: Mean reduction over a dimension | 0.92× | 1.00×
L1 #49: Max reduction over a dimension | 1.20× | 1.01×
L1 #50: conv standard 2D square input square kernel | 0.31× | 2.07×
L1 #51: Argmax over a dimension | 1.11× | 1.00×
L1 #53: Min reduction over a dimension | 1.20× | 1.01×
L1 #54: conv standard 3D square input square kernel | 0.17× | 1.03×
L1 #55: conv standard 2D asymmetric input square kernel | 0.10× | 1.34×
L1 #56: conv standard 2D asymmetric input asymmetric kernel | 0.05× | 2.04×
L1 #57: conv transposed 2D square input square kernel | 0.05× | 2.01×
L1 #58: conv transposed 3D asymmetric input asymmetric kernel | 0.17× | 2.29×
L1 #59: conv standard 3D asymmetric input square kernel | 0.50× | 0.51×
L1 #60: conv standard 3D square input asymmetric kernel | 0.24× | 1.04×
L1 #61: conv transposed 3D square input square kernel | 0.07× | 2.04×
L1 #62: conv standard 2D square input asymmetric kernel | 0.07× | 2.04×
L1 #63: conv standard 2D square input square kernel | 0.03× | 1.11×
L1 #64: conv transposed 1D | 0.07× | 2.01×
L1 #65: conv transposed 2D square input asymmetric kernel | 0.04× | 2.03×
L1 #66: conv standard 3D asymmetric input asymmetric kernel | 0.26× | 1.07×
L1 #67: conv standard 1D | 0.10× | 1.34×
L1 #68: conv transposed 3D square input asymmetric kernel | 0.03× | 2.02×
L1 #69: conv transposed 2D asymmetric input asymmetric kernel | 0.02× | 2.02×
L1 #70: conv transposed 3D asymmetric input square kernel | 0.08× | 2.02×
L1 #71: conv transposed 2D asymmetric input square kernel | 0.05× | 2.03×
L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped | 0.57× | 1.26×
L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped | 0.05× | 2.03×
L1 #74: conv transposed 1D dilated | 0.12× | 2.02×
L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated | 0.48× | 1.04×
L1 #76: conv standard 1D dilated strided | 0.05× | 1.60×
L1 #77: conv transposed 3D square input square kernel padded dilated strided | 0.08× | 2.05×
L1 #78: conv transposed 2D asymmetric input asymmetric kernel padded | 0.05× | 2.03×
L1 #79: conv transposed 1D asymmetric input square kernel padded strided dilated | 0.14× | 2.02×
L1 #80: conv standard 2D square input asymmetric kernel dilated padded | 0.05× | 2.04×
L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided | 0.19× | 2.04×
L1 #82: conv depthwise 2D square input square kernel | 1.78× | 1.02×
L1 #83: conv depthwise 2D square input asymmetric kernel | 2.64× | 1.03×
L1 #84: conv depthwise 2D asymmetric input square kernel | 1.27× | 1.00×
L1 #85: conv depthwise 2D asymmetric input asymmetric kernel | 1.36× | 1.03×
L1 #86: conv depthwise separable 2D | 1.12× | 1.01×
L1 #87: conv pointwise 2D | 0.18× | 0.38×
L1 #89: cumsum | 0.55× | 1.00×
L1 #90: cumprod | 0.55× | 1.00×
L1 #91: cumsum reverse | 1.23× | 1.50×
L1 #92: cumsum exclusive | 0.83× | 1.50×
L1 #94: MSELoss | 2.99× | 2.00×
L1 #96: HuberLoss | 1.90× | 2.00×
L1 #97: ScaledDotProductAttention | 3.15× | 1.34×
L1 #98: KLDivLoss | 3.48× | 3.02×
L2 #1: Conv2D ReLU BiasAdd | 1.10× | 1.25×
L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide | 1.51× | 1.01×
L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU | 2.19× | 1.01×
L2 #4: Conv2d Mish Mish | 1.04× | 1.01×
L2 #5: ConvTranspose2d Subtract Tanh | 1.08× | 1.01×
L2 #6: Conv3d Softmax MaxPool MaxPool | 1.00× | 1.14×
L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd | 1.18× | 1.02×
L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum | 1.06× | 1.67×
L2 #9: Matmul Subtract Multiply ReLU | 0.07× | 0.63×
L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh | 1.21× | 1.00×
L2 #11: ConvTranspose2d BatchNorm Tanh MaxPool GroupNorm | 1.10× | 1.04×
L2 #12: Gemm Multiply LeakyReLU | 0.07× | 0.63×
L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling | 0.74× | 1.01×
L2 #15: ConvTranspose3d BatchNorm Subtract | 1.33× | 1.06×
L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling | 1.34× | 1.01×
L2 #17: Conv2d InstanceNorm Divide | 1.24× | 1.01×
L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd | 2.16× | 1.47×
L2 #21: Conv2d Add Scale Sigmoid GroupNorm | 1.34× | 1.01×
L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish | 0.18× | 0.63×
L2 #24: Conv3d Min Softmax | 0.83× | 1.19×
L2 #25: Conv2d Min Tanh Tanh | 1.43× | 1.04×
L2 #26: ConvTranspose3d Add HardSwish | 1.45× | 1.31×
L2 #27: Conv3d HardSwish GroupNorm Mean | 1.09× | 1.02×
L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply | 0.17× | 0.62×
L2 #29: Matmul Mish Mish | 0.15× | 0.60×
L2 #30: Gemm GroupNorm Hardtanh | 0.18× | 0.60×
L2 #31: Conv2d Min Add Multiply | 1.23× | 1.01×
L2 #32: Conv2d Scaling Min | 1.20× | 1.25×
L2 #33: Gemm Scale BatchNorm | 0.15× | 0.60×
L2 #35: Conv2d Subtract HardSwish MaxPool Mish | 1.33× | 1.25×
L2 #36: ConvTranspose2d Min Sum GELU Add | 0.64× | 1.03×
L2 #37: Matmul Swish Sum GroupNorm | 0.51× | 1.41×
L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply | 1.44× | 1.01×
L2 #39: Gemm Scale BatchNorm | 0.19× | 0.93×
L2 #40: Matmul Scaling ResidualAdd | 0.18× | 1.21×
L2 #41: Gemm BatchNorm GELU ReLU | 0.19× | 0.60×
L2 #43: Conv3d Max LogSumExp ReLU | 1.13× | 1.03×
L2 #45: Gemm Sigmoid LogSumExp | 0.01× | 1.41×
L2 #46: Conv2d Subtract Tanh Subtract AvgPool | 1.57× | 1.25×
L2 #47: Conv3d Mish Tanh | 1.05× | 1.03×
L2 #48: Conv3d Scaling Tanh Multiply Sigmoid | 1.18× | 1.03×
L2 #49: ConvTranspose3d Softmax Sigmoid | 1.50× | 1.03×
L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling | 1.29× | 1.02×
L2 #52: Conv2d Activation BatchNorm | 1.21× | 1.42×
L2 #53: Gemm Scaling Hardtanh GELU | 0.16× | 0.65×
L2 #54: Conv2d Multiply LeakyReLU GELU | 1.16× | 1.01×
L2 #55: Matmul MaxPool Sum Scale | 0.22× | 0.51×
L2 #56: Matmul Sigmoid Sum | 0.22× | 0.51×
L2 #57: Conv2d ReLU HardSwish | 1.73× | 1.50×
L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp | 1.46× | 1.06×
L2 #59: Matmul Swish Scaling | 0.22× | 0.51×
L2 #60: ConvTranspose3d Swish GroupNorm HardSwish | 1.06× | 1.48×
L2 #61: ConvTranspose3d ReLU GroupNorm | 1.08× | 1.03×
L2 #62: Matmul GroupNorm LeakyReLU Sum | 0.25× | 0.61×
L2 #63: Gemm ReLU Divide | 0.07× | 0.63×
L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU | 0.16× | 0.63×
L2 #66: Matmul Dropout Softmax | 0.22× | 0.51×
L2 #68: Matmul Min Subtract | 0.03× | 0.52×
L2 #69: Conv2d HardSwish ReLU | 1.17× | 1.03×
L2 #70: Gemm Sigmoid Scaling ResidualAdd | 0.15× | 0.65×
L2 #71: Conv2d Divide LeakyReLU | 1.17× | 1.03×
L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool | 1.00× | 0.97×
L2 #73: Conv2d BatchNorm Scaling | 1.00× | 0.70×
L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max | 1.65× | 1.06×
L2 #75: Gemm GroupNorm Min BiasAdd | 0.27× | 0.64×
L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool | 1.09× | 1.05×
L2 #78: ConvTranspose3d Max Max Sum | 0.76× | 1.01×
L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max | 1.40× | 1.97×
L2 #81: Gemm Swish Divide Clamp Tanh Clamp | 0.17× | 0.65×
L2 #82: Conv2d Tanh Scaling BiasAdd Max | 1.76× | 1.80×
L2 #84: Gemm BatchNorm Scaling Softmax | 0.16× | 0.55×
L2 #85: Conv2d GroupNorm Scale MaxPool Clamp | 1.71× | 1.84×
L2 #86: Matmul Divide GELU | 0.14× | 0.60×
L2 #87: Conv2d Subtract Subtract Mish | 1.28× | 1.01×
L2 #88: Gemm GroupNorm Swish Multiply Swish | 0.23× | 0.65×
L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max | 1.12× | 1.02×
L2 #90: Conv3d LeakyReLU Sum Clamp GELU | 1.23× | 1.01×
L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid | 2.20× | 1.01×
L2 #92: Conv2d GroupNorm Tanh HardSwish ResidualAdd LogSumExp | 1.83× | 2.90×
L2 #93: ConvTranspose2d Add Min GELU Multiply | 1.49× | 1.01×
L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm | 0.21× | 0.60×
L2 #95: Matmul Add Swish Tanh GELU Hardtanh | 0.18× | 0.65×
L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp | 1.17× | 1.02×
L2 #97: Matmul BatchNorm BiasAdd Divide Swish | 0.17× | 0.65×
L2 #98: Matmul AvgPool GELU Scale Max | 0.01× | 0.61×
L2 #99: Matmul GELU Softmax | 0.15× | 0.60×
L2 #100: ConvTranspose3d Clamp Min Divide | 1.17× | 1.00×
L3 #1: MLP | 0.21× | 0.51×
L3 #2: ShallowWideMLP | 0.21× | 0.50×
L3 #3: DeepNarrowMLP | 0.26× | 0.65×
L3 #5: AlexNet | 0.94× | 0.69×
L3 #6: GoogleNetInceptionModule | 1.20× | 1.05×
L3 #7: GoogleNetInceptionV1 | 0.86× | 0.57×
L3 #8: ResNetBasicBlock | 1.00× | 0.61×
L3 #9: ResNet18 | 0.89× | 0.52×
L3 #10: ResNet101 | 0.88× | 0.19×
L3 #11: VGG16 | 1.02× | 0.67×
L3 #12: VGG19 | 1.03× | 0.57×
L3 #13: DenseNet121TransitionLayer | 1.27× | 0.90×
L3 #14: DenseNet121DenseBlock | 1.00× | 0.48×
L3 #15: DenseNet121 | 0.94× | 0.18×
L3 #16: DenseNet201 | 0.89× | 0.15×
L3 #17: SqueezeNetFireModule | 1.37× | 1.00×
L3 #18: SqueezeNet | 1.18× | 0.88×
L3 #19: MobileNetV1 | 0.85× | 0.28×
L3 #20: MobileNetV2 | 0.69× | 0.18×
L3 #21: EfficientNetMBConv | 1.00× | 0.54×
L3 #22: EfficientNetB0 | 0.90× | 0.22×
L3 #23: EfficientNetB1 | 0.85× | 0.27×
L3 #25: ShuffleNetUnit | 0.95× | 1.01×
L3 #26: ShuffleNet | 0.99× | 0.69×
L3 #27: RegNet | 1.04× | 0.81×
L3 #28: VisionTransformer | 0.82× | 0.46×
L3 #29: SwinMLP | 0.88× | 0.78×
L3 #30: SwinTransformerV2 | 0.84× | 0.37×
L3 #32: ConvolutionalVisionTransformer | 1.21× | 0.82×
L3 #33: VanillaRNN | 0.16× | 0.51×
L3 #35: LSTM | 1.00× | 0.96×
L3 #37: LSTMCn | 0.24× | 2.71×
L3 #38: LSTMBidirectional | 1.01× | 0.95×
L3 #39: GRU | 0.56× | 3.37×
L3 #40: GRUHidden | 0.56× | 3.37×
L3 #44: MiniGPTBlock | 0.50× | 0.99×
L3 #45: UNetSoftmax | 1.01× | 0.45×
L3 #46: NetVladWithGhostClusters | 0.84× | 1.01×
L3 #47: NetVladNoGhostClusters | 0.39× | 1.00×
L3 #48: Mamba2ReturnY | 1.02× | 0.67×
L3 #50: ReLUSelfAttention | 0.80× | 1.41×

Claude Opus 4.8 (high)

[Scatter plot, one dot per problem. X axis: Speedup (×). Y axis: Mem Efficiency (×). Per-problem values follow.]

Problem | Speedup | Mem eff
L1 #1: Square matrix multiplication | 0.08× | 1.00×
L1 #2: Standard matrix multiplication | 0.06× | 1.00×
L1 #3: Batched matrix multiplication | 0.03× | 1.00×
L1 #4: Matrix vector multiplication | 1.22× | 1.00×
L1 #5: Matrix scalar multiplication | 0.99× | 1.00×
L1 #6: Matmul with large K dimension | 0.05× | 1.00×
L1 #7: Matmul with small K dimension | 0.21× | 1.00×
L1 #8: Matmul with irregular shapes | 0.07× | 1.00×
L1 #9: Tall skinny matrix multiplication | 0.13× | 1.00×
L1 #10: 3D tensor matrix multiplication | 0.03× | 1.00×
L1 #11: 4D tensor matrix multiplication | 0.04× | 1.00×
L1 #13: Matmul for symmetric matrices | 0.05× | 1.00×
L1 #14: Matmul for upper triangular matrices | 0.15× | 1.29×
L1 #15: Matmul for lower triangular matrices | 0.15× | 1.29×
L1 #16: Matmul with transposed A | 0.03× | 1.00×
L1 #17: Matmul with transposed B | 0.01× | 1.00×
L1 #18: Matmul with transposed both | 0.01× | 1.00×
L1 #19: ReLU | 1.01× | 1.00×
L1 #20: LeakyReLU | 0.95× | 1.00×
L1 #21: Sigmoid | 0.95× | 1.00×
L1 #22: Tanh | 0.96× | 1.00×
L1 #23: Softmax | 1.35× | 1.00×
L1 #24: LogSoftmax | 0.96× | 1.00×
L1 #25: Swish | 2.35× | 1.50×
L1 #26: GELU | 1.00× | 1.00×
L1 #27: SELU | 0.95× | 1.00×
L1 #28: HardSigmoid | 0.95× | 1.00×
L1 #29: Softplus | 1.15× | 1.00×
L1 #30: Softsign | 3.29× | 1.50×
L1 #31: ELU | 0.95× | 1.00×
L1 #32: HardTanh | 0.95× | 1.00×
L1 #33: BatchNorm | 0.24× | 1.00×
L1 #34: InstanceNorm | 1.42× | 1.00×
L1 #35: GroupNorm | 1.38× | 1.00×
L1 #36: RMSNorm | 2.11× | 1.01×
L1 #37: FrobeniusNorm | 1.26× | 1.00×
L1 #38: L1Norm | 1.19× | 1.00×
L1 #39: L2Norm | 0.74× | 1.00×
L1 #40: LayerNorm | 3.14× | 1.00×
L1 #41: Max Pooling 1D | 2.10× | 2.00×
L1 #42: Max Pooling 2D | 1.49× | 2.01×
L1 #43: Max Pooling 3D | 1.13× | 1.21×
L1 #44: Average Pooling 1D | 3.10× | 1.01×
L1 #45: Average Pooling 2D | 1.14× | 1.00×
L1 #46: Average Pooling 3D | 1.69× | 1.00×
L1 #47: Sum reduction over a dimension | 0.89× | 1.00×
L1 #48: Mean reduction over a dimension | 0.83× | 1.00×
L1 #49: Max reduction over a dimension | 0.22× | 1.01×
L1 #51: Argmax over a dimension | 1.12× | 1.00×
L1 #52: Argmin over a dimension | 1.12× | 1.00×
L1 #53: Min reduction over a dimension | 1.09× | 1.01×
L1 #54: conv standard 3D square input square kernel | 0.07× | 1.03×
L1 #59: conv standard 3D asymmetric input square kernel | 0.07× | 1.01×
L1 #60: conv standard 3D square input asymmetric kernel | 0.09× | 1.04×
L1 #66: conv standard 3D asymmetric input asymmetric kernel | 0.10× | 1.07×
L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped | 0.50× | 1.26×
L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped | 0.05× | 2.03×
L1 #74: conv transposed 1D dilated | 0.09× | 2.02×
L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated | 0.51× | 1.04×
L1 #77: conv transposed 3D square input square kernel padded dilated strided | 0.09× | 2.05×
L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided | 0.22× | 2.04×
L1 #82: conv depthwise 2D square input square kernel | 1.70× | 1.02×
L1 #83: conv depthwise 2D square input asymmetric kernel | 1.51× | 1.03×
L1 #84: conv depthwise 2D asymmetric input square kernel | 1.61× | 1.00×
L1 #85: conv depthwise 2D asymmetric input asymmetric kernel | 1.30× | 1.03×
L1 #89: cumsum | 0.64× | 1.00×
L1 #90: cumprod | 0.81× | 1.00×
L1 #91: cumsum reverse | 1.74× | 1.50×
L1 #92: cumsum exclusive | 1.31× | 1.50×
L1 #94: MSELoss | 2.91× | 2.00×
L1 #96: HuberLoss | 1.92× | 2.00×
L1 #97: ScaledDotProductAttention | 1.01× | 1.00×
L1 #100: HingeLoss | 3.69× | 3.01×
L2 #1: Conv2D ReLU BiasAdd | 1.20× | 1.01×
L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide | 1.83× | 1.01×
L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU | 1.02× | 0.97×
L2 #4: Conv2d Mish Mish | 1.03× | 1.01×
L2 #5: ConvTranspose2d Subtract Tanh | 1.22× | 1.01×
L2 #6: Conv3d Softmax MaxPool MaxPool | 1.34× | 1.14×
L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd | 1.17× | 1.02×
L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum | 1.05× | 1.67×
L2 #9: Matmul Subtract Multiply ReLU | 0.14× | 0.63×
L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh | 1.16× | 1.00×
L2 #11: ConvTranspose2d BatchNorm Tanh MaxPool GroupNorm | 1.07× | 0.87×
L2 #12: Gemm Multiply LeakyReLU | 0.14× | 0.60×
L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling | 1.03× | 1.01×
L2 #15: ConvTranspose3d BatchNorm Subtract | 1.38× | 1.06×
L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling | 1.44× | 1.01×
L2 #17: Conv2d InstanceNorm Divide | 1.24× | 1.01×
L2 #18: Matmul Sum Max AvgPool LogSumExp LogSumExp | 3.66× | 0.50×
L2 #19: ConvTranspose2d GELU GroupNorm | 1.05× | 1.00×
L2 #21: Conv2d Add Scale Sigmoid GroupNorm | 1.17× | 1.01×
L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish | 0.18× | 0.63×
L2 #24: Conv3d Min Softmax | 0.82× | 1.19×
L2 #25: Conv2d Min Tanh Tanh | 1.08× | 1.04×
L2 #26: ConvTranspose3d Add HardSwish | 1.44× | 1.31×
L2 #27: Conv3d HardSwish GroupNorm Mean | 1.17× | 1.02×
L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply | 0.17× | 0.62×
L2 #29: Matmul Mish Mish | 0.15× | 0.60×
L2 #30: Gemm GroupNorm Hardtanh | 0.18× | 0.60×
L2 #31: Conv2d Min Add Multiply | 1.22× | 1.01×
L2 #32: Conv2d Scaling Min | 1.20× | 1.25×
L2 #33: Gemm Scale BatchNorm | 0.15× | 0.60×
L2 #34: ConvTranspose3d LayerNorm GELU Scaling | 2.09× | 1.01×
L2 #35: Conv2d Subtract HardSwish MaxPool Mish | 1.33× | 1.25×
L2 #36: ConvTranspose2d Min Sum GELU Add | 0.64× | 1.03×
L2 #37: Matmul Swish Sum GroupNorm | 0.47× | 1.41×
L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply | 1.50× | 1.01×
L2 #39: Gemm Scale BatchNorm | 0.18× | 0.94×
L2 #40: Matmul Scaling ResidualAdd | 0.18× | 1.21×
L2 #41: Gemm BatchNorm GELU ReLU | 0.19× | 0.72×
L2 #43: Conv3d Max LogSumExp ReLU | 1.12× | 1.03×
L2 #45: Gemm Sigmoid LogSumExp | 0.17× | 0.87×
L2 #46: Conv2d Subtract Tanh Subtract AvgPool | 1.62× | 1.25×
L2 #47: Conv3d Mish Tanh | 1.05× | 1.03×
L2 #48: Conv3d Scaling Tanh Multiply Sigmoid | 1.18× | 1.03×
L2 #49: ConvTranspose3d Softmax Sigmoid | 1.52× | 1.03×
L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling | 1.30× | 1.02×
L2 #52: Conv2d Activation BatchNorm | 1.21× | 1.42×
L2 #53: Gemm Scaling Hardtanh GELU | 0.16× | 0.65×
L2 #54: Conv2d Multiply LeakyReLU GELU | 1.22× | 1.01×
L2 #55: Matmul MaxPool Sum Scale | 0.22× | 0.51×
L2 #56: Matmul Sigmoid Sum | 0.00× | 0.51×
L2 #57: Conv2d ReLU HardSwish | 1.83× | 1.50×
L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp | 1.46× | 1.06×
L2 #59: Matmul Swish Scaling | 0.22× | 0.51×
L2 #60: ConvTranspose3d Swish GroupNorm HardSwish | 1.10× | 1.48×
L2 #61: ConvTranspose3d ReLU GroupNorm | 1.04× | 1.03×
L2 #62: Matmul GroupNorm LeakyReLU Sum | 0.25× | 0.61×
L2 #63: Gemm ReLU Divide | 0.02× | 0.63×
L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU | 0.16× | 0.63×
L2 #67: Conv2d GELU GlobalAvgPool | 1.19× | 1.90×
L2 #68: Matmul Min Subtract | 0.22× | 0.51×
L2 #69: Conv2d HardSwish ReLU | 1.17× | 1.03×
L2 #70: Gemm Sigmoid Scaling ResidualAdd | 0.15× | 0.65×
L2 #71: Conv2d Divide LeakyReLU | 1.17× | 1.03×
L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool | 1.05× | 1.02×
L2 #73: Conv2d BatchNorm Scaling | 1.00× | 0.70×
L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max | 1.66× | 1.06×
L2 #76: Gemm Add ReLU | 0.02× | 0.63×
L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool | 1.05× | 1.05×
L2 #78: ConvTranspose3d Max Max Sum | 0.99× | 1.01×
L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max | 1.42× | 1.97×
L2 #81: Gemm Swish Divide Clamp Tanh Clamp | 0.17× | 0.65×
L2 #82: Conv2d Tanh Scaling BiasAdd Max | 1.72× | 1.80×
L2 #84: Gemm BatchNorm Scaling Softmax | 0.16× | 0.44×
L2 #85: Conv2d GroupNorm Scale MaxPool Clamp | 1.59× | 1.84×
L2 #86: Matmul Divide GELU | 0.14× | 0.63×
L2 #87: Conv2d Subtract Subtract Mish | 1.28× | 1.01×
L2 #88: Gemm GroupNorm Swish Multiply Swish | 0.23× | 0.65×
L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max | 1.12× | 1.02×
L2 #90: Conv3d LeakyReLU Sum Clamp GELU | 1.31× | 1.01×
L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid | 2.20× | 1.01×
L2 #93: ConvTranspose2d Add Min GELU Multiply | 1.49× | 1.01×
L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm | 0.21× | 0.60×
L2 #95: Matmul Add Swish Tanh GELU Hardtanh | 0.18× | 0.65×
L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp | 1.18× | 1.02×
L2 #97: Matmul BatchNorm BiasAdd Divide Swish | 0.17× | 0.65×
L2 #98: Matmul AvgPool GELU Scale Max | 0.15× | 0.58×
L2 #99: Matmul GELU Softmax | 0.15× | 0.60×
L2 #100: ConvTranspose3d Clamp Min Divide | 1.17× | 1.00×
L3 #1: MLP | 0.02× | 0.51×
L3 #2: ShallowWideMLP | 0.21× | 0.50×
L3 #3: DeepNarrowMLP | 0.04× | 0.65×
L3 #4: LeNet5 | 2.93× | 2.02×
L3 #5: AlexNet | 0.99× | 0.94×
L3 #6: GoogleNetInceptionModule | 1.24× | 1.45×
L3 #7: GoogleNetInceptionV1 | 0.83× | 0.57×
L3 #8: ResNetBasicBlock | 1.00× | 0.61×
L3 #9: ResNet18 | 0.88× | 0.52×
L3 #10: ResNet101 | 0.88× | 0.19×
L3 #11: VGG16 | 0.91× | 0.67×
L3 #12: VGG19 | 1.03× | 0.66×
L3 #13: DenseNet121TransitionLayer | 1.00× | 0.90×
L3 #14: DenseNet121DenseBlock | 1.00× | 0.48×
L3 #15: DenseNet121 | 1.05× | 0.18×
L3 #16: DenseNet201 | 0.90× | 0.15×
L3 #17: SqueezeNetFireModule | 0.76× | 1.94×
L3 #18: SqueezeNet | 1.03× | 0.88×
L3 #19: MobileNetV1 | 0.86× | 0.28×
L3 #20: MobileNetV2 | 0.83× | 0.18×
L3 #21: EfficientNetMBConv | 1.00× | 0.54×
L3 #22: EfficientNetB0 | 0.89× | 0.22×
L3 #23: EfficientNetB1 | 0.87× | 0.27×
L3 #25: ShuffleNetUnit | 0.96× | 0.87×
L3 #26: ShuffleNet | 0.99× | 0.19×
L3 #27: RegNet | 1.01× | 0.50×
L3 #28: VisionTransformer | 0.86× | 0.46×
L3 #29: SwinMLP | 0.87× | 0.69×
L3 #30: SwinTransformerV2 | 0.85× | 0.48×
L3 #31: VisionAttention | 1.02× | 1.00×
L3 #32: ConvolutionalVisionTransformer | 1.14× | 0.82×
L3 #33: VanillaRNN | 0.16× | 0.51×
L3 #34: VanillaRNNHidden | 2.65× | 0.94×
L3 #35: LSTM | 0.23× | 2.31×
L3 #36: LSTMHn | 0.28× | 3.39×
L3 #37: LSTMCn | 0.25× | 2.70×
L3 #38: LSTMBidirectional | 1.00× | 0.95×
L3 #39: GRU | 0.56× | 3.34×
L3 #40: GRUHidden | 0.56× | 3.37×
L3 #44: MiniGPTBlock | 0.50× | 1.11×
L3 #45: UNetSoftmax | 1.02× | 0.45×
L3 #46: NetVladWithGhostClusters | 0.68× | 1.08×
L3 #48: Mamba2ReturnY | 0.20× | 1.08×
L3 #50: ReLUSelfAttention | 0.09× | 3.62×

Claude Sonnet 4.6 (high)

Per-problem results (chart axes: Speedup (×), Mem Efficiency (×))
L1 #1: Square matrix multiplication Speedup: 0.12× Mem eff: 1.00× L1 #2: Standard matrix multiplication Speedup: 0.13× Mem eff: 1.00× L1 #3: Batched matrix multiplication Speedup: 0.16× Mem eff: 1.00× L1 #4: Matrix vector multiplication Speedup: 1.07× Mem eff: 1.00× L1 #5: Matrix scalar multiplication Speedup: 0.63× Mem eff: 1.00× L1 #6: Matmul with large K dimension Speedup: 0.29× Mem eff: 1.00× L1 #7: Matmul with small K dimension Speedup: 0.08× Mem eff: 1.00× L1 #8: Matmul with irregular shapes Speedup: 0.38× Mem eff: 1.00× L1 #9: Tall skinny matrix multiplication Speedup: 0.02× Mem eff: 1.00× L1 #10: 3D tensor matrix multiplication Speedup: 0.17× Mem eff: 1.00× L1 #11: 4D tensor matrix multiplication Speedup: 0.17× Mem eff: 1.00× L1 #13: Matmul for symmetric matrices Speedup: 0.13× Mem eff: 1.00× L1 #14: Matmul for upper triangular matrices Speedup: 0.10× Mem eff: 1.29× L1 #15: Matmul for lower triangular matrices Speedup: 0.11× Mem eff: 1.29× L1 #16: Matmul with transposed A Speedup: 0.21× Mem eff: 1.00× L1 #17: Matmul with transposed B Speedup: 0.12× Mem eff: 1.00× L1 #18: Matmul with transposed both Speedup: 0.01× Mem eff: 1.00× L1 #19: ReLU Speedup: 0.63× Mem eff: 1.00× L1 #20: LeakyReLU Speedup: 0.63× Mem eff: 1.00× L1 #21: Sigmoid Speedup: 0.58× Mem eff: 1.00× L1 #22: Tanh Speedup: 0.61× Mem eff: 1.00× L1 #23: Softmax Speedup: 1.12× Mem eff: 1.00× L1 #24: LogSoftmax Speedup: 0.99× Mem eff: 1.00× L1 #25: Swish Speedup: 1.44× Mem eff: 1.50× L1 #27: SELU Speedup: 0.62× Mem eff: 1.00× L1 #28: HardSigmoid Speedup: 0.60× Mem eff: 1.00× L1 #29: Softplus Speedup: 0.69× Mem eff: 1.00× L1 #30: Softsign Speedup: 2.07× Mem eff: 1.50× L1 #31: ELU Speedup: 0.62× Mem eff: 1.00× L1 #32: HardTanh Speedup: 0.63× Mem eff: 1.00× L1 #33: BatchNorm Speedup: 0.87× Mem eff: 1.00× L1 #34: InstanceNorm Speedup: 1.43× Mem eff: 1.00× L1 #35: GroupNorm Speedup: 1.29× Mem eff: 1.00× L1 #36: RMSNorm Speedup: 
0.24× Mem eff: 1.01× L1 #37: FrobeniusNorm Speedup: 0.86× Mem eff: 1.00× L1 #38: L1Norm Speedup: 1.77× Mem eff: 1.00× L1 #39: L2Norm Speedup: 0.89× Mem eff: 1.00× L1 #42: Max Pooling 2D Speedup: 1.58× Mem eff: 2.01× L1 #43: Max Pooling 3D Speedup: 1.21× Mem eff: 1.21× L1 #46: Average Pooling 3D Speedup: 1.82× Mem eff: 1.00× L1 #47: Sum reduction over a dimension Speedup: 0.83× Mem eff: 1.00× L1 #48: Mean reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #49: Max reduction over a dimension Speedup: 1.09× Mem eff: 1.01× L1 #50: conv standard 2D square input square kernel Speedup: 1.00× Mem eff: 1.04× L1 #51: Argmax over a dimension Speedup: 1.11× Mem eff: 1.00× L1 #52: Argmin over a dimension Speedup: 1.19× Mem eff: 1.00× L1 #53: Min reduction over a dimension Speedup: 1.09× Mem eff: 1.01× L1 #54: conv standard 3D square input square kernel Speedup: 0.99× Mem eff: 1.03× L1 #56: conv standard 2D asymmetric input asymmetric kernel Speedup: 1.00× Mem eff: 1.21× L1 #57: conv transposed 2D square input square kernel Speedup: 1.01× Mem eff: 1.00× L1 #58: conv transposed 3D asymmetric input asymmetric kernel Speedup: 1.13× Mem eff: 1.15× L1 #59: conv standard 3D asymmetric input square kernel Speedup: 0.50× Mem eff: 0.51× L1 #61: conv transposed 3D square input square kernel Speedup: 1.66× Mem eff: 1.02× L1 #62: conv standard 2D square input asymmetric kernel Speedup: 1.53× Mem eff: 1.22× L1 #63: conv standard 2D square input square kernel Speedup: 1.01× Mem eff: 1.00× L1 #64: conv transposed 1D Speedup: 0.04× Mem eff: 2.01× L1 #65: conv transposed 2D square input asymmetric kernel Speedup: 1.03× Mem eff: 1.36× L1 #66: conv standard 3D asymmetric input asymmetric kernel Speedup: 1.76× Mem eff: 0.97× L1 #67: conv standard 1D Speedup: 0.08× Mem eff: 1.34× L1 #68: conv transposed 3D square input asymmetric kernel Speedup: 1.00× Mem eff: 1.01× L1 #69: conv transposed 2D asymmetric input asymmetric kernel Speedup: 1.12× Mem eff: 1.52× L1 #70: conv transposed 3D 
asymmetric input square kernel Speedup: 1.02× Mem eff: 1.01× L1 #71: conv transposed 2D asymmetric input square kernel Speedup: 1.10× Mem eff: 1.35× L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped Speedup: 1.02× Mem eff: 1.26× L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped Speedup: 1.00× Mem eff: 1.01× L1 #74: conv transposed 1D dilated Speedup: 0.08× Mem eff: 2.02× L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated Speedup: 0.08× Mem eff: 1.04× L1 #76: conv standard 1D dilated strided Speedup: 0.04× Mem eff: 1.60× L1 #77: conv transposed 3D square input square kernel padded dilated strided Speedup: 1.00× Mem eff: 1.03× L1 #78: conv transposed 2D asymmetric input asymmetric kernel padded Speedup: 0.47× Mem eff: 1.02× L1 #80: conv standard 2D square input asymmetric kernel dilated padded Speedup: 0.03× Mem eff: 2.04× L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided Speedup: 0.03× Mem eff: 2.04× L1 #82: conv depthwise 2D square input square kernel Speedup: 1.33× Mem eff: 1.02× L1 #89: cumsum Speedup: 0.22× Mem eff: 1.00× L1 #90: cumprod Speedup: 0.62× Mem eff: 1.00× L1 #91: cumsum reverse Speedup: 0.78× Mem eff: 1.50× L1 #96: HuberLoss Speedup: 0.77× Mem eff: 2.00× L2 #1: Conv2D ReLU BiasAdd Speedup: 1.10× Mem eff: 1.01× L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide Speedup: 1.59× Mem eff: 1.01× L2 #4: Conv2d Mish Mish Speedup: 1.00× Mem eff: 1.01× L2 #5: ConvTranspose2d Subtract Tanh Speedup: 1.08× Mem eff: 1.01× L2 #6: Conv3d Softmax MaxPool MaxPool Speedup: 0.74× Mem eff: 2.04× L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd Speedup: 1.20× Mem eff: 1.81× L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum Speedup: 0.98× Mem eff: 1.03× L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.60× L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh Speedup: 1.20× Mem eff: 1.00× L2 #11: ConvTranspose2d 
BatchNorm Tanh MaxPool GroupNorm Speedup: 1.00× Mem eff: 0.87× L2 #12: Gemm Multiply LeakyReLU Speedup: 0.14× Mem eff: 0.60× L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling Speedup: 1.02× Mem eff: 1.01× L2 #15: ConvTranspose3d BatchNorm Subtract Speedup: 1.01× Mem eff: 0.74× L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling Speedup: 1.36× Mem eff: 1.01× L2 #17: Conv2d InstanceNorm Divide Speedup: 1.25× Mem eff: 1.01× L2 #18: Matmul Sum Max AvgPool LogSumExp LogSumExp Speedup: 0.32× Mem eff: 0.50× L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd Speedup: 1.88× Mem eff: 1.47× L2 #21: Conv2d Add Scale Sigmoid GroupNorm Speedup: 1.21× Mem eff: 1.01× L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish Speedup: 0.18× Mem eff: 0.63× L2 #24: Conv3d Min Softmax Speedup: 1.02× Mem eff: 1.19× L2 #25: Conv2d Min Tanh Tanh Speedup: 1.08× Mem eff: 1.04× L2 #26: ConvTranspose3d Add HardSwish Speedup: 1.35× Mem eff: 1.31× L2 #27: Conv3d HardSwish GroupNorm Mean Speedup: 0.46× Mem eff: 1.02× L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply Speedup: 0.17× Mem eff: 0.62× L2 #29: Matmul Mish Mish Speedup: 0.15× Mem eff: 0.60× L2 #30: Gemm GroupNorm Hardtanh Speedup: 0.18× Mem eff: 0.60× L2 #31: Conv2d Min Add Multiply Speedup: 1.23× Mem eff: 1.01× L2 #32: Conv2d Scaling Min Speedup: 1.20× Mem eff: 1.25× L2 #33: Gemm Scale BatchNorm Speedup: 0.15× Mem eff: 0.60× L2 #35: Conv2d Subtract HardSwish MaxPool Mish Speedup: 1.33× Mem eff: 1.25× L2 #36: ConvTranspose2d Min Sum GELU Add Speedup: 0.33× Mem eff: 1.03× L2 #37: Matmul Swish Sum GroupNorm Speedup: 0.48× Mem eff: 1.41× L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply Speedup: 1.38× Mem eff: 1.01× L2 #39: Gemm Scale BatchNorm Speedup: 0.18× Mem eff: 0.94× L2 #40: Matmul Scaling ResidualAdd Speedup: 0.18× Mem eff: 1.21× L2 #41: Gemm BatchNorm GELU ReLU Speedup: 0.19× Mem eff: 0.73× L2 #42: ConvTranspose2d GlobalAvgPool BiasAdd LogSumExp Sum Multiply Speedup: 0.25× Mem eff: 1.01× L2 #43: Conv3d Max LogSumExp 
ReLU Speedup: 1.05× Mem eff: 1.03× L2 #44: ConvTranspose2d Multiply GlobalAvgPool GlobalAvgPool Mean Speedup: 1.14× Mem eff: 1.03× L2 #45: Gemm Sigmoid LogSumExp Speedup: 0.17× Mem eff: 0.94× L2 #46: Conv2d Subtract Tanh Subtract AvgPool Speedup: 1.58× Mem eff: 1.25× L2 #47: Conv3d Mish Tanh Speedup: 1.02× Mem eff: 1.03× L2 #48: Conv3d Scaling Tanh Multiply Sigmoid Speedup: 1.18× Mem eff: 1.03× L2 #49: ConvTranspose3d Softmax Sigmoid Speedup: 0.55× Mem eff: 0.54× L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling Speedup: 1.30× Mem eff: 1.02× L2 #51: Gemm Subtract GlobalAvgPool LogSumExp GELU ResidualAdd Speedup: 0.16× Mem eff: 0.74× L2 #52: Conv2d Activation BatchNorm Speedup: 1.26× Mem eff: 1.42× L2 #53: Gemm Scaling Hardtanh GELU Speedup: 0.16× Mem eff: 0.65× L2 #54: Conv2d Multiply LeakyReLU GELU Speedup: 1.16× Mem eff: 1.01× L2 #55: Matmul MaxPool Sum Scale Speedup: 0.22× Mem eff: 0.51× L2 #56: Matmul Sigmoid Sum Speedup: 0.22× Mem eff: 0.51× L2 #57: Conv2d ReLU HardSwish Speedup: 1.65× Mem eff: 1.50× L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp Speedup: 1.46× Mem eff: 1.06× L2 #59: Matmul Swish Scaling Speedup: 0.22× Mem eff: 0.51× L2 #60: ConvTranspose3d Swish GroupNorm HardSwish Speedup: 1.01× Mem eff: 1.01× L2 #61: ConvTranspose3d ReLU GroupNorm Speedup: 0.65× Mem eff: 1.03× L2 #62: Matmul GroupNorm LeakyReLU Sum Speedup: 0.25× Mem eff: 0.61× L2 #63: Gemm ReLU Divide Speedup: 0.14× Mem eff: 0.60× L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU Speedup: 0.16× Mem eff: 0.63× L2 #65: Conv2d AvgPool Sigmoid Sum Speedup: 1.14× Mem eff: 1.06× L2 #68: Matmul Min Subtract Speedup: 0.22× Mem eff: 0.51× L2 #69: Conv2d HardSwish ReLU Speedup: 1.05× Mem eff: 1.03× L2 #70: Gemm Sigmoid Scaling ResidualAdd Speedup: 0.15× Mem eff: 0.65× L2 #71: Conv2d Divide LeakyReLU Speedup: 1.06× Mem eff: 1.03× L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool Speedup: 1.05× Mem eff: 1.02× L2 #73: Conv2d BatchNorm Scaling Speedup: 1.00× Mem eff: 0.70× L2 #74: 
ConvTranspose3d LeakyReLU Multiply LeakyReLU Max Speedup: 1.70× Mem eff: 1.06× L2 #75: Gemm GroupNorm Min BiasAdd Speedup: 0.27× Mem eff: 0.64× L2 #76: Gemm Add ReLU Speedup: 0.15× Mem eff: 0.60× L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool Speedup: 1.01× Mem eff: 1.05× L2 #78: ConvTranspose3d Max Max Sum Speedup: 0.90× Mem eff: 1.01× L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max Speedup: 1.38× Mem eff: 1.97× L2 #81: Gemm Swish Divide Clamp Tanh Clamp Speedup: 0.17× Mem eff: 0.65× L2 #82: Conv2d Tanh Scaling BiasAdd Max Speedup: 1.77× Mem eff: 1.80× L2 #84: Gemm BatchNorm Scaling Softmax Speedup: 0.16× Mem eff: 0.55× L2 #85: Conv2d GroupNorm Scale MaxPool Clamp Speedup: 1.48× Mem eff: 1.84× L2 #86: Matmul Divide GELU Speedup: 0.14× Mem eff: 0.60× L2 #87: Conv2d Subtract Subtract Mish Speedup: 1.21× Mem eff: 1.01× L2 #88: Gemm GroupNorm Swish Multiply Swish Speedup: 0.23× Mem eff: 0.65× L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max Speedup: 1.12× Mem eff: 1.02× L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid Speedup: 2.20× Mem eff: 1.01× L2 #93: ConvTranspose2d Add Min GELU Multiply Speedup: 1.49× Mem eff: 1.01× L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm Speedup: 0.21× Mem eff: 0.60× L2 #95: Matmul Add Swish Tanh GELU Hardtanh Speedup: 0.18× Mem eff: 0.65× L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp Speedup: 1.18× Mem eff: 1.02× L2 #97: Matmul BatchNorm BiasAdd Divide Swish Speedup: 0.18× Mem eff: 0.62× L2 #99: Matmul GELU Softmax Speedup: 0.15× Mem eff: 0.60× L2 #100: ConvTranspose3d Clamp Min Divide Speedup: 1.18× Mem eff: 1.00× L3 #1: MLP Speedup: 0.21× Mem eff: 0.51× L3 #2: ShallowWideMLP Speedup: 0.21× Mem eff: 0.50× L3 #3: DeepNarrowMLP Speedup: 0.26× Mem eff: 0.65× L3 #4: LeNet5 Speedup: 0.92× Mem eff: 1.00× L3 #5: AlexNet Speedup: 0.90× Mem eff: 0.69× L3 #6: GoogleNetInceptionModule Speedup: 0.95× Mem eff: 0.96× L3 #8: ResNetBasicBlock Speedup: 1.03× Mem eff: 0.71× L3 #9: ResNet18 Speedup: 0.92× Mem eff: 
0.60× L3 #10: ResNet101 Speedup: 0.88× Mem eff: 0.19× L3 #11: VGG16 Speedup: 1.00× Mem eff: 0.59× L3 #12: VGG19 Speedup: 1.03× Mem eff: 0.57× L3 #13: DenseNet121TransitionLayer Speedup: 1.00× Mem eff: 0.90× L3 #14: DenseNet121DenseBlock Speedup: 1.00× Mem eff: 0.48× L3 #15: DenseNet121 Speedup: 0.90× Mem eff: 0.18× L3 #16: DenseNet201 Speedup: 0.92× Mem eff: 0.15× L3 #17: SqueezeNetFireModule Speedup: 1.03× Mem eff: 1.00× L3 #18: SqueezeNet Speedup: 1.01× Mem eff: 0.88× L3 #19: MobileNetV1 Speedup: 0.85× Mem eff: 0.28× L3 #21: EfficientNetMBConv Speedup: 1.00× Mem eff: 0.54× L3 #22: EfficientNetB0 Speedup: 0.94× Mem eff: 0.66× L3 #23: EfficientNetB1 Speedup: 0.81× Mem eff: 0.27× L3 #25: ShuffleNetUnit Speedup: 0.98× Mem eff: 1.01× L3 #26: ShuffleNet Speedup: 0.99× Mem eff: 0.21× L3 #29: SwinMLP Speedup: 0.94× Mem eff: 0.72× L3 #30: SwinTransformerV2 Speedup: 0.85× Mem eff: 0.57× L3 #33: VanillaRNN Speedup: 0.16× Mem eff: 0.52× L3 #34: VanillaRNNHidden Speedup: 1.03× Mem eff: 0.93× L3 #35: LSTM Speedup: 1.01× Mem eff: 0.96× L3 #36: LSTMHn Speedup: 1.01× Mem eff: 0.96× L3 #37: LSTMCn Speedup: 0.98× Mem eff: 0.96× L3 #38: LSTMBidirectional Speedup: 1.00× Mem eff: 0.95× L3 #39: GRU Speedup: 1.00× Mem eff: 1.07× L3 #40: GRUHidden Speedup: 1.01× Mem eff: 1.07× L3 #41: GRUBidirectional Speedup: 1.01× Mem eff: 1.01× L3 #42: GRUBidirectionalHidden Speedup: 1.00× Mem eff: 1.01× L3 #43: MinGPTCausalAttention Speedup: 0.62× Mem eff: 2.36× L3 #44: MiniGPTBlock Speedup: 0.50× Mem eff: 1.17× L3 #45: UNetSoftmax Speedup: 1.01× Mem eff: 0.45× L3 #46: NetVladWithGhostClusters Speedup: 0.71× Mem eff: 0.88× L3 #47: NetVladNoGhostClusters Speedup: 0.12× Mem eff: 1.18× L3 #48: Mamba2ReturnY Speedup: 0.77× Mem eff: 0.61× L3 #49: Mamba2ReturnFinalState Speedup: 1.08× Mem eff: 0.71× L3 #50: ReLUSelfAttention Speedup: 0.71× Mem eff: 0.94×

Gemini 3 Flash (high)

Per-problem results (chart axes: Speedup (×), Mem Efficiency (×))
L1 #3: Batched matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #4: Matrix vector multiplication Speedup: 1.24× Mem eff: 1.00× L1 #5: Matrix scalar multiplication Speedup: 0.62× Mem eff: 1.00× L1 #6: Matmul with large K dimension Speedup: 0.05× Mem eff: 1.00× L1 #7: Matmul with small K dimension Speedup: 0.12× Mem eff: 1.00× L1 #8: Matmul with irregular shapes Speedup: 0.07× Mem eff: 1.00× L1 #9: Tall skinny matrix multiplication Speedup: 0.17× Mem eff: 1.00× L1 #10: 3D tensor matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #11: 4D tensor matrix multiplication Speedup: 0.04× Mem eff: 1.00× L1 #13: Matmul for symmetric matrices Speedup: 0.02× Mem eff: 1.00× L1 #14: Matmul for upper triangular matrices Speedup: 0.15× Mem eff: 1.29× L1 #15: Matmul for lower triangular matrices Speedup: 0.15× Mem eff: 1.29× L1 #16: Matmul with transposed A Speedup: 0.03× Mem eff: 1.00× L1 #17: Matmul with transposed B Speedup: 0.02× Mem eff: 1.00× L1 #18: Matmul with transposed both Speedup: 0.02× Mem eff: 1.00× L1 #19: ReLU Speedup: 1.00× Mem eff: 1.00× L1 #20: LeakyReLU Speedup: 1.00× Mem eff: 1.00× L1 #21: Sigmoid Speedup: 0.58× Mem eff: 1.00× L1 #22: Tanh Speedup: 0.60× Mem eff: 1.00× L1 #23: Softmax Speedup: 1.49× Mem eff: 1.00× L1 #24: LogSoftmax Speedup: 1.00× Mem eff: 1.00× L1 #25: Swish Speedup: 1.45× Mem eff: 1.50× L1 #26: GELU Speedup: 0.98× Mem eff: 1.00× L1 #27: SELU Speedup: 0.61× Mem eff: 1.00× L1 #28: HardSigmoid Speedup: 1.00× Mem eff: 1.00× L1 #29: Softplus Speedup: 0.67× Mem eff: 1.00× L1 #30: Softsign Speedup: 2.07× Mem eff: 1.50× L1 #31: ELU Speedup: 0.99× Mem eff: 1.00× L1 #32: HardTanh Speedup: 1.00× Mem eff: 1.00× L1 #33: BatchNorm Speedup: 0.99× Mem eff: 1.00× L1 #34: InstanceNorm Speedup: 1.00× Mem eff: 1.00× L1 #35: GroupNorm Speedup: 1.05× Mem eff: 1.00× L1 #37: FrobeniusNorm Speedup: 1.25× Mem eff: 1.00× L1 #38: L1Norm Speedup: 1.33× Mem eff: 1.00× L1 #39: L2Norm 
Speedup: 0.90× Mem eff: 1.00× L1 #40: LayerNorm Speedup: 2.52× Mem eff: 1.00× L1 #41: Max Pooling 1D Speedup: 2.28× Mem eff: 2.00× L1 #42: Max Pooling 2D Speedup: 1.35× Mem eff: 2.01× L1 #43: Max Pooling 3D Speedup: 1.17× Mem eff: 1.21× L1 #45: Average Pooling 2D Speedup: 1.12× Mem eff: 1.00× L1 #46: Average Pooling 3D Speedup: 1.55× Mem eff: 1.00× L1 #47: Sum reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #48: Mean reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #49: Max reduction over a dimension Speedup: 1.19× Mem eff: 1.01× L1 #50: conv standard 2D square input square kernel Speedup: 0.12× Mem eff: 2.07× L1 #51: Argmax over a dimension Speedup: 1.20× Mem eff: 1.00× L1 #52: Argmin over a dimension Speedup: 1.19× Mem eff: 1.00× L1 #53: Min reduction over a dimension Speedup: 1.18× Mem eff: 1.01× L1 #54: conv standard 3D square input square kernel Speedup: 0.06× Mem eff: 1.03× L1 #55: conv standard 2D asymmetric input square kernel Speedup: 0.01× Mem eff: 1.34× L1 #56: conv standard 2D asymmetric input asymmetric kernel Speedup: 0.02× Mem eff: 2.04× L1 #57: conv transposed 2D square input square kernel Speedup: 0.03× Mem eff: 2.01× L1 #58: conv transposed 3D asymmetric input asymmetric kernel Speedup: 0.05× Mem eff: 2.29× L1 #59: conv standard 3D asymmetric input square kernel Speedup: 0.08× Mem eff: 1.01× L1 #60: conv standard 3D square input asymmetric kernel Speedup: 0.09× Mem eff: 1.04× L1 #61: conv transposed 3D square input square kernel Speedup: 0.03× Mem eff: 2.04× L1 #62: conv standard 2D square input asymmetric kernel Speedup: 0.03× Mem eff: 2.04× L1 #63: conv standard 2D square input square kernel Speedup: 0.02× Mem eff: 1.11× L1 #64: conv transposed 1D Speedup: 0.04× Mem eff: 2.01× L1 #65: conv transposed 2D square input asymmetric kernel Speedup: 0.03× Mem eff: 2.03× L1 #66: conv standard 3D asymmetric input asymmetric kernel Speedup: 0.11× Mem eff: 1.07× L1 #67: conv standard 1D Speedup: 0.05× Mem eff: 1.34× L1 #68: conv 
transposed 3D square input asymmetric kernel Speedup: 0.03× Mem eff: 2.02× L1 #69: conv transposed 2D asymmetric input asymmetric kernel Speedup: 0.02× Mem eff: 2.02× L1 #70: conv transposed 3D asymmetric input square kernel Speedup: 0.07× Mem eff: 2.02× L1 #71: conv transposed 2D asymmetric input square kernel Speedup: 0.07× Mem eff: 2.03× L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped Speedup: 0.42× Mem eff: 1.26× L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped Speedup: 0.02× Mem eff: 2.03× L1 #74: conv transposed 1D dilated Speedup: 0.08× Mem eff: 2.02× L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated Speedup: 0.47× Mem eff: 1.04× L1 #76: conv standard 1D dilated strided Speedup: 0.04× Mem eff: 1.60× L1 #77: conv transposed 3D square input square kernel padded dilated strided Speedup: 0.09× Mem eff: 2.05× L1 #78: conv transposed 2D asymmetric input asymmetric kernel padded Speedup: 0.05× Mem eff: 2.03× L1 #79: conv transposed 1D asymmetric input square kernel padded strided dilated Speedup: 0.14× Mem eff: 2.02× L1 #80: conv standard 2D square input asymmetric kernel dilated padded Speedup: 0.03× Mem eff: 2.04× L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided Speedup: 0.19× Mem eff: 2.04× L1 #82: conv depthwise 2D square input square kernel Speedup: 1.15× Mem eff: 1.02× L1 #83: conv depthwise 2D square input asymmetric kernel Speedup: 1.97× Mem eff: 1.03× L1 #84: conv depthwise 2D asymmetric input square kernel Speedup: 1.09× Mem eff: 1.00× L1 #85: conv depthwise 2D asymmetric input asymmetric kernel Speedup: 1.12× Mem eff: 1.03× L1 #86: conv depthwise separable 2D Speedup: 0.02× Mem eff: 1.34× L1 #87: conv pointwise 2D Speedup: 0.03× Mem eff: 1.00× L1 #89: cumsum Speedup: 0.87× Mem eff: 1.00× L1 #90: cumprod Speedup: 1.17× Mem eff: 1.00× L1 #91: cumsum reverse Speedup: 2.87× Mem eff: 1.50× L1 #92: cumsum exclusive 
Speedup: 1.58× Mem eff: 1.50× L1 #93: masked cumsum Speedup: 1.18× Mem eff: 1.78× L1 #96: HuberLoss Speedup: 1.70× Mem eff: 2.00× L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU Speedup: 2.26× Mem eff: 1.01× L2 #4: Conv2d Mish Mish Speedup: 0.99× Mem eff: 1.01× L2 #5: ConvTranspose2d Subtract Tanh Speedup: 1.08× Mem eff: 1.01× L2 #6: Conv3d Softmax MaxPool MaxPool Speedup: 1.17× Mem eff: 2.04× L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd Speedup: 1.18× Mem eff: 1.81× L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum Speedup: 1.06× Mem eff: 1.67× L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.63× L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh Speedup: 1.18× Mem eff: 1.00× L2 #11: ConvTranspose2d BatchNorm Tanh MaxPool GroupNorm Speedup: 1.01× Mem eff: 0.87× L2 #12: Gemm Multiply LeakyReLU Speedup: 0.14× Mem eff: 0.63× L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling Speedup: 1.02× Mem eff: 1.01× L2 #15: ConvTranspose3d BatchNorm Subtract Speedup: 1.01× Mem eff: 0.74× L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling Speedup: 1.36× Mem eff: 1.01× L2 #17: Conv2d InstanceNorm Divide Speedup: 1.14× Mem eff: 1.01× L2 #18: Matmul Sum Max AvgPool LogSumExp LogSumExp Speedup: 0.84× Mem eff: 0.61× L2 #19: ConvTranspose2d GELU GroupNorm Speedup: 1.07× Mem eff: 1.00× L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd Speedup: 1.86× Mem eff: 1.47× L2 #21: Conv2d Add Scale Sigmoid GroupNorm Speedup: 1.20× Mem eff: 1.01× L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish Speedup: 0.18× Mem eff: 0.63× L2 #24: Conv3d Min Softmax Speedup: 1.02× Mem eff: 1.19× L2 #25: Conv2d Min Tanh Tanh Speedup: 1.08× Mem eff: 1.04× L2 #26: ConvTranspose3d Add HardSwish Speedup: 1.37× Mem eff: 1.31× L2 #27: Conv3d HardSwish GroupNorm Mean Speedup: 1.22× Mem eff: 1.02× L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply Speedup: 0.17× Mem eff: 0.65× L2 #29: Matmul Mish Mish Speedup: 0.15× Mem eff: 0.63× L2 #30: Gemm GroupNorm Hardtanh Speedup: 0.18× Mem 
eff: 0.60× L2 #31: Conv2d Min Add Multiply Speedup: 1.23× Mem eff: 1.01× L2 #32: Conv2d Scaling Min Speedup: 1.45× Mem eff: 1.25× L2 #33: Gemm Scale BatchNorm Speedup: 0.15× Mem eff: 0.57× L2 #34: ConvTranspose3d LayerNorm GELU Scaling Speedup: 2.29× Mem eff: 1.01× L2 #35: Conv2d Subtract HardSwish MaxPool Mish Speedup: 1.31× Mem eff: 1.25× L2 #36: ConvTranspose2d Min Sum GELU Add Speedup: 0.39× Mem eff: 1.03× L2 #37: Matmul Swish Sum GroupNorm Speedup: 0.51× Mem eff: 1.41× L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply Speedup: 1.25× Mem eff: 1.01× L2 #39: Gemm Scale BatchNorm Speedup: 0.18× Mem eff: 0.73× L2 #40: Matmul Scaling ResidualAdd Speedup: 0.18× Mem eff: 1.67× L2 #41: Gemm BatchNorm GELU ReLU Speedup: 0.19× Mem eff: 0.72× L2 #42: ConvTranspose2d GlobalAvgPool BiasAdd LogSumExp Sum Multiply Speedup: 0.95× Mem eff: 1.01× L2 #43: Conv3d Max LogSumExp ReLU Speedup: 1.13× Mem eff: 1.03× L2 #44: ConvTranspose2d Multiply GlobalAvgPool GlobalAvgPool Mean Speedup: 1.14× Mem eff: 1.03× L2 #45: Gemm Sigmoid LogSumExp Speedup: 0.17× Mem eff: 0.94× L2 #46: Conv2d Subtract Tanh Subtract AvgPool Speedup: 1.66× Mem eff: 1.25× L2 #47: Conv3d Mish Tanh Speedup: 1.01× Mem eff: 1.03× L2 #48: Conv3d Scaling Tanh Multiply Sigmoid Speedup: 1.18× Mem eff: 1.03× L2 #49: ConvTranspose3d Softmax Sigmoid Speedup: 1.55× Mem eff: 1.03× L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling Speedup: 1.30× Mem eff: 1.02× L2 #52: Conv2d Activation BatchNorm Speedup: 1.19× Mem eff: 1.42× L2 #53: Gemm Scaling Hardtanh GELU Speedup: 0.16× Mem eff: 0.71× L2 #54: Conv2d Multiply LeakyReLU GELU Speedup: 1.16× Mem eff: 1.01× L2 #55: Matmul MaxPool Sum Scale Speedup: 0.22× Mem eff: 0.51× L2 #56: Matmul Sigmoid Sum Speedup: 0.22× Mem eff: 0.51× L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp Speedup: 1.46× Mem eff: 1.06× L2 #59: Matmul Swish Scaling Speedup: 0.22× Mem eff: 0.51× L2 #60: ConvTranspose3d Swish GroupNorm HardSwish Speedup: 1.06× Mem eff: 1.48× L2 #61: 
ConvTranspose3d ReLU GroupNorm Speedup: 0.93× Mem eff: 1.03× L2 #62: Matmul GroupNorm LeakyReLU Sum Speedup: 0.26× Mem eff: 0.61× L2 #63: Gemm ReLU Divide Speedup: 0.14× Mem eff: 0.63× L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU Speedup: 0.16× Mem eff: 0.63× L2 #65: Conv2d AvgPool Sigmoid Sum Speedup: 1.14× Mem eff: 1.06× L2 #66: Matmul Dropout Softmax Speedup: 0.22× Mem eff: 0.51× L2 #67: Conv2d GELU GlobalAvgPool Speedup: 1.10× Mem eff: 1.90× L2 #68: Matmul Min Subtract Speedup: 0.22× Mem eff: 0.51× L2 #69: Conv2d HardSwish ReLU Speedup: 1.07× Mem eff: 1.94× L2 #70: Gemm Sigmoid Scaling ResidualAdd Speedup: 0.15× Mem eff: 0.68× L2 #71: Conv2d Divide LeakyReLU Speedup: 1.08× Mem eff: 1.94× L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool Speedup: 1.05× Mem eff: 1.02× L2 #73: Conv2d BatchNorm Scaling Speedup: 1.00× Mem eff: 0.70× L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max Speedup: 1.66× Mem eff: 1.06× L2 #75: Gemm GroupNorm Min BiasAdd Speedup: 0.27× Mem eff: 0.64× L2 #76: Gemm Add ReLU Speedup: 0.15× Mem eff: 0.63× L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool Speedup: 1.03× Mem eff: 1.05× L2 #78: ConvTranspose3d Max Max Sum Speedup: 1.14× Mem eff: 1.01× L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max Speedup: 1.39× Mem eff: 1.97× L2 #81: Gemm Swish Divide Clamp Tanh Clamp Speedup: 0.17× Mem eff: 0.68× L2 #82: Conv2d Tanh Scaling BiasAdd Max Speedup: 1.79× Mem eff: 1.80× L2 #84: Gemm BatchNorm Scaling Softmax Speedup: 0.16× Mem eff: 0.55× L2 #85: Conv2d GroupNorm Scale MaxPool Clamp Speedup: 1.57× Mem eff: 1.84× L2 #86: Matmul Divide GELU Speedup: 0.14× Mem eff: 0.63× L2 #87: Conv2d Subtract Subtract Mish Speedup: 1.19× Mem eff: 1.01× L2 #88: Gemm GroupNorm Swish Multiply Swish Speedup: 0.24× Mem eff: 0.65× L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max Speedup: 1.12× Mem eff: 1.02× L2 #90: Conv3d LeakyReLU Sum Clamp GELU Speedup: 1.25× Mem eff: 1.01× L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid 
Speedup: 2.20× Mem eff: 1.01× L2 #93: ConvTranspose2d Add Min GELU Multiply Speedup: 1.52× Mem eff: 1.01× L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm Speedup: 0.22× Mem eff: 0.60× L2 #95: Matmul Add Swish Tanh GELU Hardtanh Speedup: 0.17× Mem eff: 0.68× L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp Speedup: 1.17× Mem eff: 1.02× L2 #97: Matmul BatchNorm BiasAdd Divide Swish Speedup: 0.17× Mem eff: 0.59× L2 #98: Matmul AvgPool GELU Scale Max Speedup: 0.15× Mem eff: 0.58× L2 #99: Matmul GELU Softmax Speedup: 0.15× Mem eff: 0.60× L2 #100: ConvTranspose3d Clamp Min Divide Speedup: 1.07× Mem eff: 1.00× L3 #1: MLP Speedup: 0.21× Mem eff: 0.51× L3 #2: ShallowWideMLP Speedup: 0.21× Mem eff: 0.50× L3 #3: DeepNarrowMLP Speedup: 0.27× Mem eff: 0.55× L3 #5: AlexNet Speedup: 1.06× Mem eff: 0.94× L3 #6: GoogleNetInceptionModule Speedup: 1.20× Mem eff: 1.14× L3 #7: GoogleNetInceptionV1 Speedup: 0.87× Mem eff: 0.83× L3 #8: ResNetBasicBlock Speedup: 1.00× Mem eff: 0.61× L3 #9: ResNet18 Speedup: 0.91× Mem eff: 0.60× L3 #10: ResNet101 Speedup: 0.90× Mem eff: 0.20× L3 #11: VGG16 Speedup: 1.00× Mem eff: 0.67× L3 #12: VGG19 Speedup: 1.03× Mem eff: 0.66× L3 #13: DenseNet121TransitionLayer Speedup: 1.00× Mem eff: 0.90× L3 #14: DenseNet121DenseBlock Speedup: 1.00× Mem eff: 0.48× L3 #15: DenseNet121 Speedup: 0.82× Mem eff: 0.18× L3 #16: DenseNet201 Speedup: 0.98× Mem eff: 0.15× L3 #17: SqueezeNetFireModule Speedup: 1.07× Mem eff: 1.00× L3 #18: SqueezeNet Speedup: 1.04× Mem eff: 0.88× L3 #19: MobileNetV1 Speedup: 0.94× Mem eff: 0.28× L3 #20: MobileNetV2 Speedup: 0.90× Mem eff: 0.26× L3 #21: EfficientNetMBConv Speedup: 1.00× Mem eff: 0.54× L3 #22: EfficientNetB0 Speedup: 0.94× Mem eff: 0.31× L3 #23: EfficientNetB1 Speedup: 0.90× Mem eff: 0.36× L3 #24: EfficientNetB2 Speedup: 0.84× Mem eff: 0.64× L3 #25: ShuffleNetUnit Speedup: 1.02× Mem eff: 1.01× L3 #26: ShuffleNet Speedup: 1.00× Mem eff: 0.70× L3 #27: RegNet Speedup: 1.00× Mem eff: 0.50× L3 #29: SwinMLP Speedup: 0.79× Mem eff: 
0.70× L3 #30: SwinTransformerV2 Speedup: 0.77× Mem eff: 0.42× L3 #31: VisionAttention Speedup: 0.80× Mem eff: 1.00× L3 #32: ConvolutionalVisionTransformer Speedup: 0.89× Mem eff: 0.82× L3 #33: VanillaRNN Speedup: 0.16× Mem eff: 0.35× L3 #37: LSTMCn Speedup: 0.37× Mem eff: 2.21× L3 #38: LSTMBidirectional Speedup: 1.01× Mem eff: 0.95× L3 #43: MinGPTCausalAttention Speedup: 0.47× Mem eff: 1.02× L3 #44: MiniGPTBlock Speedup: 0.50× Mem eff: 1.17× L3 #46: NetVladWithGhostClusters Speedup: 0.55× Mem eff: 1.08× L3 #47: NetVladNoGhostClusters Speedup: 1.06× Mem eff: 1.14× L3 #50: ReLUSelfAttention Speedup: 0.80× Mem eff: 1.41×

Gemini 3.1 Pro (high)

Per-problem results (chart axes: Speedup (×), Mem Efficiency (×))
L1 #1: Square matrix multiplication Speedup: 0.08× Mem eff: 1.00× L1 #2: Standard matrix multiplication Speedup: 0.02× Mem eff: 1.00× L1 #3: Batched matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #4: Matrix vector multiplication Speedup: 1.23× Mem eff: 1.00× L1 #5: Matrix scalar multiplication Speedup: 0.99× Mem eff: 1.00× L1 #6: Matmul with large K dimension Speedup: 0.16× Mem eff: 0.97× L1 #7: Matmul with small K dimension Speedup: 0.32× Mem eff: 1.00× L1 #8: Matmul with irregular shapes Speedup: 0.07× Mem eff: 1.00× L1 #9: Tall skinny matrix multiplication Speedup: 0.42× Mem eff: 1.00× L1 #10: 3D tensor matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #11: 4D tensor matrix multiplication Speedup: 0.11× Mem eff: 1.00× L1 #13: Matmul for symmetric matrices Speedup: 0.09× Mem eff: 1.00× L1 #14: Matmul for upper triangular matrices Speedup: 0.43× Mem eff: 1.29× L1 #15: Matmul for lower triangular matrices Speedup: 0.42× Mem eff: 1.29× L1 #16: Matmul with transposed A Speedup: 0.03× Mem eff: 1.00× L1 #17: Matmul with transposed B Speedup: 0.09× Mem eff: 1.00× L1 #18: Matmul with transposed both Speedup: 0.02× Mem eff: 1.00× L1 #19: ReLU Speedup: 1.00× Mem eff: 1.00× L1 #20: LeakyReLU Speedup: 1.00× Mem eff: 1.00× L1 #21: Sigmoid Speedup: 1.00× Mem eff: 1.00× L1 #22: Tanh Speedup: 1.01× Mem eff: 1.00× L1 #23: Softmax Speedup: 1.51× Mem eff: 1.00× L1 #24: LogSoftmax Speedup: 1.31× Mem eff: 1.00× L1 #25: Swish Speedup: 2.48× Mem eff: 1.50× L1 #26: GELU Speedup: 0.99× Mem eff: 1.00× L1 #27: SELU Speedup: 1.00× Mem eff: 1.00× L1 #28: HardSigmoid Speedup: 1.00× Mem eff: 1.00× L1 #29: Softplus Speedup: 1.19× Mem eff: 1.00× L1 #30: Softsign Speedup: 3.47× Mem eff: 1.50× L1 #31: ELU Speedup: 1.00× Mem eff: 1.00× L1 #32: HardTanh Speedup: 1.00× Mem eff: 1.00× L1 #33: BatchNorm Speedup: 1.24× Mem eff: 1.00× L1 #34: InstanceNorm Speedup: 1.50× Mem eff: 1.00× L1 #35: GroupNorm Speedup: 
1.57× Mem eff: 1.00× L1 #36: RMSNorm Speedup: 2.99× Mem eff: 1.01× L1 #37: FrobeniusNorm Speedup: 1.27× Mem eff: 1.00× L1 #38: L1Norm Speedup: 1.53× Mem eff: 1.00× L1 #39: L2Norm Speedup: 0.90× Mem eff: 1.00× L1 #41: Max Pooling 1D Speedup: 2.88× Mem eff: 2.00× L1 #42: Max Pooling 2D Speedup: 1.36× Mem eff: 2.01× L1 #43: Max Pooling 3D Speedup: 1.10× Mem eff: 1.21× L1 #44: Average Pooling 1D Speedup: 3.32× Mem eff: 1.01× L1 #45: Average Pooling 2D Speedup: 1.14× Mem eff: 1.00× L1 #46: Average Pooling 3D Speedup: 1.79× Mem eff: 1.00× L1 #47: Sum reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #48: Mean reduction over a dimension Speedup: 0.92× Mem eff: 1.00× L1 #49: Max reduction over a dimension Speedup: 1.21× Mem eff: 1.01× L1 #51: Argmax over a dimension Speedup: 1.19× Mem eff: 1.00× L1 #52: Argmin over a dimension Speedup: 1.20× Mem eff: 1.00× L1 #53: Min reduction over a dimension Speedup: 1.18× Mem eff: 1.01× L1 #54: conv standard 3D square input square kernel Speedup: 0.06× Mem eff: 1.03× L1 #57: conv transposed 2D square input square kernel Speedup: 1.02× Mem eff: 1.00× L1 #58: conv transposed 3D asymmetric input asymmetric kernel Speedup: 1.02× Mem eff: 1.15× L1 #59: conv standard 3D asymmetric input square kernel Speedup: 1.00× Mem eff: 1.01× L1 #60: conv standard 3D square input asymmetric kernel Speedup: 1.00× Mem eff: 1.04× L1 #61: conv transposed 3D square input square kernel Speedup: 1.65× Mem eff: 1.02× L1 #65: conv transposed 2D square input asymmetric kernel Speedup: 1.00× Mem eff: 1.02× L1 #66: conv standard 3D asymmetric input asymmetric kernel Speedup: 1.00× Mem eff: 1.07× L1 #68: conv transposed 3D square input asymmetric kernel Speedup: 1.00× Mem eff: 1.01× L1 #70: conv transposed 3D asymmetric input square kernel Speedup: 1.03× Mem eff: 1.01× L1 #71: conv transposed 2D asymmetric input square kernel Speedup: 1.00× Mem eff: 1.01× L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped Speedup: 1.00× Mem 
eff: 1.26× L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped Speedup: 1.00× Mem eff: 1.01× L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated Speedup: 0.55× Mem eff: 1.04× L1 #77: conv transposed 3D square input square kernel padded dilated strided Speedup: 0.98× Mem eff: 1.00× L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided Speedup: 0.22× Mem eff: 2.04× L1 #82: conv depthwise 2D square input square kernel Speedup: 1.00× Mem eff: 1.02× L1 #83: conv depthwise 2D square input asymmetric kernel Speedup: 1.58× Mem eff: 1.03× L1 #84: conv depthwise 2D asymmetric input square kernel Speedup: 1.09× Mem eff: 1.00× L1 #85: conv depthwise 2D asymmetric input asymmetric kernel Speedup: 1.07× Mem eff: 1.03× L1 #86: conv depthwise separable 2D Speedup: 0.98× Mem eff: 1.01× L1 #89: cumsum Speedup: 1.16× Mem eff: 1.00× L1 #90: cumprod Speedup: 1.26× Mem eff: 1.00× L1 #91: cumsum reverse Speedup: 2.81× Mem eff: 1.50× L1 #92: cumsum exclusive Speedup: 3.36× Mem eff: 1.50× L1 #94: MSELoss Speedup: 3.12× Mem eff: 2.00× L1 #96: HuberLoss Speedup: 2.05× Mem eff: 2.00× L1 #99: TripletMarginLoss Speedup: 4.27× Mem eff: 1.68× L2 #1: Conv2D ReLU BiasAdd Speedup: 1.16× Mem eff: 1.01× L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide Speedup: 1.50× Mem eff: 1.01× L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU Speedup: 4.24× Mem eff: 1.01× L2 #4: Conv2d Mish Mish Speedup: 1.07× Mem eff: 1.01× L2 #5: ConvTranspose2d Subtract Tanh Speedup: 1.12× Mem eff: 1.01× L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd Speedup: 1.17× Mem eff: 1.02× L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum Speedup: 1.04× Mem eff: 1.67× L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.63× L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh Speedup: 1.18× Mem eff: 1.00× L2 #11: ConvTranspose2d BatchNorm Tanh MaxPool GroupNorm Speedup: 1.15× Mem eff: 1.04× L2 #12: Gemm Multiply LeakyReLU 
Speedup: 0.14× Mem eff: 0.60× L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling Speedup: 1.02× Mem eff: 1.01× L2 #15: ConvTranspose3d BatchNorm Subtract Speedup: 0.97× Mem eff: 0.74× L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling Speedup: 1.39× Mem eff: 1.01× L2 #17: Conv2d InstanceNorm Divide Speedup: 1.04× Mem eff: 1.01× L2 #18: Matmul Sum Max AvgPool LogSumExp LogSumExp Speedup: 4.00× Mem eff: 0.50× L2 #19: ConvTranspose2d GELU GroupNorm Speedup: 1.14× Mem eff: 1.00× L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd Speedup: 1.86× Mem eff: 1.47× L2 #21: Conv2d Add Scale Sigmoid GroupNorm Speedup: 1.23× Mem eff: 1.01× L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish Speedup: 0.18× Mem eff: 0.63× L2 #24: Conv3d Min Softmax Speedup: 1.02× Mem eff: 1.19× L2 #25: Conv2d Min Tanh Tanh Speedup: 1.08× Mem eff: 1.04× L2 #26: ConvTranspose3d Add HardSwish Speedup: 1.47× Mem eff: 1.31× L2 #27: Conv3d HardSwish GroupNorm Mean Speedup: 1.22× Mem eff: 1.02× L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply Speedup: 0.17× Mem eff: 0.62× L2 #29: Matmul Mish Mish Speedup: 0.15× Mem eff: 0.60× L2 #31: Conv2d Min Add Multiply Speedup: 1.23× Mem eff: 1.01× L2 #32: Conv2d Scaling Min Speedup: 1.20× Mem eff: 1.25× L2 #33: Gemm Scale BatchNorm Speedup: 0.15× Mem eff: 0.60× L2 #34: ConvTranspose3d LayerNorm GELU Scaling Speedup: 3.02× Mem eff: 1.01× L2 #35: Conv2d Subtract HardSwish MaxPool Mish Speedup: 1.33× Mem eff: 1.25× L2 #36: ConvTranspose2d Min Sum GELU Add Speedup: 1.07× Mem eff: 1.03× L2 #37: Matmul Swish Sum GroupNorm Speedup: 0.51× Mem eff: 1.41× L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply Speedup: 1.47× Mem eff: 1.01× L2 #39: Gemm Scale BatchNorm Speedup: 0.19× Mem eff: 0.87× L2 #40: Matmul Scaling ResidualAdd Speedup: 0.18× Mem eff: 1.21× L2 #41: Gemm BatchNorm GELU ReLU Speedup: 0.19× Mem eff: 0.73× L2 #43: Conv3d Max LogSumExp ReLU Speedup: 1.13× Mem eff: 1.03× L2 #44: ConvTranspose2d Multiply GlobalAvgPool GlobalAvgPool Mean Speedup: 
1.23× Mem eff: 1.03× L2 #45: Gemm Sigmoid LogSumExp Speedup: 0.17× Mem eff: 1.25× L2 #46: Conv2d Subtract Tanh Subtract AvgPool Speedup: 1.14× Mem eff: 1.01× L2 #47: Conv3d Mish Tanh Speedup: 1.05× Mem eff: 1.03× L2 #48: Conv3d Scaling Tanh Multiply Sigmoid Speedup: 1.18× Mem eff: 1.03× L2 #49: ConvTranspose3d Softmax Sigmoid Speedup: 1.59× Mem eff: 1.03× L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling Speedup: 1.32× Mem eff: 1.02× L2 #52: Conv2d Activation BatchNorm Speedup: 1.18× Mem eff: 1.42× L2 #53: Gemm Scaling Hardtanh GELU Speedup: 0.16× Mem eff: 0.65× L2 #54: Conv2d Multiply LeakyReLU GELU Speedup: 1.15× Mem eff: 1.01× L2 #55: Matmul MaxPool Sum Scale Speedup: 0.22× Mem eff: 0.51× L2 #56: Matmul Sigmoid Sum Speedup: 0.22× Mem eff: 0.51× L2 #57: Conv2d ReLU HardSwish Speedup: 1.65× Mem eff: 1.50× L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp Speedup: 1.46× Mem eff: 1.06× L2 #59: Matmul Swish Scaling Speedup: 0.22× Mem eff: 0.51× L2 #60: ConvTranspose3d Swish GroupNorm HardSwish Speedup: 1.22× Mem eff: 1.48× L2 #61: ConvTranspose3d ReLU GroupNorm Speedup: 1.35× Mem eff: 1.03× L2 #62: Matmul GroupNorm LeakyReLU Sum Speedup: 0.27× Mem eff: 0.61× L2 #63: Gemm ReLU Divide Speedup: 0.14× Mem eff: 0.60× L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU Speedup: 0.16× Mem eff: 0.63× L2 #65: Conv2d AvgPool Sigmoid Sum Speedup: 1.14× Mem eff: 1.06× L2 #67: Conv2d GELU GlobalAvgPool Speedup: 1.10× Mem eff: 1.90× L2 #68: Matmul Min Subtract Speedup: 0.22× Mem eff: 0.51× L2 #69: Conv2d HardSwish ReLU Speedup: 1.06× Mem eff: 1.03× L2 #70: Gemm Sigmoid Scaling ResidualAdd Speedup: 0.15× Mem eff: 0.65× L2 #71: Conv2d Divide LeakyReLU Speedup: 1.19× Mem eff: 1.03× L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool Speedup: 1.05× Mem eff: 1.02× L2 #73: Conv2d BatchNorm Scaling Speedup: 1.00× Mem eff: 0.70× L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max Speedup: 1.68× Mem eff: 1.06× L2 #75: Gemm GroupNorm Min BiasAdd Speedup: 0.27× Mem eff: 
0.64× L2 #76: Gemm Add ReLU Speedup: 0.15× Mem eff: 0.60× L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool Speedup: 1.02× Mem eff: 1.04× L2 #78: ConvTranspose3d Max Max Sum Speedup: 1.15× Mem eff: 1.01× L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max Speedup: 1.34× Mem eff: 1.97× L2 #81: Gemm Swish Divide Clamp Tanh Clamp Speedup: 0.17× Mem eff: 0.65× L2 #82: Conv2d Tanh Scaling BiasAdd Max Speedup: 1.69× Mem eff: 1.56× L2 #84: Gemm BatchNorm Scaling Softmax Speedup: 0.16× Mem eff: 0.57× L2 #85: Conv2d GroupNorm Scale MaxPool Clamp Speedup: 1.25× Mem eff: 1.00× L2 #86: Matmul Divide GELU Speedup: 0.14× Mem eff: 0.60× L2 #87: Conv2d Subtract Subtract Mish Speedup: 1.19× Mem eff: 1.01× L2 #88: Gemm GroupNorm Swish Multiply Swish Speedup: 0.24× Mem eff: 0.65× L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max Speedup: 1.19× Mem eff: 1.02× L2 #90: Conv3d LeakyReLU Sum Clamp GELU Speedup: 1.23× Mem eff: 1.01× L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid Speedup: 2.35× Mem eff: 1.01× L2 #93: ConvTranspose2d Add Min GELU Multiply Speedup: 1.68× Mem eff: 1.01× L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm Speedup: 0.22× Mem eff: 0.60× L2 #95: Matmul Add Swish Tanh GELU Hardtanh Speedup: 0.18× Mem eff: 0.65× L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp Speedup: 1.17× Mem eff: 1.02× L2 #97: Matmul BatchNorm BiasAdd Divide Swish Speedup: 0.18× Mem eff: 0.62× L2 #98: Matmul AvgPool GELU Scale Max Speedup: 0.15× Mem eff: 0.58× L2 #99: Matmul GELU Softmax Speedup: 0.15× Mem eff: 0.60× L2 #100: ConvTranspose3d Clamp Min Divide Speedup: 1.06× Mem eff: 1.00× L3 #1: MLP Speedup: 0.20× Mem eff: 0.51× L3 #2: ShallowWideMLP Speedup: 0.21× Mem eff: 0.50× L3 #3: DeepNarrowMLP Speedup: 0.27× Mem eff: 0.55× L3 #4: LeNet5 Speedup: 1.03× Mem eff: 1.00× L3 #5: AlexNet Speedup: 0.93× Mem eff: 0.94× L3 #6: GoogleNetInceptionModule Speedup: 1.47× Mem eff: 1.14× L3 #7: GoogleNetInceptionV1 Speedup: 0.88× Mem eff: 0.48× L3 #8: ResNetBasicBlock Speedup: 
1.02× Mem eff: 0.61× L3 #9: ResNet18 Speedup: 0.92× Mem eff: 0.60× L3 #10: ResNet101 Speedup: 0.90× Mem eff: 0.48× L3 #11: VGG16 Speedup: 0.92× Mem eff: 0.66× L3 #12: VGG19 Speedup: 0.90× Mem eff: 0.57× L3 #13: DenseNet121TransitionLayer Speedup: 1.72× Mem eff: 1.79× L3 #14: DenseNet121DenseBlock Speedup: 1.10× Mem eff: 0.42× L3 #15: DenseNet121 Speedup: 0.86× Mem eff: 0.61× L3 #16: DenseNet201 Speedup: 0.88× Mem eff: 0.15× L3 #17: SqueezeNetFireModule Speedup: 1.27× Mem eff: 1.00× L3 #18: SqueezeNet Speedup: 1.05× Mem eff: 0.88× L3 #19: MobileNetV1 Speedup: 0.92× Mem eff: 0.68× L3 #20: MobileNetV2 Speedup: 0.88× Mem eff: 0.64× L3 #21: EfficientNetMBConv Speedup: 0.91× Mem eff: 0.78× L3 #22: EfficientNetB0 Speedup: 0.81× Mem eff: 0.66× L3 #23: EfficientNetB1 Speedup: 0.92× Mem eff: 0.36× L3 #24: EfficientNetB2 Speedup: 0.83× Mem eff: 0.64× L3 #25: ShuffleNetUnit Speedup: 1.02× Mem eff: 1.01× L3 #26: ShuffleNet Speedup: 1.00× Mem eff: 0.13× L3 #27: RegNet Speedup: 1.10× Mem eff: 0.81× L3 #28: VisionTransformer Speedup: 0.86× Mem eff: 0.46× L3 #29: SwinMLP Speedup: 0.87× Mem eff: 0.69× L3 #30: SwinTransformerV2 Speedup: 0.86× Mem eff: 0.53× L3 #32: ConvolutionalVisionTransformer Speedup: 0.75× Mem eff: 0.81× L3 #33: VanillaRNN Speedup: 0.16× Mem eff: 0.51× L3 #35: LSTM Speedup: 1.02× Mem eff: 0.96× L3 #36: LSTMHn Speedup: 1.03× Mem eff: 0.96× L3 #37: LSTMCn Speedup: 1.01× Mem eff: 0.96× L3 #38: LSTMBidirectional Speedup: 1.01× Mem eff: 0.95× L3 #39: GRU Speedup: 0.98× Mem eff: 1.07× L3 #40: GRUHidden Speedup: 1.00× Mem eff: 1.07× L3 #41: GRUBidirectional Speedup: 1.01× Mem eff: 1.01× L3 #42: GRUBidirectionalHidden Speedup: 1.00× Mem eff: 1.01× L3 #43: MinGPTCausalAttention Speedup: 0.57× Mem eff: 1.60× L3 #44: MiniGPTBlock Speedup: 0.42× Mem eff: 0.89× L3 #46: NetVladWithGhostClusters Speedup: 0.89× Mem eff: 1.08× L3 #47: NetVladNoGhostClusters Speedup: 0.71× Mem eff: 1.10× L3 #50: ReLUSelfAttention Speedup: 0.80× Mem eff: 1.41×

GPT-5.5 (medium)

[Plot: Speedup (×) vs. Mem Efficiency (×); per-problem values follow]
L1 #1: Square matrix multiplication Speedup: 0.13× Mem eff: 1.00× L1 #2: Standard matrix multiplication Speedup: 1.00× Mem eff: 1.00× L1 #3: Batched matrix multiplication Speedup: 0.16× Mem eff: 1.00× L1 #4: Matrix vector multiplication Speedup: 1.25× Mem eff: 1.00× L1 #5: Matrix scalar multiplication Speedup: 0.99× Mem eff: 1.00× L1 #6: Matmul with large K dimension Speedup: 0.29× Mem eff: 1.00× L1 #7: Matmul with small K dimension Speedup: 0.17× Mem eff: 1.00× L1 #8: Matmul with irregular shapes Speedup: 1.00× Mem eff: 1.00× L1 #9: Tall skinny matrix multiplication Speedup: 0.42× Mem eff: 1.00× L1 #10: 3D tensor matrix multiplication Speedup: 0.17× Mem eff: 1.00× L1 #11: 4D tensor matrix multiplication Speedup: 0.17× Mem eff: 1.00× L1 #13: Matmul for symmetric matrices Speedup: 1.00× Mem eff: 1.00× L1 #14: Matmul for upper triangular matrices Speedup: 0.23× Mem eff: 1.29× L1 #15: Matmul for lower triangular matrices Speedup: 0.24× Mem eff: 1.29× L1 #16: Matmul with transposed A Speedup: 1.00× Mem eff: 1.00× L1 #17: Matmul with transposed B Speedup: 0.12× Mem eff: 1.00× L1 #18: Matmul with transposed both Speedup: 0.12× Mem eff: 1.00× L1 #19: ReLU Speedup: 1.16× Mem eff: 2.01× L1 #20: LeakyReLU Speedup: 1.00× Mem eff: 1.00× L1 #21: Sigmoid Speedup: 1.00× Mem eff: 1.00× L1 #22: Tanh Speedup: 1.01× Mem eff: 1.00× L1 #23: Softmax Speedup: 1.08× Mem eff: 1.00× L1 #24: LogSoftmax Speedup: 1.02× Mem eff: 1.00× L1 #25: Swish Speedup: 2.35× Mem eff: 1.50× L1 #26: GELU Speedup: 0.99× Mem eff: 1.00× L1 #27: SELU Speedup: 1.00× Mem eff: 1.00× L1 #28: HardSigmoid Speedup: 1.00× Mem eff: 1.00× L1 #29: Softplus Speedup: 1.18× Mem eff: 1.00× L1 #30: Softsign Speedup: 3.46× Mem eff: 1.50× L1 #31: ELU Speedup: 1.00× Mem eff: 1.00× L1 #32: HardTanh Speedup: 1.00× Mem eff: 1.00× L1 #33: BatchNorm Speedup: 2.52× Mem eff: 1.00× L1 #34: InstanceNorm Speedup: 1.48× Mem eff: 1.00× L1 #35: GroupNorm Speedup: 
1.70× Mem eff: 1.00× L1 #36: RMSNorm Speedup: 2.92× Mem eff: 1.01× L1 #37: FrobeniusNorm Speedup: 1.31× Mem eff: 1.00× L1 #38: L1Norm Speedup: 1.35× Mem eff: 1.00× L1 #39: L2Norm Speedup: 0.80× Mem eff: 1.00× L1 #41: Max Pooling 1D Speedup: 3.66× Mem eff: 2.00× L1 #42: Max Pooling 2D Speedup: 3.09× Mem eff: 2.01× L1 #43: Max Pooling 3D Speedup: 1.41× Mem eff: 1.21× L1 #44: Average Pooling 1D Speedup: 6.28× Mem eff: 1.01× L1 #45: Average Pooling 2D Speedup: 1.15× Mem eff: 1.00× L1 #46: Average Pooling 3D Speedup: 2.45× Mem eff: 1.00× L1 #47: Sum reduction over a dimension Speedup: 1.03× Mem eff: 1.00× L1 #48: Mean reduction over a dimension Speedup: 1.04× Mem eff: 1.00× L1 #49: Max reduction over a dimension Speedup: 1.36× Mem eff: 1.01× L1 #50: conv standard 2D square input square kernel Speedup: 0.16× Mem eff: 2.07× L1 #51: Argmax over a dimension Speedup: 1.34× Mem eff: 1.00× L1 #52: Argmin over a dimension Speedup: 1.28× Mem eff: 1.00× L1 #53: Min reduction over a dimension Speedup: 1.36× Mem eff: 1.01× L1 #54: conv standard 3D square input square kernel Speedup: 0.38× Mem eff: 1.03× L1 #55: conv standard 2D asymmetric input square kernel Speedup: 0.07× Mem eff: 1.34× L1 #56: conv standard 2D asymmetric input asymmetric kernel Speedup: 1.04× Mem eff: 1.21× L1 #57: conv transposed 2D square input square kernel Speedup: 1.01× Mem eff: 1.00× L1 #58: conv transposed 3D asymmetric input asymmetric kernel Speedup: 0.34× Mem eff: 2.29× L1 #59: conv standard 3D asymmetric input square kernel Speedup: 0.91× Mem eff: 0.97× L1 #60: conv standard 3D square input asymmetric kernel Speedup: 0.45× Mem eff: 1.04× L1 #61: conv transposed 3D square input square kernel Speedup: 0.93× Mem eff: 1.02× L1 #62: conv standard 2D square input asymmetric kernel Speedup: 0.15× Mem eff: 2.04× L1 #63: conv standard 2D square input square kernel Speedup: 0.15× Mem eff: 1.11× L1 #64: conv transposed 1D Speedup: 0.09× Mem eff: 2.01× L1 #65: conv transposed 2D square input asymmetric kernel 
Speedup: 0.98× Mem eff: 1.02× L1 #66: conv standard 3D asymmetric input asymmetric kernel Speedup: 0.51× Mem eff: 1.07× L1 #67: conv standard 1D Speedup: 0.14× Mem eff: 1.34× L1 #68: conv transposed 3D square input asymmetric kernel Speedup: 0.16× Mem eff: 1.01× L1 #69: conv transposed 2D asymmetric input asymmetric kernel Speedup: 1.11× Mem eff: 1.01× L1 #70: conv transposed 3D asymmetric input square kernel Speedup: 1.02× Mem eff: 1.01× L1 #71: conv transposed 2D asymmetric input square kernel Speedup: 0.18× Mem eff: 2.03× L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped Speedup: 0.38× Mem eff: 1.26× L1 #73: conv transposed 3D asymmetric input square kernel strided padded grouped Speedup: 0.06× Mem eff: 2.03× L1 #74: conv transposed 1D dilated Speedup: 0.12× Mem eff: 2.02× L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated Speedup: 2.94× Mem eff: 1.04× L1 #76: conv standard 1D dilated strided Speedup: 0.15× Mem eff: 1.60× L1 #77: conv transposed 3D square input square kernel padded dilated strided Speedup: 3.81× Mem eff: 1.85× L1 #78: conv transposed 2D asymmetric input asymmetric kernel padded Speedup: 0.14× Mem eff: 2.03× L1 #79: conv transposed 1D asymmetric input square kernel padded strided dilated Speedup: 0.38× Mem eff: 2.02× L1 #80: conv standard 2D square input asymmetric kernel dilated padded Speedup: 0.08× Mem eff: 2.04× L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided Speedup: 0.81× Mem eff: 2.04× L1 #82: conv depthwise 2D square input square kernel Speedup: 2.59× Mem eff: 1.02× L1 #83: conv depthwise 2D square input asymmetric kernel Speedup: 5.18× Mem eff: 1.03× L1 #84: conv depthwise 2D asymmetric input square kernel Speedup: 2.65× Mem eff: 1.00× L1 #85: conv depthwise 2D asymmetric input asymmetric kernel Speedup: 2.71× Mem eff: 1.03× L1 #86: conv depthwise separable 2D Speedup: 1.17× Mem eff: 1.01× L1 #87: conv pointwise 2D Speedup: 0.53× 
Mem eff: 1.00× L1 #89: cumsum Speedup: 2.00× Mem eff: 1.00× L1 #90: cumprod Speedup: 4.32× Mem eff: 1.00× L1 #91: cumsum reverse Speedup: 2.36× Mem eff: 1.50× L1 #92: cumsum exclusive Speedup: 3.07× Mem eff: 1.50× L1 #93: masked cumsum Speedup: 2.92× Mem eff: 1.78× L1 #94: MSELoss Speedup: 3.21× Mem eff: 2.00× L1 #96: HuberLoss Speedup: 2.13× Mem eff: 2.00× L1 #98: KLDivLoss Speedup: 5.68× Mem eff: 3.02× L1 #99: TripletMarginLoss Speedup: 4.41× Mem eff: 1.68× L2 #1: Conv2D ReLU BiasAdd Speedup: 1.24× Mem eff: 1.25× L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide Speedup: 2.45× Mem eff: 1.01× L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU Speedup: 2.49× Mem eff: 1.01× L2 #4: Conv2d Mish Mish Speedup: 1.12× Mem eff: 1.25× L2 #5: ConvTranspose2d Subtract Tanh Speedup: 1.08× Mem eff: 1.01× L2 #6: Conv3d Softmax MaxPool MaxPool Speedup: 1.43× Mem eff: 2.04× L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd Speedup: 1.78× Mem eff: 0.90× L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum Speedup: 1.05× Mem eff: 1.67× L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.63× L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh Speedup: 1.20× Mem eff: 1.00× L2 #11: ConvTranspose2d BatchNorm Tanh MaxPool GroupNorm Speedup: 1.15× Mem eff: 1.02× L2 #12: Gemm Multiply LeakyReLU Speedup: 0.14× Mem eff: 0.63× L2 #15: ConvTranspose3d BatchNorm Subtract Speedup: 2.28× Mem eff: 1.06× L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling Speedup: 1.42× Mem eff: 1.01× L2 #17: Conv2d InstanceNorm Divide Speedup: 1.20× Mem eff: 0.84× L2 #19: ConvTranspose2d GELU GroupNorm Speedup: 1.08× Mem eff: 1.00× L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd Speedup: 2.19× Mem eff: 1.47× L2 #21: Conv2d Add Scale Sigmoid GroupNorm Speedup: 1.36× Mem eff: 1.01× L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish Speedup: 0.18× Mem eff: 0.63× L2 #24: Conv3d Min Softmax Speedup: 0.80× Mem eff: 1.19× L2 #25: Conv2d Min Tanh Tanh Speedup: 1.07× Mem eff: 1.04× L2 #26: 
ConvTranspose3d Add HardSwish Speedup: 1.46× Mem eff: 1.31× L2 #27: Conv3d HardSwish GroupNorm Mean Speedup: 1.22× Mem eff: 1.02× L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply Speedup: 0.17× Mem eff: 0.62× L2 #29: Matmul Mish Mish Speedup: 0.15× Mem eff: 0.63× L2 #30: Gemm GroupNorm Hardtanh Speedup: 0.18× Mem eff: 0.60× L2 #31: Conv2d Min Add Multiply Speedup: 1.25× Mem eff: 1.01× L2 #32: Conv2d Scaling Min Speedup: 1.22× Mem eff: 1.25× L2 #33: Gemm Scale BatchNorm Speedup: 0.15× Mem eff: 0.60× L2 #34: ConvTranspose3d LayerNorm GELU Scaling Speedup: 3.69× Mem eff: 1.01× L2 #35: Conv2d Subtract HardSwish MaxPool Mish Speedup: 1.43× Mem eff: 1.25× L2 #36: ConvTranspose2d Min Sum GELU Add Speedup: 0.98× Mem eff: 1.03× L2 #37: Matmul Swish Sum GroupNorm Speedup: 0.59× Mem eff: 1.41× L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply Speedup: 1.49× Mem eff: 1.01× L2 #39: Gemm Scale BatchNorm Speedup: 0.19× Mem eff: 0.93× L2 #40: Matmul Scaling ResidualAdd Speedup: 0.18× Mem eff: 1.67× L2 #41: Gemm BatchNorm GELU ReLU Speedup: 0.19× Mem eff: 0.93× L2 #43: Conv3d Max LogSumExp ReLU Speedup: 1.05× Mem eff: 1.03× L2 #45: Gemm Sigmoid LogSumExp Speedup: 0.17× Mem eff: 1.25× L2 #46: Conv2d Subtract Tanh Subtract AvgPool Speedup: 1.73× Mem eff: 1.25× L2 #47: Conv3d Mish Tanh Speedup: 1.06× Mem eff: 1.03× L2 #48: Conv3d Scaling Tanh Multiply Sigmoid Speedup: 1.19× Mem eff: 1.88× L2 #49: ConvTranspose3d Softmax Sigmoid Speedup: 1.55× Mem eff: 1.03× L2 #52: Conv2d Activation BatchNorm Speedup: 1.26× Mem eff: 1.42× L2 #53: Gemm Scaling Hardtanh GELU Speedup: 0.16× Mem eff: 0.71× L2 #54: Conv2d Multiply LeakyReLU GELU Speedup: 1.30× Mem eff: 1.01× L2 #55: Matmul MaxPool Sum Scale Speedup: 0.40× Mem eff: 0.41× L2 #56: Matmul Sigmoid Sum Speedup: 0.22× Mem eff: 0.51× L2 #57: Conv2d ReLU HardSwish Speedup: 1.84× Mem eff: 2.83× L2 #59: Matmul Swish Scaling Speedup: 0.22× Mem eff: 0.51× L2 #60: ConvTranspose3d Swish GroupNorm HardSwish Speedup: 1.06× Mem eff: 1.48× L2 #61: 
ConvTranspose3d ReLU GroupNorm Speedup: 1.32× Mem eff: 1.05× L2 #62: Matmul GroupNorm LeakyReLU Sum Speedup: 0.27× Mem eff: 0.64× L2 #63: Gemm ReLU Divide Speedup: 0.14× Mem eff: 0.63× L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU Speedup: 0.16× Mem eff: 0.63× L2 #65: Conv2d AvgPool Sigmoid Sum Speedup: 5.66× Mem eff: 3.18× L2 #66: Matmul Dropout Softmax Speedup: 0.22× Mem eff: 0.51× L2 #67: Conv2d GELU GlobalAvgPool Speedup: 1.09× Mem eff: 1.90× L2 #68: Matmul Min Subtract Speedup: 0.22× Mem eff: 0.52× L2 #69: Conv2d HardSwish ReLU Speedup: 1.18× Mem eff: 1.94× L2 #70: Gemm Sigmoid Scaling ResidualAdd Speedup: 0.15× Mem eff: 0.68× L2 #71: Conv2d Divide LeakyReLU Speedup: 1.19× Mem eff: 1.94× L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool Speedup: 1.04× Mem eff: 1.02× L2 #73: Conv2d BatchNorm Scaling Speedup: 1.28× Mem eff: 1.03× L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max Speedup: 1.71× Mem eff: 1.06× L2 #75: Gemm GroupNorm Min BiasAdd Speedup: 0.26× Mem eff: 0.61× L2 #76: Gemm Add ReLU Speedup: 1.07× Mem eff: 0.63× L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool Speedup: 1.09× Mem eff: 1.05× L2 #78: ConvTranspose3d Max Max Sum Speedup: 1.05× Mem eff: 1.01× L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max Speedup: 1.40× Mem eff: 1.97× L2 #81: Gemm Swish Divide Clamp Tanh Clamp Speedup: 0.17× Mem eff: 0.68× L2 #82: Conv2d Tanh Scaling BiasAdd Max Speedup: 1.80× Mem eff: 1.80× L2 #84: Gemm BatchNorm Scaling Softmax Speedup: 0.16× Mem eff: 0.57× L2 #85: Conv2d GroupNorm Scale MaxPool Clamp Speedup: 1.71× Mem eff: 1.84× L2 #86: Matmul Divide GELU Speedup: 0.14× Mem eff: 0.60× L2 #87: Conv2d Subtract Subtract Mish Speedup: 1.34× Mem eff: 1.90× L2 #88: Gemm GroupNorm Swish Multiply Swish Speedup: 0.24× Mem eff: 0.68× L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max Speedup: 1.21× Mem eff: 1.02× L2 #90: Conv3d LeakyReLU Sum Clamp GELU Speedup: 1.40× Mem eff: 1.88× L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid 
Speedup: 2.19× Mem eff: 1.01× L2 #93: ConvTranspose2d Add Min GELU Multiply Speedup: 1.56× Mem eff: 1.01× L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm Speedup: 0.21× Mem eff: 0.60× L2 #95: Matmul Add Swish Tanh GELU Hardtanh Speedup: 0.18× Mem eff: 0.68× L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp Speedup: 1.20× Mem eff: 1.02× L2 #97: Matmul BatchNorm BiasAdd Divide Swish Speedup: 0.18× Mem eff: 0.68× L2 #98: Matmul AvgPool GELU Scale Max Speedup: 1.45× Mem eff: 0.59× L2 #99: Matmul GELU Softmax Speedup: 0.15× Mem eff: 0.63× L2 #100: ConvTranspose3d Clamp Min Divide Speedup: 1.17× Mem eff: 1.00× L3 #1: MLP Speedup: 0.21× Mem eff: 0.51× L3 #2: ShallowWideMLP Speedup: 0.21× Mem eff: 0.50× L3 #3: DeepNarrowMLP Speedup: 0.27× Mem eff: 0.55× L3 #5: AlexNet Speedup: 1.20× Mem eff: 1.09× L3 #6: GoogleNetInceptionModule Speedup: 1.11× Mem eff: 1.14× L3 #7: GoogleNetInceptionV1 Speedup: 0.95× Mem eff: 0.83× L3 #8: ResNetBasicBlock Speedup: 0.99× Mem eff: 0.71× L3 #9: ResNet18 Speedup: 0.94× Mem eff: 0.64× L3 #10: ResNet101 Speedup: 0.93× Mem eff: 0.21× L3 #11: VGG16 Speedup: 1.14× Mem eff: 0.77× L3 #12: VGG19 Speedup: 1.00× Mem eff: 0.66× L3 #13: DenseNet121TransitionLayer Speedup: 1.69× Mem eff: 1.79× L3 #14: DenseNet121DenseBlock Speedup: 0.99× Mem eff: 0.48× L3 #15: DenseNet121 Speedup: 0.96× Mem eff: 0.73× L3 #16: DenseNet201 Speedup: 1.02× Mem eff: 0.54× L3 #17: SqueezeNetFireModule Speedup: 0.98× Mem eff: 1.94× L3 #18: SqueezeNet Speedup: 1.09× Mem eff: 1.01× L3 #19: MobileNetV1 Speedup: 0.88× Mem eff: 0.28× L3 #20: MobileNetV2 Speedup: 0.76× Mem eff: 0.18× L3 #21: EfficientNetMBConv Speedup: 0.89× Mem eff: 0.64× L3 #22: EfficientNetB0 Speedup: 0.93× Mem eff: 0.66× L3 #23: EfficientNetB1 Speedup: 0.85× Mem eff: 0.27× L3 #24: EfficientNetB2 Speedup: 0.80× Mem eff: 0.63× L3 #25: ShuffleNetUnit Speedup: 1.13× Mem eff: 1.05× L3 #26: ShuffleNet Speedup: 0.98× Mem eff: 0.20× L3 #27: RegNet Speedup: 1.00× Mem eff: 0.50× L3 #28: VisionTransformer Speedup: 0.83× 
Mem eff: 0.46× L3 #29: SwinMLP Speedup: 0.95× Mem eff: 0.58× L3 #30: SwinTransformerV2 Speedup: 0.98× Mem eff: 0.51× L3 #32: ConvolutionalVisionTransformer Speedup: 1.99× Mem eff: 0.82× L3 #33: VanillaRNN Speedup: 0.16× Mem eff: 0.52× L3 #35: LSTM Speedup: 1.02× Mem eff: 0.96× L3 #36: LSTMHn Speedup: 0.97× Mem eff: 2.16× L3 #37: LSTMCn Speedup: 1.00× Mem eff: 0.96× L3 #38: LSTMBidirectional Speedup: 1.01× Mem eff: 0.95× L3 #40: GRUHidden Speedup: 0.55× Mem eff: 3.37× L3 #41: GRUBidirectional Speedup: 0.99× Mem eff: 1.01× L3 #42: GRUBidirectionalHidden Speedup: 0.94× Mem eff: 1.01× L3 #43: MinGPTCausalAttention Speedup: 0.62× Mem eff: 2.36× L3 #44: MiniGPTBlock Speedup: 0.51× Mem eff: 1.25× L3 #45: UNetSoftmax Speedup: 0.97× Mem eff: 0.46× L3 #46: NetVladWithGhostClusters Speedup: 0.92× Mem eff: 1.08× L3 #47: NetVladNoGhostClusters Speedup: 0.48× Mem eff: 1.10× L3 #50: ReLUSelfAttention Speedup: 0.84× Mem eff: 1.41×
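
The per-problem records above follow a regular pattern, so per-level aggregates can be recomputed from them. The following is a minimal sketch, not the leaderboard's own tooling: the record pattern is an assumption inferred from the listing, and the names `parse_records`, `geomean`, and `summarize` are hypothetical. It only reports a count of problems faster than the baseline, because the denominator used for Fast@1 (all problems in a level, including unsolved ones) cannot be recovered from the listing alone.

```python
import math
import re

# Assumed record shape (inferred from the listing, not a documented format):
#   "L<level> #<id>: <name> Speedup: <x>× Mem eff: <y>×"
RECORD = re.compile(
    r"L(?P<level>\d) #(?P<pid>\d+): (?P<name>.+?) "
    r"Speedup: (?P<speedup>\d+(?:\.\d+)?)× Mem eff: (?P<mem>\d+(?:\.\d+)?)×"
)


def parse_records(text):
    """Extract per-problem records, rejoining entries wrapped across lines."""
    text = " ".join(text.split())
    return [
        {
            "level": int(m["level"]),
            "pid": int(m["pid"]),
            "name": m["name"],
            "speedup": float(m["speedup"]),
            "mem_eff": float(m["mem"]),
        }
        for m in RECORD.finditer(text)
    ]


def geomean(values):
    """Geometric mean of strictly positive values."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(records, level):
    """Aggregate the listed problems of one level."""
    rows = [r for r in records if r["level"] == level]
    return {
        "n": len(rows),
        "geomean_speedup": geomean(r["speedup"] for r in rows),
        "geomean_mem_eff": geomean(r["mem_eff"] for r in rows),
        # Count only; dividing by the full problem count would give Fast@1.
        "faster_than_baseline": sum(r["speedup"] > 1.0 for r in rows),
    }


# Three records copied from the GPT-5.5 (medium) listing, wrapped mid-record
# on purpose to exercise the whitespace handling.
sample = """L1 #19: ReLU Speedup: 1.16× Mem eff: 2.01× L1 #25: Swish Speedup:
2.35× Mem eff: 1.50× L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.63×"""

records = parse_records(sample)
print(summarize(records, level=1))
```

Averaging in log space means a 2× speedup and a 0.5× slowdown cancel to 1×, which is why ratio metrics are usually summarized with a geometric rather than an arithmetic mean.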

Kimi K2.6

[Plot: Speedup (×) vs. Mem Efficiency (×); per-problem values follow]
L1 #1: Square matrix multiplication Speedup: 0.02× Mem eff: 1.00× L1 #2: Standard matrix multiplication Speedup: 0.02× Mem eff: 1.00× L1 #3: Batched matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #4: Matrix vector multiplication Speedup: 1.22× Mem eff: 1.00× L1 #5: Matrix scalar multiplication Speedup: 0.98× Mem eff: 1.00× L1 #6: Matmul with large K dimension Speedup: 0.02× Mem eff: 1.00× L1 #7: Matmul with small K dimension Speedup: 0.34× Mem eff: 1.00× L1 #8: Matmul with irregular shapes Speedup: 0.07× Mem eff: 1.00× L1 #9: Tall skinny matrix multiplication Speedup: 0.22× Mem eff: 1.00× L1 #10: 3D tensor matrix multiplication Speedup: 0.03× Mem eff: 1.00× L1 #11: 4D tensor matrix multiplication Speedup: 0.04× Mem eff: 1.00× L1 #13: Matmul for symmetric matrices Speedup: 0.11× Mem eff: 1.00× L1 #14: Matmul for upper triangular matrices Speedup: 0.14× Mem eff: 1.29× L1 #15: Matmul for lower triangular matrices Speedup: 0.15× Mem eff: 1.29× L1 #16: Matmul with transposed A Speedup: 0.13× Mem eff: 1.00× L1 #17: Matmul with transposed B Speedup: 0.06× Mem eff: 1.00× L1 #18: Matmul with transposed both Speedup: 0.02× Mem eff: 1.00× L1 #19: ReLU Speedup: 0.95× Mem eff: 1.00× L1 #20: LeakyReLU Speedup: 0.99× Mem eff: 1.00× L1 #21: Sigmoid Speedup: 0.70× Mem eff: 1.00× L1 #22: Tanh Speedup: 1.01× Mem eff: 1.00× L1 #23: Softmax Speedup: 1.49× Mem eff: 1.00× L1 #24: LogSoftmax Speedup: 1.29× Mem eff: 1.00× L1 #25: Swish Speedup: 2.47× Mem eff: 1.50× L1 #26: GELU Speedup: 0.95× Mem eff: 1.00× L1 #27: SELU Speedup: 0.93× Mem eff: 1.00× L1 #28: HardSigmoid Speedup: 0.96× Mem eff: 1.00× L1 #29: Softplus Speedup: 0.97× Mem eff: 1.00× L1 #30: Softsign Speedup: 2.51× Mem eff: 1.50× L1 #31: ELU Speedup: 0.95× Mem eff: 1.00× L1 #32: HardTanh Speedup: 0.91× Mem eff: 1.00× L1 #33: BatchNorm Speedup: 1.00× Mem eff: 1.00× L1 #34: InstanceNorm Speedup: 1.43× Mem eff: 1.00× L1 #35: GroupNorm Speedup: 
1.47× Mem eff: 1.00× L1 #36: RMSNorm Speedup: 2.11× Mem eff: 1.01× L1 #37: FrobeniusNorm Speedup: 1.28× Mem eff: 1.00× L1 #38: L1Norm Speedup: 1.04× Mem eff: 1.00× L1 #39: L2Norm Speedup: 0.89× Mem eff: 1.00× L1 #41: Max Pooling 1D Speedup: 2.25× Mem eff: 2.00× L1 #42: Max Pooling 2D Speedup: 1.33× Mem eff: 2.01× L1 #43: Max Pooling 3D Speedup: 1.19× Mem eff: 1.21× L1 #45: Average Pooling 2D Speedup: 1.13× Mem eff: 1.00× L1 #46: Average Pooling 3D Speedup: 1.54× Mem eff: 1.00× L1 #47: Sum reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #48: Mean reduction over a dimension Speedup: 0.90× Mem eff: 1.00× L1 #49: Max reduction over a dimension Speedup: 1.18× Mem eff: 1.01× L1 #50: conv standard 2D square input square kernel Speedup: 0.12× Mem eff: 2.07× L1 #51: Argmax over a dimension Speedup: 1.09× Mem eff: 1.00× L1 #52: Argmin over a dimension Speedup: 1.20× Mem eff: 1.00× L1 #53: Min reduction over a dimension Speedup: 1.18× Mem eff: 1.01× L1 #54: conv standard 3D square input square kernel Speedup: 0.07× Mem eff: 1.03× L1 #55: conv standard 2D asymmetric input square kernel Speedup: 0.03× Mem eff: 1.34× L1 #56: conv standard 2D asymmetric input asymmetric kernel Speedup: 0.03× Mem eff: 2.04× L1 #57: conv transposed 2D square input square kernel Speedup: 0.03× Mem eff: 2.01× L1 #58: conv transposed 3D asymmetric input asymmetric kernel Speedup: 0.18× Mem eff: 2.29× L1 #59: conv standard 3D asymmetric input square kernel Speedup: 0.06× Mem eff: 1.01× L1 #60: conv standard 3D square input asymmetric kernel Speedup: 0.12× Mem eff: 1.04× L1 #61: conv transposed 3D square input square kernel Speedup: 0.01× Mem eff: 2.04× L1 #62: conv standard 2D square input asymmetric kernel Speedup: 0.03× Mem eff: 2.04× L1 #63: conv standard 2D square input square kernel Speedup: 0.02× Mem eff: 1.11× L1 #64: conv transposed 1D Speedup: 0.02× Mem eff: 2.01× L1 #65: conv transposed 2D square input asymmetric kernel Speedup: 0.08× Mem eff: 2.03× L1 #66: conv standard 3D 
asymmetric input asymmetric kernel Speedup: 0.11× Mem eff: 1.07× L1 #67: conv standard 1D Speedup: 0.03× Mem eff: 1.34× L1 #68: conv transposed 3D square input asymmetric kernel Speedup: 0.01× Mem eff: 2.02× L1 #69: conv transposed 2D asymmetric input asymmetric kernel Speedup: 0.03× Mem eff: 2.02× L1 #70: conv transposed 3D asymmetric input square kernel Speedup: 0.06× Mem eff: 2.02× L1 #71: conv transposed 2D asymmetric input square kernel Speedup: 0.06× Mem eff: 2.03× L1 #72: conv transposed 3D asymmetric input asymmetric kernel strided padded grouped Speedup: 0.18× Mem eff: 1.26× L1 #74: conv transposed 1D dilated Speedup: 0.07× Mem eff: 2.02× L1 #75: conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated Speedup: 0.17× Mem eff: 1.04× L1 #76: conv standard 1D dilated strided Speedup: 0.04× Mem eff: 1.60× L1 #77: conv transposed 3D square input square kernel padded dilated strided Speedup: 0.08× Mem eff: 2.05× L1 #78: conv transposed 2D asymmetric input asymmetric kernel padded Speedup: 0.04× Mem eff: 2.03× L1 #79: conv transposed 1D asymmetric input square kernel padded strided dilated Speedup: 0.33× Mem eff: 2.02× L1 #80: conv standard 2D square input asymmetric kernel dilated padded Speedup: 0.04× Mem eff: 2.04× L1 #81: conv transposed 2D asymmetric input square kernel dilated padded strided Speedup: 0.03× Mem eff: 2.04× L1 #82: conv depthwise 2D square input square kernel Speedup: 1.16× Mem eff: 1.02× L1 #83: conv depthwise 2D square input asymmetric kernel Speedup: 1.57× Mem eff: 1.03× L1 #84: conv depthwise 2D asymmetric input square kernel Speedup: 0.89× Mem eff: 1.00× L1 #85: conv depthwise 2D asymmetric input asymmetric kernel Speedup: 1.06× Mem eff: 1.03× L1 #86: conv depthwise separable 2D Speedup: 0.02× Mem eff: 1.34× L1 #87: conv pointwise 2D Speedup: 0.03× Mem eff: 1.00× L1 #89: cumsum Speedup: 1.99× Mem eff: 1.00× L1 #90: cumprod Speedup: 0.86× Mem eff: 1.00× L1 #93: masked cumsum Speedup: 1.00× Mem eff: 1.00× L1 #96: 
HuberLoss Speedup: 2.08× Mem eff: 2.00×
L2 #1: Conv2D ReLU BiasAdd Speedup: 1.10× Mem eff: 1.01×
L2 #2: ConvTranspose2d BiasAdd Clamp Scaling Clamp Divide Speedup: 1.83× Mem eff: 1.01×
L2 #3: ConvTranspose3d Sum LayerNorm AvgPool GELU Speedup: 2.21× Mem eff: 1.01×
L2 #4: Conv2d Mish Mish Speedup: 0.96× Mem eff: 1.01×
L2 #5: ConvTranspose2d Subtract Tanh Speedup: 1.08× Mem eff: 1.01×
L2 #6: Conv3d Softmax MaxPool MaxPool Speedup: 1.41× Mem eff: 2.04×
L2 #7: Conv3d ReLU LeakyReLU GELU Sigmoid BiasAdd Speedup: 1.20× Mem eff: 1.02×
L2 #8: Conv3d Divide Max GlobalAvgPool BiasAdd Sum Speedup: 1.07× Mem eff: 1.67×
L2 #9: Matmul Subtract Multiply ReLU Speedup: 0.15× Mem eff: 0.63×
L2 #10: ConvTranspose2d MaxPool Hardtanh Mean Tanh Speedup: 1.18× Mem eff: 1.00×
L2 #12: Gemm Multiply LeakyReLU Speedup: 0.14× Mem eff: 0.60×
L2 #13: ConvTranspose3d Mean Add Softmax Tanh Scaling Speedup: 0.72× Mem eff: 1.01×
L2 #15: ConvTranspose3d BatchNorm Subtract Speedup: 1.01× Mem eff: 0.74×
L2 #16: ConvTranspose2d Mish Add Hardtanh Scaling Speedup: 1.50× Mem eff: 1.01×
L2 #17: Conv2d InstanceNorm Divide Speedup: 1.27× Mem eff: 1.01×
L2 #19: ConvTranspose2d GELU GroupNorm Speedup: 1.07× Mem eff: 1.00×
L2 #20: ConvTranspose3d Sum ResidualAdd Multiply ResidualAdd Speedup: 1.69× Mem eff: 1.47×
L2 #21: Conv2d Add Scale Sigmoid GroupNorm Speedup: 1.21× Mem eff: 1.01×
L2 #22: Matmul Scale ResidualAdd Clamp LogSumExp Mish Speedup: 0.18× Mem eff: 0.63×
L2 #24: Conv3d Min Softmax Speedup: 1.02× Mem eff: 1.19×
L2 #25: Conv2d Min Tanh Tanh Speedup: 1.08× Mem eff: 1.04×
L2 #26: ConvTranspose3d Add HardSwish Speedup: 1.45× Mem eff: 1.31×
L2 #27: Conv3d HardSwish GroupNorm Mean Speedup: 1.25× Mem eff: 1.02×
L2 #28: BMM InstanceNorm Sum ResidualAdd Multiply Speedup: 0.17× Mem eff: 0.62×
L2 #29: Matmul Mish Mish Speedup: 0.15× Mem eff: 0.60×
L2 #30: Gemm GroupNorm Hardtanh Speedup: 0.18× Mem eff: 0.60×
L2 #31: Conv2d Min Add Multiply Speedup: 1.24× Mem eff: 1.01×
L2 #32: Conv2d Scaling Min Speedup: 1.20× Mem eff: 1.25×
L2 #33: Gemm Scale BatchNorm Speedup: 0.15× Mem eff: 0.57×
L2 #34: ConvTranspose3d LayerNorm GELU Scaling Speedup: 1.10× Mem eff: 1.01×
L2 #35: Conv2d Subtract HardSwish MaxPool Mish Speedup: 1.41× Mem eff: 1.25×
L2 #36: ConvTranspose2d Min Sum GELU Add Speedup: 0.32× Mem eff: 1.03×
L2 #37: Matmul Swish Sum GroupNorm Speedup: 0.53× Mem eff: 1.41×
L2 #38: ConvTranspose3d AvgPool Clamp Softmax Multiply Speedup: 1.38× Mem eff: 1.01×
L2 #39: Gemm Scale BatchNorm Speedup: 0.18× Mem eff: 0.73×
L2 #40: Matmul Scaling ResidualAdd Speedup: 0.18× Mem eff: 1.21×
L2 #41: Gemm BatchNorm GELU ReLU Speedup: 0.19× Mem eff: 0.72×
L2 #42: ConvTranspose2d GlobalAvgPool BiasAdd LogSumExp Sum Multiply Speedup: 1.00× Mem eff: 1.01×
L2 #43: Conv3d Max LogSumExp ReLU Speedup: 1.13× Mem eff: 1.03×
L2 #44: ConvTranspose2d Multiply GlobalAvgPool GlobalAvgPool Mean Speedup: 1.14× Mem eff: 1.03×
L2 #45: Gemm Sigmoid LogSumExp Speedup: 0.17× Mem eff: 1.25×
L2 #46: Conv2d Subtract Tanh Subtract AvgPool Speedup: 1.63× Mem eff: 1.25×
L2 #47: Conv3d Mish Tanh Speedup: 1.02× Mem eff: 1.23×
L2 #48: Conv3d Scaling Tanh Multiply Sigmoid Speedup: 1.17× Mem eff: 1.03×
L2 #49: ConvTranspose3d Softmax Sigmoid Speedup: 1.40× Mem eff: 1.03×
L2 #50: ConvTranspose3d Scaling AvgPool BiasAdd Scaling Speedup: 1.32× Mem eff: 1.02×
L2 #51: Gemm Subtract GlobalAvgPool LogSumExp GELU ResidualAdd Speedup: 0.16× Mem eff: 0.68×
L2 #52: Conv2d Activation BatchNorm Speedup: 1.17× Mem eff: 1.02×
L2 #53: Gemm Scaling Hardtanh GELU Speedup: 0.16× Mem eff: 0.71×
L2 #54: Conv2d Multiply LeakyReLU GELU Speedup: 1.15× Mem eff: 1.01×
L2 #55: Matmul MaxPool Sum Scale Speedup: 0.22× Mem eff: 0.51×
L2 #56: Matmul Sigmoid Sum Speedup: 0.22× Mem eff: 0.51×
L2 #58: ConvTranspose3d LogSumExp HardSwish Subtract Clamp Speedup: 1.45× Mem eff: 1.06×
L2 #59: Matmul Swish Scaling Speedup: 0.22× Mem eff: 0.51×
L2 #60: ConvTranspose3d Swish GroupNorm HardSwish Speedup: 1.12× Mem eff: 1.48×
L2 #61: ConvTranspose3d ReLU GroupNorm Speedup: 1.16× Mem eff: 1.03×
L2 #62: Matmul GroupNorm LeakyReLU Sum Speedup: 0.26× Mem eff: 0.61×
L2 #63: Gemm ReLU Divide Speedup: 0.14× Mem eff: 0.60×
L2 #64: Gemm LogSumExp LeakyReLU LeakyReLU GELU GELU Speedup: 0.16× Mem eff: 0.63×
L2 #65: Conv2d AvgPool Sigmoid Sum Speedup: 1.14× Mem eff: 1.06×
L2 #66: Matmul Dropout Softmax Speedup: 0.22× Mem eff: 0.51×
L2 #67: Conv2d GELU GlobalAvgPool Speedup: 1.11× Mem eff: 1.90×
L2 #68: Matmul Min Subtract Speedup: 0.22× Mem eff: 0.51×
L2 #69: Conv2d HardSwish ReLU Speedup: 1.17× Mem eff: 1.03×
L2 #70: Gemm Sigmoid Scaling ResidualAdd Speedup: 0.15× Mem eff: 0.65×
L2 #71: Conv2d Divide LeakyReLU Speedup: 1.17× Mem eff: 1.03×
L2 #72: ConvTranspose3d BatchNorm AvgPool AvgPool Speedup: 1.00× Mem eff: 0.97×
L2 #73: Conv2d BatchNorm Scaling Speedup: 1.00× Mem eff: 0.70×
L2 #74: ConvTranspose3d LeakyReLU Multiply LeakyReLU Max Speedup: 1.74× Mem eff: 1.06×
L2 #75: Gemm GroupNorm Min BiasAdd Speedup: 0.27× Mem eff: 0.64×
L2 #76: Gemm Add ReLU Speedup: 0.15× Mem eff: 0.60×
L2 #77: ConvTranspose3d Scale BatchNorm GlobalAvgPool Speedup: 1.03× Mem eff: 1.05×
L2 #78: ConvTranspose3d Max Max Sum Speedup: 0.75× Mem eff: 1.01×
L2 #79: Conv3d Multiply InstanceNorm Clamp Multiply Max Speedup: 1.43× Mem eff: 1.97×
L2 #81: Gemm Swish Divide Clamp Tanh Clamp Speedup: 0.17× Mem eff: 0.68×
L2 #82: Conv2d Tanh Scaling BiasAdd Max Speedup: 1.77× Mem eff: 1.80×
L2 #84: Gemm BatchNorm Scaling Softmax Speedup: 0.16× Mem eff: 0.55×
L2 #85: Conv2d GroupNorm Scale MaxPool Clamp Speedup: 1.58× Mem eff: 1.84×
L2 #86: Matmul Divide GELU Speedup: 0.14× Mem eff: 0.60×
L2 #87: Conv2d Subtract Subtract Mish Speedup: 1.32× Mem eff: 1.01×
L2 #88: Gemm GroupNorm Swish Multiply Swish Speedup: 0.23× Mem eff: 0.65×
L2 #89: ConvTranspose3d MaxPool Softmax Subtract Swish Max Speedup: 1.12× Mem eff: 1.02×
L2 #90: Conv3d LeakyReLU Sum Clamp GELU Speedup: 1.25× Mem eff: 1.88×
L2 #91: ConvTranspose2d Softmax BiasAdd Scaling Sigmoid Speedup: 2.35× Mem eff: 1.01×
L2 #93: ConvTranspose2d Add Min GELU Multiply Speedup: 1.50× Mem eff: 1.01×
L2 #94: Gemm BiasAdd Hardtanh Mish GroupNorm Speedup: 0.22× Mem eff: 0.60×
L2 #95: Matmul Add Swish Tanh GELU Hardtanh Speedup: 0.18× Mem eff: 0.68×
L2 #96: ConvTranspose3d Multiply Max GlobalAvgPool Clamp Speedup: 1.18× Mem eff: 1.02×
L2 #97: Matmul BatchNorm BiasAdd Divide Swish Speedup: 0.17× Mem eff: 0.48×
L2 #98: Matmul AvgPool GELU Scale Max Speedup: 0.15× Mem eff: 0.58×
L2 #99: Matmul GELU Softmax Speedup: 0.15× Mem eff: 0.60×
L2 #100: ConvTranspose3d Clamp Min Divide Speedup: 1.09× Mem eff: 1.00×
L3 #1: MLP Speedup: 0.21× Mem eff: 0.51×
L3 #2: ShallowWideMLP Speedup: 0.21× Mem eff: 0.50×
L3 #3: DeepNarrowMLP Speedup: 0.26× Mem eff: 0.55×
L3 #5: AlexNet Speedup: 0.99× Mem eff: 0.82×
L3 #6: GoogleNetInceptionModule Speedup: 0.16× Mem eff: 1.47×
L3 #7: GoogleNetInceptionV1 Speedup: 0.88× Mem eff: 0.48×
L3 #8: ResNetBasicBlock Speedup: 1.02× Mem eff: 0.61×
L3 #9: ResNet18 Speedup: 0.92× Mem eff: 0.60×
L3 #10: ResNet101 Speedup: 0.87× Mem eff: 0.48×
L3 #11: VGG16 Speedup: 1.00× Mem eff: 0.67×
L3 #12: VGG19 Speedup: 1.06× Mem eff: 0.66×
L3 #13: DenseNet121TransitionLayer Speedup: 1.30× Mem eff: 1.34×
L3 #14: DenseNet121DenseBlock Speedup: 1.00× Mem eff: 0.52×
L3 #15: DenseNet121 Speedup: 1.09× Mem eff: 0.18×
L3 #16: DenseNet201 Speedup: 1.13× Mem eff: 0.15×
L3 #17: SqueezeNetFireModule Speedup: 1.23× Mem eff: 1.00×
L3 #18: SqueezeNet Speedup: 1.01× Mem eff: 0.88×
L3 #19: MobileNetV1 Speedup: 0.84× Mem eff: 0.28×
L3 #23: EfficientNetB1 Speedup: 0.95× Mem eff: 0.27×
L3 #26: ShuffleNet Speedup: 0.98× Mem eff: 0.20×
L3 #27: RegNet Speedup: 1.01× Mem eff: 0.50×
L3 #28: VisionTransformer Speedup: 0.75× Mem eff: 0.46×
L3 #29: SwinMLP Speedup: 0.03× Mem eff: 1.01×
L3 #31: VisionAttention Speedup: 0.80× Mem eff: 1.00×
L3 #32: ConvolutionalVisionTransformer Speedup: 0.84× Mem eff: 0.86×
L3 #33: VanillaRNN Speedup: 0.16× Mem eff: 0.51×
L3 #34: VanillaRNNHidden Speedup: 0.75× Mem eff: 0.96×
L3 #36: LSTMHn Speedup: 0.91× Mem eff: 0.96×
L3 #37: LSTMCn Speedup: 1.01× Mem eff: 0.96×
L3 #38: LSTMBidirectional Speedup: 0.94× Mem eff: 0.95×
L3 #43: MinGPTCausalAttention Speedup: 0.33× Mem eff: 1.02×
L3 #44: MiniGPTBlock Speedup: 0.32× Mem eff: 1.55×
L3 #45: UNetSoftmax Speedup: 1.01× Mem eff: 0.45×
L3 #46: NetVladWithGhostClusters Speedup: 0.26× Mem eff: 1.29×
L3 #50: ReLUSelfAttention Speedup: 0.80× Mem eff: 1.41×

Per-Problem Speedup

Best-of-5 speedup over PyTorch baseline on H200.

Fast (≥3×)  ·  Beats baseline (>1×)  ·  Correct but slower (<1×)  ·  No correct solution  ·  Pending (BF16)
Problem | Claude Opus 4.7 (high) | Claude Opus 4.8 (high) | Claude Sonnet 4.6 (high) | Gemini 3 Flash (high) | Gemini 3.1 Pro (high) | GPT-5.5 (medium) | Kimi K2.6
1 Square matrix multiplication0.19×0.08×0.12×0.02×0.08×0.13×0.02×
2 Standard matrix multiplication0.22×0.06×0.13×0.02×0.02×1.00×0.02×
3 Batched matrix multiplication0.09×0.03×0.16×0.03×0.03×0.16×0.03×
4 Matrix vector multiplication1.21×1.22×1.07×1.24×1.23×1.25×1.22×
5 Matrix scalar multiplication1.00×1.00×0.63×0.62×1.00×1.00×0.98×
6 Matmul with large K dimension0.00×0.05×0.29×0.05×0.16×0.29×0.02×
7 Matmul with small K dimension0.27×0.21×0.08×0.12×0.32×0.17×0.34×
8 Matmul with irregular shapes0.25×0.07×0.38×0.07×0.07×1.00×0.07×
9 Tall skinny matrix multiplication0.38×0.13×0.02×0.17×0.42×0.42×0.22×
10 3D tensor matrix multiplication0.12×0.03×0.17×0.03×0.03×0.17×0.03×
11 4D tensor matrix multiplication0.17×0.04×0.17×0.04×0.11×0.17×0.04×
12 Matmul with diagonal matrices8.28×8.30×4.51×5.63×5.81×8.64×5.55×
13 Matmul for symmetric matrices0.26×0.05×0.13×0.02×0.09×1.00×0.11×
14 Matmul for upper triangular matrices0.39×0.15×0.10×0.15×0.43×0.23×0.14×
15 Matmul for lower triangular matrices0.27×0.15×0.11×0.15×0.42×0.24×0.15×
16 Matmul with transposed A0.17×0.03×0.21×0.03×0.03×1.00×0.13×
17 Matmul with transposed B0.39×0.01×0.12×0.02×0.09×0.12×0.06×
18 Matmul with transposed both0.07×0.01×0.01×0.02×0.02×0.12×0.02×
19 ReLU0.95×1.01×0.63×1.00×1.00×1.16×0.95×
20 LeakyReLU1.00×0.95×0.63×1.00×1.00×1.00×0.99×
21 Sigmoid1.00×0.95×0.58×0.58×1.00×1.00×0.70×
22 Tanh1.01×0.96×0.61×0.60×1.01×1.01×1.01×
23 Softmax0.90×1.35×1.12×1.49×1.51×1.08×1.49×
24 LogSoftmax0.96×0.96×0.99×1.00×1.31×1.02×1.29×
25 Swish2.47×2.35×1.44×1.45×2.48×2.35×2.47×
26 GELU0.99×1.00×0.98×0.99×0.99×0.95×
27 SELU1.00×0.95×0.62×0.61×1.00×1.00×0.93×
28 HardSigmoid1.00×0.95×0.60×1.00×1.00×1.00×0.96×
29 Softplus1.19×1.15×0.69×0.67×1.19×1.18×0.97×
30 Softsign3.47×3.29×2.07×2.07×3.47×3.46×2.51×
31 ELU1.00×0.95×0.62×0.99×1.00×1.00×0.95×
32 HardTanh1.00×0.95×0.63×1.00×1.00×1.00×0.91×
33 BatchNorm1.62×0.24×0.87×0.99×1.24×2.52×1.00×
34 InstanceNorm1.50×1.42×1.43×1.00×1.50×1.48×1.43×
35 GroupNorm1.30×1.38×1.29×1.05×1.57×1.70×1.47×
36 RMSNorm2.11×2.11×0.24×2.93×2.99×2.92×2.11×
37 FrobeniusNorm1.73×1.26×0.86×1.25×1.27×1.31×1.28×
38 L1Norm1.42×1.19×1.77×1.33×1.53×1.35×1.04×
39 L2Norm0.74×0.89×0.90×0.90×0.80×0.89×
40 LayerNorm5.93×3.14×5.27×2.52×29.96×25.14×25.14×
41 Max Pooling 1D1.71×2.10×3.02×2.28×2.88×3.66×2.25×
42 Max Pooling 2D2.07×1.49×1.58×1.35×1.36×3.09×1.33×
43 Max Pooling 3D1.46×1.13×1.21×1.17×1.10×1.41×1.19×
44 Average Pooling 1D3.63×3.10×2.43×3.34×3.32×6.28×3.08×
45 Average Pooling 2D1.14×1.12×1.14×1.15×1.13×
46 Average Pooling 3D1.63×1.69×1.82×1.55×1.79×2.45×1.54×
47 Sum reduction over a dimension0.91×0.89×0.83×0.90×0.90×1.03×0.90×
48 Mean reduction over a dimension0.92×0.83×0.90×0.90×0.92×1.04×0.90×
49 Max reduction over a dimension1.20×0.22×1.09×1.19×1.21×1.36×1.18×
50 conv standard 2D square input square kernel0.31×1.00×0.12×0.16×0.12×
51 Argmax over a dimension1.11×1.12×1.11×1.20×1.19×1.34×1.09×
52 Argmin over a dimension1.12×1.19×1.19×1.20×1.28×1.20×
53 Min reduction over a dimension1.20×1.09×1.09×1.18×1.18×1.36×1.18×
54 conv standard 3D square input square kernel0.17×0.07×0.99×0.06×0.06×0.38×0.07×
55 conv standard 2D asymmetric input square kernel0.10×0.01×0.07×0.03×
56 conv standard 2D asymmetric input asymmetric kernel0.05×1.00×0.02×1.04×0.03×
57 conv transposed 2D square input square kernel0.05×1.02×0.03×1.02×1.01×0.03×
58 conv transposed 3D asymmetric input asymmetric kernel0.17×1.13×0.05×1.02×0.34×0.18×
59 conv standard 3D asymmetric input square kernel0.50×0.07×0.50×0.08×1.00×0.91×0.06×
60 conv standard 3D square input asymmetric kernel0.24×0.09×0.09×1.00×0.45×0.12×
61 conv transposed 3D square input square kernel0.07×1.66×0.03×1.65×0.93×0.01×
62 conv standard 2D square input asymmetric kernel0.07×1.53×0.03×0.15×0.03×
63 conv standard 2D square input square kernel0.03×1.01×0.02×0.15×0.02×
64 conv transposed 1D0.07×0.04×0.04×0.09×0.02×
65 conv transposed 2D square input asymmetric kernel0.04×1.03×0.03×1.00×0.98×0.08×
66 conv standard 3D asymmetric input asymmetric kernel0.26×0.10×1.76×0.11×1.00×0.51×0.11×
67 conv standard 1D0.10×0.08×0.05×0.14×0.03×
68 conv transposed 3D square input asymmetric kernel0.03×1.00×0.03×1.00×0.16×0.01×
69 conv transposed 2D asymmetric input asymmetric kernel0.02×1.12×0.02×1.11×0.03×
70 conv transposed 3D asymmetric input square kernel0.08×1.02×0.07×1.03×1.02×0.06×
71 conv transposed 2D asymmetric input square kernel0.05×1.10×0.07×1.00×0.18×0.06×
72 conv transposed 3D asymmetric input asymmetric kernel strided padded grouped0.57×0.50×1.02×0.42×1.00×0.38×0.18×
73 conv transposed 3D asymmetric input square kernel strided padded grouped0.05×0.05×1.00×0.02×1.00×0.06×
74 conv transposed 1D dilated0.12×0.09×0.08×0.08×0.12×0.07×
75 conv transposed 2D asymmetric input asymmetric kernel strided grouped padded dilated0.48×0.51×0.08×0.48×0.55×2.94×0.17×
76 conv standard 1D dilated strided0.05×0.04×0.04×0.15×0.04×
77 conv transposed 3D square input square kernel padded dilated strided0.08×0.09×1.00×0.09×0.98×3.81×0.08×
78 conv transposed 2D asymmetric input asymmetric kernel padded0.05×0.47×0.05×0.14×0.04×
79 conv transposed 1D asymmetric input square kernel padded strided dilated0.14×0.14×0.38×0.33×
80 conv standard 2D square input asymmetric kernel dilated padded0.05×0.03×0.03×0.08×0.04×
81 conv transposed 2D asymmetric input square kernel dilated padded strided0.19×0.22×0.03×0.19×0.22×0.81×0.03×
82 conv depthwise 2D square input square kernel1.78×1.70×1.33×1.15×1.00×2.59×1.16×
83 conv depthwise 2D square input asymmetric kernel2.64×1.51×1.97×1.58×5.18×1.57×
84 conv depthwise 2D asymmetric input square kernel1.27×1.61×1.09×1.09×2.65×0.89×
85 conv depthwise 2D asymmetric input asymmetric kernel1.36×1.30×1.12×1.07×2.71×1.06×
86 conv depthwise separable 2D1.12×0.02×0.98×1.17×0.02×
87 conv pointwise 2D0.18×0.03×0.53×0.03×
88 MinGPTNewGelu8.52×8.46×5.23×5.23×8.58×8.52×5.23×
89 cumsum0.55×0.64×0.22×0.87×1.16×2.00×1.99×
90 cumprod0.55×0.82×0.62×1.17×1.26×4.32×0.86×
91 cumsum reverse1.23×1.74×0.78×2.87×2.81×2.36×4.28×
92 cumsum exclusive0.83×1.31×1.58×3.36×3.07×
93 masked cumsum1.18×2.92×1.00×
94 MSELoss2.99×2.91×3.12×3.12×3.12×3.21×2.99×
95 CrossEntropyLoss
96 HuberLoss1.90×1.92×0.77×1.70×2.05×2.13×2.08×
97 ScaledDotProductAttention3.15×1.01×3.18×8.32×
98 KLDivLoss3.48×5.47×5.19×3.35×4.19×5.68×4.17×
99 TripletMarginLoss4.27×4.25×3.91×3.94×4.27×4.41×4.26×
100 HingeLoss4.58×3.69×3.64×8.45×1.49×
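
The legend buckets and the summary metrics defined at the top of the page can be derived from one model's column of best-of-5 speedups. The following is a minimal sketch, not the leaderboard's actual code: the function names `legend_bucket` and `summarize` are hypothetical, a missing cell is represented as `None`, Fast@1 is assumed to be measured over all problems in the level, and a speedup of exactly 1.00× (which the legend does not place) is assumed to count as not faster, in line with the strict ">1×" in the Fast@1 definition. The input values in the usage note are illustrative, not taken from the table.

```python
import math
from typing import Optional, Sequence


def legend_bucket(speedup: Optional[float]) -> str:
    """Map one best-of-5 speedup to a legend bucket (None = no correct kernel)."""
    if speedup is None:
        return "No correct solution"
    if speedup >= 3.0:
        return "Fast (≥3×)"
    if speedup > 1.0:
        return "Beats baseline (>1×)"
    # Assumption: exactly 1.00× is grouped with the slower kernels.
    return "Correct but not faster (≤1×)"


def summarize(speedups: Sequence[Optional[float]]) -> dict:
    """Summarize one model's column for one level.

    Assumes unrounded, strictly positive speedups; a value displayed as
    0.00× would need its unrounded ratio before taking the logarithm.
    """
    if not speedups:
        raise ValueError("need at least one problem")
    n = len(speedups)
    correct = [s for s in speedups if s is not None]
    if correct:
        # Geometric mean over correct problems only.
        geomean = math.exp(sum(math.log(s) for s in correct) / len(correct))
    else:
        geomean = float("nan")
    return {
        "correct_speedup": geomean,
        "correctness": len(correct) / n,
        # Assumption: the denominator is all problems, not only correct ones.
        "fast_at_1": sum(s > 1.0 for s in correct) / n,
    }
```

For example, `summarize([2.0, 0.5, None, 4.0])` gives a correctness of 0.75, a Fast@1 of 0.5, and a correct speedup equal to the cube root of 4 (about 1.59), since the failed problem is excluded from the geometric mean but still counts in both fractions.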