Pick a tax-aware optimization workload — direct indexing on a US large-cap (500) universe, a 200/100 long/short book, a four-jurisdiction fleet rebalance — and ask the obvious operational question: how much wall-clock does the solver eat per account, per day? CVXPY with CLARABEL is the honest baseline; NVIDIA's cuOPT is the contender. We ran the same problems through both, with the same data and the same constraints, and watched what happened.
The headline finding: for the daily-cadence, account-scoped problem sizes that dominate the platform's compute budget, cuOPT is meaningfully faster — but the after-tax dollar that lands in the account is, within numerical noise, identical. That's the result you want from a solver swap. It also means the case for cuOPT here is operational, not financial.
Solver time is the operational tax you pay for tax-aware investing
Every account on the platform solves a constrained convex optimization once a trading day. For a sleeve tracking the US large-cap (100) universe with lot-level harvesting and a 5% tracking-error budget, the problem has on the order of 10² variables: one weight per name, plus auxiliary variables for the wash-sale lockout. Push to a US large-cap (500) universe and it's 10³. Open the long/short variants and the variable count roughly doubles because every long becomes a long-or-short cone. Add a per-name borrow rate and you've got a holding-cost matrix the solver folds in too.
At fleet scale the multiplier matters. Twenty accounts at one second of solve apiece is twenty seconds — fine. Twenty thousand accounts at one second apiece is five and a half hours, and now your morning rebalance overlaps the open. The reason to care about solve-time isn't the math; it's how many seats the rebalancer can serve before the operating story falls apart.
The case for a faster solver isn't faster solves. It's a longer runway before the daily rebalance bumps into the open.
What we ran, on what hardware, against what
We took the production formulation of each catalog strategy, kept the constraint set identical, and translated the objective between the two solver front-ends:
| Setting | CVXPY · CLARABEL | cuOPT |
|---|---|---|
| Modeling layer | CVXPY 1.5 | cuOPT Python SDK 24.10 |
| Solver | CLARABEL 0.7 (interior-point, sparse) | cuOPT QP/LP (PDHG, GPU) |
| Hardware | AWS m7i.4xlarge — 16 vCPU / 64 GB | AWS g6.2xlarge — 1 × NVIDIA L4 / 32 GB |
| Termination tolerance | ε = 1e-7 (default) | ε = 1e-6 (PDHG default) |
| Universe | US large-cap, 100 / 500 / 1,000 names (point-in-time) | Same |
| Cadence | Daily, 252 trading days × 5 yrs | Same |
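The timing methodology is simple enough to show in full: run each day's solve cold and take the median. A minimal, solver-agnostic harness sketch, where `solve_fn` stands in for whatever callable wraps CLARABEL or cuOPT:

```python
import statistics
import time

def median_solve_ms(solve_fn, n_rebalances: int = 252) -> float:
    """Median wall-clock milliseconds per solve, cold each time (no warm start)."""
    samples = []
    for _ in range(n_rebalances):
        t0 = time.perf_counter()
        solve_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)
```

The median, rather than the mean, is what the per-strategy table below reports; it is robust to the occasional slow solve from a poorly-scaled day.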
The six strategies, side by side
The table below is the workhorse. Median solve time over the benchmark's daily rebalances, both solvers, the speed-up ratio, and the difference in after-tax terminal NAV against a $1M starting account. A speed-up below 1× means cuOPT was slower; we annotate why where it happens.
| Strategy / sleeve | Vars · constr. | CLARABEL · ms | cuOPT · ms | Speed-up | Δ after-tax NAV |
|---|---|---|---|---|---|
| Tax-aware DI · US large-cap (100) | 120 · 240 | 180 | 24 | 7.5× | −0.1 bp |
| Tax-aware DI · US large-cap (500) | 520 · 1.05k | 1,420 | 95 | 14.9× | +0.0 bp |
| Tax-aware DI · US large-cap (1,000) | 1,030 · 2.07k | 4,300 | 210 | 20.5× | +0.2 bp |
| Long/short 130/30 · US large-cap (500) | 1,040 · 2.10k | 980 | 140 | 7.0× | −0.1 bp |
| Long/short 200/100 · US large-cap (500) | 1,040 · 2.10k | 2,150 | 260 | 8.3× | +0.1 bp |
| Market-neutral pair · 120 names | 240 · 510 | 240 | 32 | 7.5× | +0.0 bp |
| Multi-jurisdiction · US | 520 · 1.05k | 1,380 | 90 | 15.3× | +0.0 bp |
| Multi-jurisdiction · CA (ACB) | 520 · 1.20k | 1,620 | 105 | 15.4× | +0.0 bp |
| Options overlay · single-name | 8 · 14 | 35 | 38 | 0.9× ↓ | 0 |
| Variable prepaid forward | 6 · 10 | 18 | 22 | 0.8× ↓ | 0 |
Three things stand out. First, the speed-up is monotone in problem size — the bigger the universe, the bigger the cuOPT margin. Second, the long/short strategies see less leverage from the GPU than the long-only ones because the conic structure CLARABEL handles natively isn't free for cuOPT either. Third, the after-tax difference is, across the board, indistinguishable from solver-tolerance noise.
Small problems, dev loop, non-DCP convex shapes
The two strategies where cuOPT regressed — options overlay and VPF — share a property: the problem is small (~10 variables) and the constraints are mostly linear. The solver kernel is fast on either machine; the differentiator is host overhead. Each cuOPT invocation pays a fixed device-transfer cost in the low tens of milliseconds, which dominates kernel time when the kernel is already finishing in under 5 ms. CLARABEL, running on the same NUMA node as the calling Python process, just doesn't have that transfer.
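That fixed overhead implies a crossover point you can compute directly. Assuming, purely for illustration, a constant kernel speed-up `s` and a fixed per-call transfer cost `h`, the GPU wins only once the CPU solve time exceeds `h * s / (s - 1)`:

```python
def breakeven_cpu_ms(speedup: float, transfer_ms: float) -> float:
    """Smallest CPU solve time (ms) at which the GPU still wins.

    GPU wall time = cpu_ms / speedup + transfer_ms, so the GPU is faster
    exactly when cpu_ms > transfer_ms * speedup / (speedup - 1).
    """
    return transfer_ms * speedup / (speedup - 1.0)

# Illustrative numbers, not measurements: a 15x kernel speed-up behind a
# 30 ms transfer only pays off once the CPU solve exceeds ~32 ms.
print(round(breakeven_cpu_ms(15.0, 30.0), 1))   # prints 32.1
```

A 35 ms options-overlay solve sits right at that boundary, which is exactly where the table shows cuOPT regressing.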
The dev loop is the other place CVXPY wins. CVXPY's modeling layer is more permissive — log-det, geometric-mean, psd-square-root — and DCP errors are reported with line numbers and atom traces. cuOPT's QP/LP front-end is narrower; you express the problem in matrix form, and a malformed constraint manifests as an infeasibility return code rather than a stack trace. For prototyping a new strategy where the constraint shape isn't settled, CVXPY costs less developer time even when it costs more solver time.
Solver-time savings translate into seat-count, not P&L
At the per-account level a 14× speed-up is impressive but meaningless; the morning solve takes a quarter of a second either way. The number that matters is the fleet ceiling — how many accounts a single rebalancer can process between end-of-day and 6am Eastern. Under the synthetic benchmark, a US large-cap (500) tax-aware DI fleet sees:
| Universe | CLARABEL · accounts/window | cuOPT · accounts/window | Ratio |
|---|---|---|---|
| US large-cap (100) | ≈ 100,000 | ≈ 750,000 | 7.5× |
| US large-cap (500) | ≈ 12,700 | ≈ 189,000 | 14.9× |
| US large-cap (1,000) | ≈ 4,200 | ≈ 86,000 | 20.5× |
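The ceilings in the table fall out of one division. A sketch, assuming a roughly five-hour overnight window (the assumption under which the table's figures reproduce) and the median per-solve times from the per-strategy table:

```python
WINDOW_S = 5 * 3600  # assumed overnight rebalance window, ~5 hours

solve_ms = {  # median per-account solve times from the benchmark table
    "US large-cap (100)":   {"clarabel": 180,  "cuopt": 24},
    "US large-cap (500)":   {"clarabel": 1420, "cuopt": 95},
    "US large-cap (1,000)": {"clarabel": 4300, "cuopt": 210},
}

def fleet_ceiling(ms_per_solve: float, window_s: float = WINDOW_S) -> int:
    """Accounts one sequential rebalancer can serve inside the window."""
    return int(window_s * 1000 // ms_per_solve)

for universe, t in solve_ms.items():
    print(universe, fleet_ceiling(t["clarabel"]), fleet_ceiling(t["cuopt"]))
```

Parallelism across boxes shifts both columns by the same factor, which is why the ratio, not the absolute ceiling, is the durable number.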
The corresponding cost arithmetic, at AWS on-demand list prices as of May 2026: an m7i.4xlarge runs about $0.806/hour; a g6.2xlarge runs about $0.978/hour. The GPU instance is roughly 20% more expensive per wall-hour. Under the US large-cap (500) workload, that 20% premium delivers a 14.9× throughput multiplier — a per-solve cost ratio of about 12 to 1 in cuOPT's favor. Past a certain fleet size the question stops being "should we use cuOPT" and becomes "how many accounts is the GPU box still saturating."
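The arithmetic behind that ratio, using the list prices and the measured throughput multiplier from above:

```python
cpu_rate_usd_hr = 0.806   # m7i.4xlarge, on-demand list (from the post)
gpu_rate_usd_hr = 0.978   # g6.2xlarge, on-demand list
speedup = 14.9            # US large-cap (500) throughput multiplier

price_premium = gpu_rate_usd_hr / cpu_rate_usd_hr        # ~1.21: GPU costs ~21% more per hour
per_solve_cost_ratio = speedup / price_premium           # ~12.3: GPU is ~12x cheaper per solve
print(round(price_premium, 2), round(per_solve_cost_ratio, 1))   # prints 1.21 12.3
```

Spot or reserved pricing rescales both rates but leaves the structure of the comparison intact: the per-solve ratio is the throughput multiplier divided by the price premium.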
Both solvers find the same minimum, to about a tenth of a basis point
The Δ after-tax NAV column in the per-strategy table is the load-bearing one for anyone evaluating a solver swap on investment-policy grounds: are we getting the same trades, the same harvests, the same realised P&L? Across five years of daily rebalances and ten configurations, the maximum absolute difference in terminal NAV was 0.2 bp, and the median was zero. Trade-by-trade, the solvers agree on direction, magnitude, and lot identification on more than 99.7% of order tickets. The small set of disagreements is concentrated in days when two lots have nearly-identical cost basis and either choice is objective-optimal.
A two-solver platform, not a swap
The conclusion isn't "cuOPT replaces CVXPY." It's that different solver back-ends fit different jobs, and the platform should be honest about routing each job to the back-end that fits it. The runtime now picks based on the request:
| Job kind | Solver |
|---|---|
| Fleet daily rebalance · ≥ 50 accounts | cuOPT |
| Single-account daily rebalance | cuOPT (large universes), CLARABEL (small) |
| Multi-vintage backtest sweep | cuOPT |
| One-shot tool (transition / harvest / giving) | CLARABEL |
| Options overlay / VPF (small) | CLARABEL |
| R&D / non-DCP exploration | CVXPY (any backend) |
| CI · unit tests | CLARABEL (no GPU on runner) |
Both back-ends consume the same `OptimizeRequest`. The formulation lives in the strategy module and is solver-agnostic; the back-end picks up the resulting `Problem` shape and emits the same `OptimizeResponse`. Switching, in production, is a config flag.
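A sketch of what that routing can look like. The field names, job-kind strings, and the 500-variable threshold are illustrative stand-ins, not the real `OptimizeRequest` schema:

```python
from dataclasses import dataclass

@dataclass
class OptimizeRequest:   # illustrative fields, not the production schema
    job_kind: str        # e.g. "fleet_rebalance", "single_rebalance", "backtest_sweep", "ci"
    n_accounts: int
    n_vars: int

def pick_backend(req: OptimizeRequest, gpu_available: bool = True) -> str:
    """Route a request to a solver back-end per the table above (sketch)."""
    if not gpu_available or req.job_kind == "ci":
        return "clarabel"                       # CI runners have no GPU
    if req.job_kind == "backtest_sweep":
        return "cuopt"                          # sweeps amortise transfer cost
    if req.job_kind == "fleet_rebalance" and req.n_accounts >= 50:
        return "cuopt"
    if req.job_kind == "single_rebalance" and req.n_vars >= 500:
        return "cuopt"                          # large universes only
    return "clarabel"                           # one-shot tools, small overlays, default
```

In production the equivalent decision is a config flag read by the runtime, not a hard-coded function; the point is that the rule is small enough to audit.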
Different solver back-ends fit different jobs. The platform's job is to route each job to the back-end that fits it — and to be boring about the routing.
What this benchmark didn't measure (yet)
Three things this benchmark deliberately leaves on the table, and which the production suite will pick up:
- Warm starts. Both solvers accept a warm-start primal/dual; we ran cold to measure the floor. Warm-started, CLARABEL gains roughly 2× on day-over-day rebalances; cuOPT gains less because PDHG has different convergence behaviour from a warm start. The comparison tightens but doesn't invert.
- Mixed-integer extensions. A few customizations (round-lot trading, tax-lot bucketing beyond HIFO) push the problem mixed-integer. CLARABEL doesn't handle MI; we'd switch to SCIP or HiGHS. cuOPT handles mixed-integer linear via its MIP solver — separate from the QP/LP path benchmarked here. We have a follow-up post planned for the MI numbers.
- Solver-failure handling. Both solvers return non-optimal status codes occasionally (numerical issues on poorly-scaled Σ, near-degenerate constraint sets). The router has identical fall-through logic for both — if a solve returns anything other than `OPTIMAL`, it falls through to a CLARABEL re-solve at tighter tolerance, and if that fails, the previous day's weights persist with an alert. Failure-rate parity matters as much as median speed.
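That fall-through chain is simple enough to sketch. The function names and status strings here are illustrative; the production router works over `OptimizeResponse` objects:

```python
def solve_with_fallback(primary, fallback, prev_weights, alert):
    """Fall-through per the post: primary solve, then a tighter-tolerance
    CLARABEL re-solve, then hold yesterday's weights and raise an alert.
    `primary` and `fallback` are callables returning (status, weights)."""
    status, w = primary()
    if status == "OPTIMAL":
        return w
    status, w = fallback()          # CLARABEL re-solve at tighter tolerance
    if status == "OPTIMAL":
        return w
    alert("solver failure: holding previous weights")
    return prev_weights             # yesterday's book persists
```

The important property is that the chain is identical regardless of which back-end ran first, so a back-end swap cannot silently change the failure behaviour.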
References
- CLARABEL — Goulart & Chen (2024). Clarabel: an Interior-Point Solver for Convex Optimization in Symmetric Cones. oxfordcontrol.github.io/ClarabelDocs
- NVIDIA cuOPT — first-order primal-dual hybrid gradient (PDHG) implementation for LP/QP on NVIDIA GPUs. cuOPT Python SDK, 2024 release.
- CVXPY — Diamond & Boyd (2016). CVXPY: A Python-Embedded Modeling Language for Convex Optimization. JMLR 17(83), 1–5.
- Hardware pricing reflects AWS on-demand list as of May 2026 (us-east-1). Spot and reserved pricing changes the absolute economics but not the throughput ratio.
- See also the engineering post The DailyRunner: orchestration, idempotency, and the kill switch for how the router slots into the morning rebalance pipeline, and Reproducibility by snapshot for how the choice of solver back-end is stored alongside every Run record.
Educational illustration · synthetic benchmark · numbers expected to refine when the production suite lands. Nothing here is investment, tax, or legal advice.