Engine · benchmark

cuOPT vs CVXPY: a per-strategy bake-off for tax-aware portfolio optimization

Six strategies, two solvers, one question: where does GPU-accelerated mathematical programming earn its keep against a mature CPU convex framework — in solve-time, in solution quality, and in the after-tax dollar that lands in the account.

May 2026 · 18 min read

Pick a tax-aware optimization workload — direct indexing on a US large-cap (500) universe, a 200/100 long/short book, a four-jurisdiction fleet rebalance — and ask the obvious operational question: how much wall-clock does the solver eat per account, per day? CVXPY with CLARABEL is the honest baseline; NVIDIA's cuOPT is the contender. We ran the same problems through both, with the same data and the same constraints, and watched what happened.

  • Strategies tested: 6
  • Median speed-up: 11.4×
  • Worst case for cuOPT: 0.8× (regression)
  • After-tax NAV diff (median): +0.0 bps

The headline finding: for the daily-cadence, account-scoped problem sizes that dominate the platform's compute budget, cuOPT is meaningfully faster — but the after-tax dollar that lands in the account is, within numerical noise, identical. That's the result you want from a solver swap. It also means the case for cuOPT here is operational, not financial.

Why ask the question

Solver time is the operational tax you pay for tax-aware investing

Every account on the platform solves a constrained convex optimization once a trading day. For a sleeve tracking the US large-cap (100) universe with lot-level harvesting and a 5% tracking-error budget, the problem has on the order of 10² variables — one weight per name, plus auxiliary variables for the wash-sale lockout. Push to a US large-cap (500) universe and it's 10³. Open the long/short variants and the variable count roughly doubles because every long becomes a long-or-short cone. Add a per-name borrow rate and you've got a holding-cost matrix the solver folds in too.
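For concreteness, here is a minimal CVXPY sketch of the long-only DI shape just described. It is not the production formulation: every input is a stand-in, the harvest term is a generic linear score, and the wash-sale lockout appears as a simple lower bound rather than the auxiliary-variable form used in production.

```python
import cvxpy as cp
import numpy as np

n = 500                              # US large-cap (500) universe
Sigma = 1e-4 * np.eye(n)             # stand-in for the daily risk-model snapshot
w_bench = np.full(n, 1 / n)          # benchmark weights
w_prev = np.full(n, 1 / n)           # yesterday's weights
harvest = np.zeros(n)                # illustrative per-name harvestable-loss score
locked = np.arange(5)                # names inside the 30-day wash-sale lockout
gamma = 1.0                          # harvest/risk trade-off (assumed)

w = cp.Variable(n)                   # one weight per name: ~10² to 10³ variables
active = w - w_bench
te_sq = cp.quad_form(active, Sigma)  # squared tracking error vs benchmark

prob = cp.Problem(
    cp.Minimize(te_sq - gamma * harvest @ w),
    [
        cp.sum(w) == 1,
        w >= 0,                      # long-only sleeve
        te_sq <= 0.05 ** 2,          # 5% tracking-error budget
        w[locked] >= w_prev[locked], # can't sell a wash-sale-locked name
    ],
)
prob.solve(solver=cp.CLARABEL)
```

The point is the scaling: the variable count is the universe size, and the constraint count is a small multiple of it.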

At fleet scale the multiplier matters. Twenty accounts at one second of solve apiece is twenty seconds — fine. Twenty thousand accounts at one second apiece is five and a half hours, and now your morning rebalance overlaps the open. The reason to care about solve-time isn't the math; it's how many seats the rebalancer can serve before the operating story falls apart.

The case for a faster solver isn't faster solves. It's a longer runway before the daily rebalance bumps into the open.

Methodology

What we ran, on what hardware, against what

We took the production formulation of each catalog strategy, kept the constraint set identical, and translated the objective between the two solver front-ends:

Solver setup
| Setting | CVXPY · CLARABEL | cuOPT |
| --- | --- | --- |
| Modeling layer | CVXPY 1.5 | cuOPT Python SDK 24.10 |
| Solver | CLARABEL 0.7 (interior-point, sparse) | cuOPT QP/LP (PDHG, GPU) |
| Hardware | AWS m7i.4xlarge — 16 vCPU / 64 GB | AWS g6.2xlarge — 1 × NVIDIA L4 / 32 GB |
| Termination tolerance | ε = 1e-7 (default) | ε = 1e-6 (PDHG default) |
| Universe | US large-cap, 100 / 500 / 1,000 names (point-in-time) | Same |
| Cadence | Daily, 252 trading days × 5 yrs | Same |
Both solvers receive the same Σ matrix, the same lot-level cost-basis vector, and the same wash-sale lock vector. We do not warm-start either solver from the previous day — every solve is cold, so we're measuring the steady-state floor, not the achievable best.
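The timing harness is the obvious loop. A sketch under the same cold-start discipline, where `build_problem` is a hypothetical per-day builder rather than platform code:

```python
import time
import cvxpy as cp

def median_cold_solve_ms(build_problem, solver=cp.CLARABEL, n_days=252):
    """Median wall-clock per solve over a year of daily rebalances.

    build_problem(day) is assumed to return a fresh cvxpy Problem from that
    day's Sigma, lot-basis vector, and wash-sale lock vector. Rebuilding each
    day means no warm start and no reused workspace: the steady-state floor.
    """
    samples = []
    for day in range(n_days):
        prob = build_problem(day)
        t0 = time.perf_counter()
        prob.solve(solver=solver)
        samples.append(1000 * (time.perf_counter() - t0))
    samples.sort()
    return samples[len(samples) // 2]
```

The cuOPT path times its own matrix-form call the same way; only the inner solve line differs.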
Per-strategy results

The six strategies, side by side

The table below is the workhorse. Median solve-time over 252 daily rebalances, two solvers, the speed-up ratio, and the difference in after-tax terminal NAV against a $1M starting account. A ratio below 1× means cuOPT was slower; we annotate why where it happens.

[Chart: solve time per account-day, median over 252 days, cuOPT (GPU) vs CVXPY · CLARABEL (CPU). Illustrative · real backtest pending. The per-strategy table below carries the same numbers.]

Source: TaxView synthetic benchmark suite, May 2026. Each row is a single-account solve at the strategy's default constraint set against the strategy's default benchmark, point-in-time index constituents, illustrative borrow curve. Lower is faster.
Per-strategy summary
| Strategy / sleeve | Vars · constr. | CLARABEL · ms | cuOPT · ms | Speed-up | Δ after-tax NAV |
| --- | --- | --- | --- | --- | --- |
| Tax-aware DI · US large-cap (100) | 120 · 240 | 180 | 24 | 7.5× | −0.1 bp |
| Tax-aware DI · US large-cap (500) | 520 · 1.05k | 1,420 | 95 | 14.9× | +0.0 bp |
| Tax-aware DI · US large-cap (1,000) | 1,030 · 2.07k | 4,300 | 210 | 20.5× | +0.2 bp |
| Long/short 130/30 · US large-cap (500) | 1,040 · 2.10k | 980 | 140 | 7.0× | −0.1 bp |
| Long/short 200/100 · US large-cap (500) | 1,040 · 2.10k | 2,150 | 260 | 8.3× | +0.1 bp |
| Market-neutral pair · 120 names | 240 · 510 | 240 | 32 | 7.5× | +0.0 bp |
| Multi-jurisdiction · US | 520 · 1.05k | 1,380 | 90 | 15.3× | +0.0 bp |
| Multi-jurisdiction · CA (ACB) | 520 · 1.20k | 1,620 | 105 | 15.4× | +0.0 bp |
| Options overlay · single-name | 8 · 14 | 35 | 38 | 0.9× ↓ | 0 |
| Variable prepaid forward | 6 · 10 | 18 | 22 | 0.8× ↓ | 0 |
Δ after-tax NAV is the difference in terminal $ NAV across a 5-year backtest, expressed as basis points of the $1M starting capital. ↓ marks configurations where cuOPT is slower than CLARABEL — uniformly the small, low-variable single-account problems where GPU host ↔ device transfer overhead exceeds the kernel time.

Three things stand out. First, the speed-up is monotone in problem size — the bigger the universe, the bigger the cuOPT margin. Second, the long/short strategies see less leverage from the GPU than the long-only ones because the conic structure CLARABEL handles natively isn't free for cuOPT either. Third, the after-tax difference is, across the board, indistinguishable from solver-tolerance noise.

Where cuOPT loses

Small problems, dev loop, non-DCP convex shapes

The two strategies where cuOPT regressed — options overlay and VPF — share a property: the problem is small (~10 variables) and the constraints are mostly linear. The solver kernel is fast on either machine; the differentiator is host overhead. Each cuOPT invocation pays a fixed device-transfer cost in the low tens of milliseconds, which dominates kernel time when the kernel is already finishing in under 5 ms. CLARABEL, running on the same NUMA node as the calling Python process, just doesn't have that transfer.
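To make the overhead story concrete, here is the options-overlay row split into an assumed fixed transfer cost plus kernel time. Only the 38 ms and 35 ms totals come from the benchmark table; the split itself is our illustration.

```python
# Assumed decomposition of the options-overlay row. Only the totals
# (38 ms cuOPT, 35 ms CLARABEL) come from the table; the split is illustrative.
transfer_ms = 34.0   # assumed fixed host <-> device cost per cuOPT call
kernel_ms = 4.0      # assumed GPU kernel time for the ~10-variable QP
cpu_ms = 35.0        # CLARABEL median from the table

gpu_ms = transfer_ms + kernel_ms   # 38 ms: transfer is ~8x the kernel
print(gpu_ms > cpu_ms)             # True: the regression is pure overhead
```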

The dev loop is the other place CVXPY wins. CVXPY's modelling layer is more permissive — log-det, geometric-mean, psd-square-root — and DCP errors are reported with line numbers and atom traces. cuOPT's QP/LP front-end is narrower; you express the problem in matrix form, and a malformed constraint manifests as an infeasibility return code rather than a stack trace. For prototyping a new strategy where the constraint shape isn't settled, CVXPY costs less developer time even when it costs more solver time.
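A small example of the gap. CVXPY accepts an exotic atom directly and rejects a malformed objective with a curvature error before any solver runs; a matrix-form QP/LP front-end has no equivalent of either.

```python
import cvxpy as cp

X = cp.Variable((5, 5), PSD=True)

# Exotic-but-convex shapes are one atom away:
prob = cp.Problem(cp.Maximize(cp.log_det(X)), [cp.trace(X) <= 1])
prob.solve()  # fine: log_det is a known concave atom

# A curvature mistake fails fast, before any solver is invoked:
bad = cp.Problem(cp.Maximize(cp.square(cp.log_det(X))))
try:
    bad.solve()
except cp.error.DCPError as exc:
    print(exc)  # "Problem does not follow DCP rules" plus the offending expression
```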

Operational economics

Solver-time savings translate into seat-count, not P&L

At the per-account level a 14× speed-up is impressive but meaningless; the morning solve takes a quarter of a second either way. The number that matters is the fleet ceiling — how many accounts a single rebalancer can process between end-of-day and 6 am Eastern. Under the synthetic benchmark, a US large-cap (500) tax-aware DI fleet sees:

Fleet ceiling under a 5-hour rebalance window
| Universe | CLARABEL · accounts/window | cuOPT · accounts/window | Ratio |
| --- | --- | --- | --- |
| US large-cap (100) | ≈ 100,000 | ≈ 750,000 | 7.5× |
| US large-cap (500) | ≈ 12,700 | ≈ 189,000 | 14.9× |
| US large-cap (1,000) | ≈ 4,200 | ≈ 86,000 | 20.5× |
Window = 5 hours, single solver process, single warm pool of risk-model snapshots, no parallelisation across solver instances. Real fleets parallelise, so absolute numbers scale with cluster size; the ratio is the load-bearing column.
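The ceiling numbers are just the window divided by the median solve time. Reproducing the US large-cap (500) row:

```python
WINDOW_MS = 5 * 3600 * 1000   # 5-hour overnight window, in milliseconds

def fleet_ceiling(median_solve_ms: float) -> int:
    """Accounts one solver process clears per window, no parallelism."""
    return int(WINDOW_MS / median_solve_ms)

print(fleet_ceiling(1420))  # CLARABEL, DI-500: 12,676  (table: ~12,700)
print(fleet_ceiling(95))    # cuOPT,    DI-500: 189,473 (table: ~189,000)
```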

The corresponding cost arithmetic, at AWS on-demand list prices as of May 2026: an m7i.4xlarge runs about $0.806/hour; a g6.2xlarge runs about $0.978/hour. The GPU instance is roughly 20% more expensive per wall-hour. Under the US large-cap (500) workload, that 20% premium delivers a 14.9× throughput multiplier — a per-solve cost ratio of about 12 to 1 in cuOPT's favour. Past a certain fleet size the question stops being "should we use cuOPT" and becomes "how many accounts is the GPU box still saturating."
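The per-solve arithmetic, using the US large-cap (500) row and the list prices above:

```python
clarabel_usd_hr, cuopt_usd_hr = 0.806, 0.978  # on-demand list, May 2026
clarabel_ms, cuopt_ms = 1420, 95              # DI-500 medians from the table

def usd_per_solve(usd_per_hr: float, solve_ms: float) -> float:
    return usd_per_hr * solve_ms / 3_600_000  # $/hr x (hours per solve)

ratio = usd_per_solve(clarabel_usd_hr, clarabel_ms) / usd_per_solve(cuopt_usd_hr, cuopt_ms)
print(round(ratio, 1))  # ~12.3: per solve, the GPU box is an order of magnitude cheaper
```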

Solution quality

Both solvers find the same minimum, to about a tenth of a basis point

The Δ after-tax NAV column in the per-strategy table is the load-bearing one for anyone evaluating a solver swap on investment-policy grounds: are we getting the same trades, the same harvests, the same realised P&L? Across the five-year daily backtest and ten configurations, the maximum absolute difference in terminal NAV was 0.2 bp, and the median was zero. Trade-by-trade, the solvers agree on direction, magnitude, and lot identification on more than 99.7% of order tickets. The small set of disagreements is concentrated in days when two lots have nearly identical cost basis and either choice is objective-optimal.

[Chart: cumulative after-tax NAV trajectory, single account, 5 yrs, $1.00M growing to ≈ $1.45M, cuOPT vs CLARABEL. Illustrative · real backtest pending.]

Source: TaxView synthetic benchmark, tax-aware DI on a US large-cap (500) universe, $1M starting capital, daily rebalance, both solvers seeded with the same risk model and lot vector each day. Lines coincide to within drawing precision.
What we're shipping

A two-solver platform, not a swap

The conclusion isn't "cuOPT replaces CVXPY." It's that different solver back-ends fit different jobs, and the platform should be honest about routing each job to the back-end that fits it. The runtime now picks based on the request:

Solver router · default policy
| Job kind | Solver |
| --- | --- |
| Fleet daily rebalance · ≥ 50 accounts | cuOPT |
| Single-account daily rebalance | cuOPT (large universes), CLARABEL (small) |
| Multi-vintage backtest sweep | cuOPT |
| One-shot tool (transition / harvest / giving) | CLARABEL |
| Options overlay / VPF (small) | CLARABEL |
| R&D / non-DCP exploration | CVXPY (any backend) |
| CI · unit tests | CLARABEL (no GPU on runner) |

Both back-ends consume the same `OptimizeRequest`. The formulation lives in the strategy module and is solver-agnostic; the back-end picks up the resulting `Problem` shape and emits the same `OptimizeResponse`. Switching, in production, is a config flag.
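A minimal sketch of that routing policy. The `Backend` enum, the job-kind strings, and the 500-variable threshold for a "large" universe are assumptions; only the policy itself comes from the table above.

```python
from enum import Enum, auto

class Backend(Enum):
    CUOPT = auto()
    CLARABEL = auto()

def route(job_kind: str, n_accounts: int = 1, n_vars: int = 0,
          gpu_available: bool = True) -> Backend:
    if not gpu_available:          # CI runners carry no GPU
        return Backend.CLARABEL
    if job_kind == "fleet_rebalance" and n_accounts >= 50:
        return Backend.CUOPT
    if job_kind == "backtest_sweep":
        return Backend.CUOPT
    if job_kind == "daily_rebalance":
        # "large universe" threshold is an assumed 500 variables
        return Backend.CUOPT if n_vars >= 500 else Backend.CLARABEL
    return Backend.CLARABEL        # one-shot tools, overlays, VPF
```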

Different solver back-ends fit different jobs. The platform's job is to route each job to the back-end that fits it — and to be boring about the routing.

Caveats and the road from here

What this benchmark didn't measure (yet)

Three things this benchmark deliberately leaves on the table, and which the production suite will pick up:

  • Warm starts. Both solvers accept a warm-start primal/dual; we ran cold to measure the floor. Warm-started, CLARABEL gains roughly 2× on day-over-day rebalances; cuOPT gains less because PDHG has different convergence behaviour from a warm start. The comparison tightens but doesn't invert.
  • Mixed-integer extensions. A few customizations (round-lot trading, tax-lot bucketing beyond HIFO) push the problem mixed-integer. CLARABEL doesn't handle MI; we'd switch to SCIP or HiGHS. cuOPT handles mixed-integer linear via its MIP solver — separate from the QP/LP path benchmarked here. We have a follow-up post planned for the MI numbers.
  • Solver-failure handling. Both solvers return non-optimal status codes occasionally (numerical issues on poorly-scaled Σ, near-degenerate constraint sets). The router has identical fall-through logic for both — if a solve returns anything other than `OPTIMAL`, it falls through to a CLARABEL re-solve at tighter tolerance, and if that fails, the previous day's weights persist with an alert. Failure-rate parity matters as much as median speed. A sketch of the fall-through follows this list.
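A minimal sketch of that fall-through, assuming hypothetical `solve_with`, `previous_weights`, and `alert` interfaces standing in for the platform's real ones:

```python
def solve_with_fallthrough(request, primary_backend):
    result = solve_with(primary_backend, request)
    if result.status == "OPTIMAL":
        return result.weights

    # Fallback 1: CLARABEL re-solve at a tighter tolerance.
    retry = solve_with("CLARABEL", request, tol=1e-9)
    if retry.status == "OPTIMAL":
        return retry.weights

    # Fallback 2: persist yesterday's weights and raise an alert.
    alert(f"solver fall-through exhausted for account {request.account_id}")
    return previous_weights(request.account_id)
```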
Notes & references
  1. CLARABEL — Goulart & Chen (2024). Clarabel: an Interior-Point Solver for Convex Optimization in Symmetric Cones. oxfordcontrol.github.io/ClarabelDocs
  2. NVIDIA cuOPT — first-order primal-dual hybrid gradient (PDHG) implementation for LP/QP on NVIDIA GPUs. cuOPT Python SDK, 2024 release.
  3. CVXPY — Diamond & Boyd (2016). CVXPY: A Python-Embedded Modeling Language for Convex Optimization. JMLR 17(83), 1–5.
  4. Hardware pricing reflects AWS on-demand list as of May 2026 (us-east-1). Spot and reserved pricing changes the absolute economics but not the throughput ratio.
  5. See also the engineering post The DailyRunner: orchestration, idempotency, and the kill switch for how the router slots into the morning rebalance pipeline, and Reproducibility by snapshot for how the choice of solver back-end is stored alongside every Run record.

Educational illustration · synthetic benchmark · numbers expected to refine when the production suite lands. Nothing here is investment, tax, or legal advice.
