We Built a Neural Video Upscaler That Beats Classical Methods in a Browser — For $200 in GPU Costs

By Jordan Olivas, AxiomState · April 2026

Harvv is our behavioral analytics platform for e-commerce — a 6KB gzipped pixel that detects where visitors get stuck on your site. One of the friction signals we kept seeing was slow product videos killing page load times. A 720p product video at 3 Mbps takes 4-6 seconds to start playing on a mobile connection. That's enough to lose 30% of visitors before they see your product.

So we asked a question: what if you could serve a tiny 360p video and have the viewer's browser reconstruct it to 720p quality — using their GPU instead of your bandwidth?

That question turned into Ghost Stream, a neural super-resolution engine that's now a core feature of the Harvv platform. Here's what we built, what it actually achieves, and why the results surprised us.

The Problem

Traditional video delivery works on an ABR (Adaptive Bitrate) ladder. You encode every video at 360p, 480p, 720p, 1080p, and sometimes 4K. Five copies of every video. Five times the storage. The CDN serves whichever matches the viewer's bandwidth.

For an e-commerce site with 50 product videos, that's 250 encoded files. For a streaming platform with 10,000 hours of content, the storage bill alone runs into millions.

What if you could encode once at 360p and have the viewer's device do the rest?

The Approach

Ghost Stream is a 47,980-parameter neural network that upscales 360p video to 720p in real time. It runs entirely in the browser via WebGPU, so the viewer's GPU does the computation. Your server sends a file that's 6-8x smaller than native 720p, and the viewer sees quality that's measurably better than standard Lanczos upscaling.

The model is 94KB. Smaller than most favicons. It downloads once, caches in the browser, and processes every subsequent video at zero server cost.
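
As a rough sanity check on those numbers, the sketch below recomputes the weight-file size and the per-frame upscaling workload. It assumes the weights are stored as 16-bit floats and that 360p and 720p mean 640x360 and 1280x720; neither is stated above, and the names are illustrative only.

```python
# Back-of-envelope check of the model size and the upscaling workload.
# Assumptions (not stated in the article): fp16 weights, 16:9 frame sizes.

PARAMS = 47_980            # parameter count quoted above
BYTES_PER_PARAM = 2        # assumed: 16-bit floating-point weights


def weight_size_kib(params: int, bytes_per_param: int) -> float:
    """Raw weight storage in KiB, ignoring container or metadata overhead."""
    return params * bytes_per_param / 1024


def pixel_ratio(src: tuple[int, int], dst: tuple[int, int]) -> float:
    """How many output pixels the model produces per input pixel."""
    return (dst[0] * dst[1]) / (src[0] * src[1])


size_kib = weight_size_kib(PARAMS, BYTES_PER_PARAM)
ratio = pixel_ratio((640, 360), (1280, 720))
print(f"weights: {size_kib:.1f} KiB, output/input pixels: {ratio:.0f}x")
```

Under those assumptions the weights come to about 93.7 KiB, which is consistent with the 94KB figure, and every frame has 4x as many output pixels as input pixels.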

Three things made it work:

Adversarial baseline training. Instead of optimizing every pixel equally, we only train on pixels where the model is worse than Lanczos. This focuses the model's limited budget on regions where neural reconstruction actually adds value.

Knowledge distillation with a twist. We distilled from a larger SwinIR model (910K parameters) into our tiny SPAN model. The counterintuitive finding: a generic teacher pretrained on diverse images worked significantly better than a teacher fine-tuned on our specific video content.

Training data was the biggest lever. We spent days engineering loss functions and distillation pipelines. The single biggest improvement came from switching our training data from 4 phone video clips to DIV2K (800 diverse images). That one change improved our result by +3.87 VMAF — more than adversarial training and distillation combined.
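
The first two ideas can be sketched as a loss function. This is a minimal illustration and not the training code behind the results: it treats frames as flat lists of pixel values, assumes an L1 error, assumes the distillation term is a plain L1 distance to the teacher's output, and uses a hypothetical `distill_weight`. A real training loop would use a tensor library and run on batches.

```python
from typing import Sequence

Pixels = Sequence[float]


def l1(a: Pixels, b: Pixels) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def baseline_masked_l1(pred: Pixels, target: Pixels, baseline: Pixels) -> float:
    """L1 error averaged only over pixels where pred is worse than the baseline.

    `baseline` stands in for the Lanczos-upscaled frame. Pixels where the
    model already matches or beats it contribute nothing to the loss.
    """
    errors = [
        abs(p - t)
        for p, t, b in zip(pred, target, baseline)
        if abs(p - t) > abs(b - t)
    ]
    return sum(errors) / len(errors) if errors else 0.0


def student_loss(
    pred: Pixels,
    target: Pixels,
    baseline: Pixels,
    teacher: Pixels,
    distill_weight: float = 0.5,  # hypothetical weighting, not from the article
) -> float:
    """Masked reconstruction loss plus an assumed output-level distillation term."""
    return baseline_masked_l1(pred, target, baseline) + distill_weight * l1(pred, teacher)


# Toy 3-pixel example: only the middle pixel is worse than the baseline.
pred = [100, 50, 200]
target = [100, 80, 210]
lanczos = [90, 70, 200]
teacher = [100, 70, 205]
print(baseline_masked_l1(pred, target, lanczos))  # 30.0
```

Averaging over the masked pixels only, rather than over the whole frame, keeps the loss scale from shrinking as the model overtakes the baseline on more pixels. That is one possible design choice among several.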

The Results

We tested on publicly downloadable Blender Open Movie sequences and real camera footage from a Google Pixel phone. Every number below is reproducible.

Standard Test Sequences

Content	Type	vs Lanczos
Big Buck Bunny	Animation	+1.76 VMAF
Elephants Dream	CGI/Surreal	+9.02 VMAF
Sintel	CGI/Dark	+2.71 VMAF
Tears of Steel	Live Action + VFX	+5.35 VMAF
Average		+4.71 VMAF

Real Camera Footage (Pixel Phone)

Content	vs Lanczos
Outdoor / high motion	+3.68 to +4.31 VMAF
Indoor / static	+2.41 VMAF
Average (4 clips)	+3.53 VMAF

Across All Quality Levels

We tested at five compression levels. Ghost Stream beats Lanczos at every one, from near-lossless (+4.01 VMAF) to extreme compression (+2.83 VMAF).

Honest About the Metrics

We report VMAF NEG (the no-enhancement-gain variant, designed to resist metric gaming) alongside every result. On real camera footage, the gap between standard VMAF and VMAF NEG is just 0.09 points wider for our model than for Lanczos, which points to genuine reconstruction rather than perceptual tricks.
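
One way to read the NEG comparison is as a difference of two gaps: how many points each method loses when scored with VMAF NEG instead of standard VMAF. The sketch below spells that out. The function names are illustrative, and the scores are placeholders chosen for easy arithmetic, not measurements.

```python
def neg_gap(vmaf: float, vmaf_neg: float) -> float:
    """Points a method loses when scored with VMAF NEG instead of VMAF."""
    return vmaf - vmaf_neg


def extra_gap(model: tuple[float, float], baseline: tuple[float, float]) -> float:
    """How much wider the model's gap is than the baseline's (near 0 is good)."""
    return neg_gap(*model) - neg_gap(*baseline)


# Placeholder (vmaf, vmaf_neg) pairs, NOT measured results.
example_model = (90.0, 88.0)
example_lanczos = (85.0, 84.0)
print(extra_gap(example_model, example_lanczos))  # 1.0
```

A value close to zero means the upscaler gains little from the kind of enhancement that VMAF NEG is designed to discount.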

Temporal consistency is within 12% of Lanczos across all tested content. No visible flicker during playback.

What This Means for Video Delivery

For a streaming service with 1 million monthly viewers watching 2 hours each, we estimate $378,000/year in CDN savings from the 6-8x bandwidth reduction.
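
The arithmetic behind an estimate like this can be sketched as follows. The inputs are assumptions that the article does not state: the 2 hours are taken as per viewer per month, the baseline is 720p at the 3 Mbps mentioned in the introduction, and the CDN rate of $0.014 per GB is hypothetical. Your own traffic and pricing will differ.

```python
# Rough reconstruction of a CDN savings estimate.
# Assumed inputs (not given in the article): viewing hours are per month,
# the baseline is 720p at 3 Mbps, and CDN egress costs $0.014 per GB.

VIEWERS_PER_MONTH = 1_000_000
HOURS_PER_VIEWER_PER_MONTH = 2
BASELINE_MBPS = 3.0          # 720p bitrate mentioned in the introduction
PRICE_PER_GB_USD = 0.014     # hypothetical CDN rate


def annual_savings_usd(reduction_factor: float) -> float:
    """Yearly CDN cost avoided when files are `reduction_factor` times smaller."""
    hours_per_year = VIEWERS_PER_MONTH * HOURS_PER_VIEWER_PER_MONTH * 12
    gb_per_hour = BASELINE_MBPS * 3600 / 8 / 1000   # megabits/s -> GB/hour
    baseline_gb = hours_per_year * gb_per_hour
    saved_gb = baseline_gb * (1 - 1 / reduction_factor)
    return saved_gb * PRICE_PER_GB_USD


for factor in (6, 8):
    print(f"{factor}x smaller: ${annual_savings_usd(factor):,.0f}/year")
```

With these inputs the 6x case works out to about $378,000 per year and the 8x case to about $396,900. The result scales linearly with the assumed price per GB.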

For e-commerce, a 360p AV1 file starts playing in under 1 second on a mobile connection. Faster video means higher conversion rates.

How It's Built Into Harvv

Ghost Stream is integrated into the Harvv platform as a video optimization feature. When Harvv's pixel detects slow-loading product videos, the Speed Suite optimizes them automatically. Developers can also access Ghost Stream directly through the API.

The Journey

This took about a week of focused R&D and roughly $200 in GPU compute costs. The single largest improvement came from training data diversity — the most basic thing we should have tried first.

Try it free: harvv.com/speed
API docs: docs.harvv.com
Benchmark code: github.com/AxiomState/ghoststream-benchmark
Contact: jordan@axiomstate.com

AxiomState builds tools that make e-commerce faster and smarter. Harvv detects behavioral friction. Ghost Stream eliminates video friction.
