ArtifactNet research · Part 5 · 2/3 · May 2026

Toward Real-Time

Distillation and runtime — light ≠ fast

Intrect Research · 2026-05-28 · Research · ArtifactNet

Part 5 of the ArtifactNet research journey. This is the record of May 2026, when we tried to move SOTA detection accuracy from 4-second batch inference to streaming real-time processing. We shrank the model and it got slower; we swapped runtimes 28 ways and hit a hardware floor; and we learned the expensive lesson that a "real-time model" is not the same thing as a "detection model." Here is the engineering of forcing an accuracy model into real-time constraints — successes and failures, written plainly.

TL;DR


1. The Starting Point: Why "Real-Time" Was Needed

ArtifactNet's detection pipeline is fundamentally offline.

audio → STFT → ArtifactUNet (residual extraction) → HPSS → 7-channel forensic features
     → ResidualCNN → per-segment P(AI) → per-track median verdict

To judge one track we extract seven 4-second chunks and run them as a single batch. At 0.58s per track that is plenty, and since the user just wants the result, latency was never the issue.
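As a minimal sketch of the last pipeline step (per-segment P(AI) reduced to a per-track median verdict): the function name `track_verdict`, the 0.5 threshold, and the sample scores below are illustrative assumptions, not values taken from the ArtifactNet code.

```python
from statistics import median


def track_verdict(segment_probs, threshold=0.5):
    """Reduce per-segment P(AI) scores to one per-track score and verdict.

    The median makes the verdict robust to a single outlier segment.
    `threshold` is a hypothetical decision boundary.
    """
    if not segment_probs:
        raise ValueError("need at least one segment probability")
    p_track = median(segment_probs)
    return p_track, p_track >= threshold


# Seven hypothetical per-segment scores for one track (illustrative only).
probs = [0.91, 0.88, 0.12, 0.95, 0.90, 0.93, 0.89]
p_track, is_ai = track_verdict(probs)
```

With these sample scores the single low segment (0.12) does not move the verdict, which is the reason to prefer a median over a mean here.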

The problem came from the twin research running in the opposite direction. Using the same residual physics to remove RVQ ghosting from AI audio yields a quality-restoration tool (de-artifact), and we wanted it as an audio plugin (VST3). A plugin must process and return each 256/512-sample block the host hands it on the spot: it cannot see future frames (it must be causal), and it cannot wait 4 seconds.

In other words, the same ArtifactUNet backbone had to live in two worlds:

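|                 | Detection               | de-artifact real-time      |
|-----------------|-------------------------|----------------------------|
| Processing unit | 4s chunk batch          | tens-of-ms block streaming |
| Future frames   | allowed (bidirectional) | forbidden (causal)         |
| Latency         | irrelevant              | within hundreds of ms      |
| Output          | P(AI)                   | restored waveform          |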

May's work was the engineering of bridging that gap. And in the process, the intuition that "a smaller model is faster" was broken three times.


2. First Attempt — ArtifactUNetLite, Lightweight ≠ Fast

The first design was ArtifactUNetLite: 1.85M params (half of the 3.6M Teacher), a multi-rate architecture handling 44.1/48/96 kHz in a single model, and causal streaming for the plugin.

Several design decisions were nailed down explicitly:

All 8 unit tests passed — the multi-rate hop stayed consistent at 2.90ms across 44.1/48/96, residual+clean reconstructed the input to 8.94e-08 error, and causality held.

Then the speed measurement was a shock. Lite ran 2.5× slower than the existing detection UNet on CPU — despite halving the parameters.

Digging in, the cause was clear:

  1. hop=128 means 4× more frames per chunk (T=200 vs 48). Frame count, not parameters, dominated compute.
  2. Every stage's activation tensor stayed near-constant at C×F×T ≈ 1.7M cells. The existing detection UNet halves the tensor per stage (798k→98k), keeping cumulative cost low; Lite forbade time-axis downsampling, so a large T persisted all the way through.
  3. Parameter reduction saves weight memory, not FLOPs. Even splitting T into 50–100 and calling multiple times yielded only a 24% improvement (1085ms→824ms).

First lesson: parameter count is not a proxy for speed. Real-time speed comes from (a) reducing frame count with a large hop, and (b) tapering channels toward the decoder to lower cumulative cost — Lite gave up both.
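The activation argument above can be checked with back-of-the-envelope arithmetic. This sketch sums activation cells over the encoder stages for a tapered network versus a flat one. The cell counts come from the text (798k at the first stage, about 1.7M per stage for Lite); the stage count of 4 and the exact halving factor are assumptions for illustration.

```python
def cumulative_cells(first_stage_cells, stages, shrink):
    """Sum activation cells over `stages` stages.

    Each stage holds `shrink` times the cells of the previous one
    (shrink=0.5 halves per stage, shrink=1.0 keeps the tensor constant).
    """
    total = 0.0
    cells = float(first_stage_cells)
    for _ in range(stages):
        total += cells
        cells *= shrink
    return total


STAGES = 4  # assumption: four stages

# Detection UNet: starts at ~798k cells and halves at every stage.
tapered = cumulative_cells(798_000, STAGES, shrink=0.5)

# Lite: ~1.7M cells at every stage, because time-axis downsampling is forbidden.
flat = cumulative_cells(1_700_000, STAGES, shrink=1.0)

ratio = flat / tapered
```

Under these assumptions the flat network touches more than four times as many activation cells, independent of how many parameters it has.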

We did not discard Lite. The single multi-rate model, zero-lookahead causal streaming, and small weight memory (mobile/storage friendly) remained genuine value. But we nailed down that it would never again be proposed as an "RT acceleration" path.


3. ArtifactUNetRT Series — Short-Frame Causal, but a Different Language for Detection

Next was the ArtifactUNetRT series for true real-time removal. Instead of 4-second chunks, it streams short frames:

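| Name | T frames | Chunk length | Use                     |
|------|----------|--------------|-------------------------|
| RT8  | 8        | ~93ms        | lowest latency          |
| RT16 | 16       | ~186ms       | recommended low-latency |
| RT24 | 24       | ~278ms       |                         |
| RT48 | 48       | ~557ms       | standard mode           |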
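The chunk lengths in the table follow from frame count times hop duration. The sketch below reproduces them under two assumptions that the table does not state: a 44.1 kHz sample rate and a hop of 512 samples.

```python
SAMPLE_RATE = 44_100  # assumption: 44.1 kHz
HOP = 512             # assumption: hop size in samples


def chunk_ms(frames, hop=HOP, sample_rate=SAMPLE_RATE):
    """Duration in milliseconds covered by `frames` STFT hops."""
    return 1000.0 * frames * hop / sample_rate


lengths = {f"RT{t}": chunk_ms(t) for t in (8, 16, 24, 48)}
```

The chunk length is the minimum buffering delay of each variant before any inference time is added.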

Two key techniques:

The RT series uses N_FFT=1024 (117K params) — a small spectral input optimized for low latency. It was excellent for VST3 real-time removal.

But here we fell into a trap. "Couldn't we just use this fast RT model in the detection pipeline too?" We extracted its residual and pushed it through the existing detection CNN — and CNN compatibility collapsed to 30%.

The cause was simple and fundamental. The detection CNN was trained on the N_FFT=2048 spectral basis. The RT model uses N_FFT=1024, a different input space entirely. With half the frequency resolution in the residual, the CNN sees a distribution it has never encountered. Two models with different input spaces could not even be compared directly.

Second lesson: a spectral basis changed for real-time is incompatible with downstream models. Reducing N_FFT for speed was a removal-only choice that could not cross over into detection.
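One way to see the mismatch is to count frequency bins: a one-sided STFT yields N_FFT/2 + 1 bins, so the two models do not even produce tensors of the same height. The guard function `check_compatible` below is a hypothetical helper, not part of the pipeline described here.

```python
def num_bins(n_fft):
    """Number of one-sided STFT frequency bins for a given FFT size."""
    return n_fft // 2 + 1


def check_compatible(producer_n_fft, consumer_n_fft):
    """Raise if a residual producer and a downstream model disagree on bins."""
    produced, expected = num_bins(producer_n_fft), num_bins(consumer_n_fft)
    if produced != expected:
        raise ValueError(
            f"spectral basis mismatch: producer emits {produced} bins, "
            f"consumer was trained on {expected}"
        )


detection_bins = num_bins(2048)  # basis the detection CNN was trained on
rt_bins = num_bins(1024)         # basis used by the RT series
```

A shape check like this only catches the mismatch early. Matching shapes alone would not guarantee that the downstream CNN sees the distribution it was trained on.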


4. ArtifactUNetFast — Teacher Distillation, the Drop-In Answer

The third model became the bridge between the two worlds. ArtifactUNetFast is a 534K model, 6.7× smaller than the Teacher (3.6M), that preserves N_FFT=2048. With the same spectral basis it drops straight into the detection pipeline, while also supporting causal streaming.

| Item         | Spec                                                                         |
|--------------|------------------------------------------------------------------------------|
| Parameters   | 534K (6.7× smaller than Teacher)                                             |
| Architecture | CausalDSConv + CausalCLN, 4-level U-Net, base_channels=24                    |
| Input        | (B, 2, 1025, T): cat([H_mag, P_mag]), single forward (Teacher does two)      |
| Output       | (B, 2, 1025, T): [mask_H, mask_P] ∈ [0, 0.5]²                                |
| Training     | knowledge distillation from frozen Teacher                                   |
| Loss         | 1.0 × MSE(mask) + 0.1 × spectral_convergence                                 |
| Data         | AIME + Jamendo + lo-fi hiphop (12K files, 60K chunks, SR pool 44.1/48k)      |
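The loss row can be sketched in plain Python over flattened masks. Two assumptions to flag: spectral convergence is taken in its usual form (norm of the difference divided by norm of the target), and it is applied here to the masks themselves, whereas the actual training code may apply it to masked spectra. A real implementation would use tensors rather than lists.

```python
import math


def mse(student, teacher):
    """Mean squared error between two equal-length flat sequences."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def spectral_convergence(student, teacher, eps=1e-8):
    """||teacher - student|| / ||teacher|| (Frobenius norm on flat data)."""
    num = math.sqrt(sum((t - s) ** 2 for s, t in zip(student, teacher)))
    den = math.sqrt(sum(t * t for t in teacher))
    return num / (den + eps)


def distill_loss(student_mask, teacher_mask, w_mse=1.0, w_sc=0.1):
    """Weighted sum matching the 1.0 x MSE + 0.1 x SC weighting in the table."""
    return (w_mse * mse(student_mask, teacher_mask)
            + w_sc * spectral_convergence(student_mask, teacher_mask))


# Tiny hypothetical masks, values inside the [0, 0.5] mask range.
teacher = [0.3, 0.4]
student = [0.3, 0.0]
loss = distill_loss(student, teacher)
```

The Teacher stays frozen, so only the student's mask carries gradients in the real setup. This sketch shows the weighting only.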

The core of the distillation is imitating the Teacher's mask while bundling the input into 2 channels so H/P are processed in a single forward (the Teacher runs H and P twice). After converging to best loss 0.06645, we compared against the Teacher on 10 tracks:

| Metric               | Value                                   |
|----------------------|-----------------------------------------|
| Mask Pearson         | H=0.84, P=0.86 (strong correlation)     |
| Mask MAE             | H=0.080, P=0.073                        |
| Mask distribution    | Student 0.28 vs Teacher 0.31            |
| Spectral Convergence | 0.38 (residual character preserved)     |

The mask is somewhat coarser than the Teacher's but the distribution and character are preserved — residual extraction works correctly.
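For reference, the two mask-agreement metrics reported above (Pearson correlation and mean absolute error) can be computed as follows. This is a generic sketch over flattened mask values, and the sample data in the usage lines is made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mae(x, y):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical flattened mask values for a student and a teacher.
student_mask = [0.10, 0.20, 0.30, 0.25]
teacher_mask = [0.12, 0.24, 0.33, 0.31]
r = pearson(student_mask, teacher_mask)
err = mae(student_mask, teacher_mask)
```

Pearson measures whether the student's mask rises and falls with the Teacher's, while MAE measures how far apart the values are, so a mask can correlate well and still be offset.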

The speed was the headline. On PyTorch it's similar to the Teacher, but moving to ONNX Runtime changes everything:

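| Model                         | ORT CPU 4-thread p50 | RTF  | Total latency |
|-------------------------------|----------------------|------|---------------|
| Teacher (PyTorch, 2× forward) | ~998ms               | 0.25 | ~4s           |
| Fast (PyTorch, 1× forward)    | ~1066ms              | 0.27 | ~4s           |
| Fast ORT T=16                 | 18.5ms               | 0.10 | 204ms         |
| Fast ORT T=24                 | 27.8ms               | 0.10 | 306ms         |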

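Per-call latency drops by more than 50× on ORT vs PyTorch (~1066ms to 18.5ms), although most of that comes from the shorter chunk: in RTF terms the gain is about 2.7× (0.27 to 0.10). At T16, chunk (186ms) + inference (18.5ms) sums to roughly 204ms total latency, inside the real-time plugin budget.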
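The two quantities that matter for a streaming plugin are the real-time factor (inference time divided by the audio duration it covers) and the total latency (buffering plus inference). A small sketch using the chunk and inference times quoted in this section, where the 278ms chunk for T=24 is inferred from the reported 306ms total:

```python
def rtf(infer_ms, chunk_ms):
    """Real-time factor: below 1.0 means faster than real time."""
    return infer_ms / chunk_ms


def total_latency_ms(chunk_ms, infer_ms):
    """Delay before output is available: fill the chunk, then run inference."""
    return chunk_ms + infer_ms


# (chunk length ms, ORT inference p50 ms) as quoted in this section.
configs = {"T16": (186.0, 18.5), "T24": (278.0, 27.8)}

report = {
    name: (rtf(infer, chunk), total_latency_ms(chunk, infer))
    for name, (chunk, infer) in configs.items()
}
```

Both configurations land at an RTF of about 0.10, so the choice between them is a latency trade-off rather than a throughput one.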

We exported three artifact forms — raw ONNX (for ORT runtime), a CausalCLN surgery version (Rust/tract-only custom op), and an onnxsim-optimized version. Note the surgery version fails session creation on ORT (custom op unsupported), so it is Rust/tract-only. This fork leads into the runtime war of the next section.


5. The Runtime War — From tract to TensorRT, Exhaustively

With the model fixed, the question became which runtime to run it on. To shave de-artifact real-time-removal latency (RT48 stereo), we exhausted nearly every inference engine and setting.

CPU ladder

| Stage                                  | RTF (stereo) |
|----------------------------------------|--------------|
| tract sequential (start)               | 0.416        |
| tract parallel L‖R                     | 0.218        |
| ORT FP32 4‖4 migration                 | 0.031        |
| ORT config tuning                      | 0.029        |
| CausalCLN→LayerNorm ONNX surgery       | 0.023        |
| theoretical minimum (custom op needed) | ~0.018       |

Simply moving from tract to ONNX Runtime improved RTF from 0.416 to 0.031, roughly a 13× gain (tract's sequential execution vs ORT's MLAS optimization plus L/R session parallelism). Surgery took it to 0.023, about 18× better than the starting point. The absolute CPU floor (Ryzen 5800X, AVX2 FP32) was stereo 11.9ms, RTF 0.022.

GPU ladder

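| Setting          | Latency (p50) | RTF    | vs CPU |
|------------------|---------------|--------|--------|
| CPU stereo 4‖4T  | 12.2ms        | 0.0223 |        |
| CUDA seq L→R     | 3.9ms         | 0.0072 | 3.1×   |
| CUDA Graph FP16  | 2.52ms        | 0.0046 | 4.8×   |
| TRT FP16 seq     | 1.71ms        | 0.0031 | 7.1×   |
| TRT FP16 batch=2 | 1.44ms        | 0.0026 | 8.5×   |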

The GPU (RTX 3060) final floor was TensorRT FP16 batch=2 at 1.44ms (RTF 0.0026). Bundling stereo L+R into the batch dimension of a single forward was always faster than multi-stream (TRT's batch fusion allocates SM resources optimally). Forcing LayerNorm to FP16 (disabling the FP32 fallback) added another 5%.

The graveyard of alternatives (honestly)

Paths that looked fast but failed or regressed:

After exhausting all 28 ORT SessionBuilder options, we concluded that 11.5–12.3ms is the absolute minimum for this hardware. Reducing it further would require model retraining (shrinking) or new hardware. This is not a defeat but a boundary established — knowing what's possible lets you decide what to give up above it.


6. CausalCLN → LayerNorm Surgery — Same Numbers, Faster

A note on the CausalCLN→LayerNorm surgery that recurred throughout the runtime ladder. It was the smallest yet cleanest optimization.

The training-time CausalChannelLayerNorm is a custom op that computes (C,F) statistics per frame. It guarantees causality, but it does not map cleanly to standard ONNX ops in the inference graph. So we operated on the graph, swapping it for a numerically equivalent standard LayerNorm + transpose pattern.

The key point is that the output is bit-for-bit identical while only the runtime accelerates, because ORT recognizes the standard LayerNorm pattern and runs it as a fused kernel. Re-validated on ORT 1.24.x, surgery gave a measured 3.69ms improvement. In the profile, LayerNorm took 20.1% of the total and the surgery-created Transpose took 16.8%, yet it was still a net win.
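The equivalence behind the surgery can be illustrated in plain Python: normalizing each frame with statistics taken over (C, F) gives the same result as moving time to the front, applying an ordinary LayerNorm over the flattened (C, F) axis, and moving time back. This is a sketch of the math only, under the assumptions that the learned scale and bias are omitted and that eps is 1e-5. It is not the actual graph rewrite.

```python
import math


def per_frame_norm(x, eps=1e-5):
    """x[c][f][t]: normalize each frame t with stats over all (c, f)."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * T for _ in range(F)] for _ in range(C)]
    for t in range(T):
        vals = [x[c][f][t] for c in range(C) for f in range(F)]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        for c in range(C):
            for f in range(F):
                out[c][f][t] = (x[c][f][t] - mean) * inv_std
    return out


def layer_norm(row, eps=1e-5):
    """Standard LayerNorm over one flat row (no affine parameters)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]


def transpose_layernorm_transpose(x, eps=1e-5):
    """(C, F, T) -> (T, C*F), LayerNorm each row, then back to (C, F, T)."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    rows = [[x[c][f][t] for c in range(C) for f in range(F)] for t in range(T)]
    normed = [layer_norm(row, eps) for row in rows]
    return [[[normed[t][c * F + f] for t in range(T)] for f in range(F)]
            for c in range(C)]


# Small deterministic test tensor with C=2, F=3, T=4.
x = [[[math.sin(1.0 + c + 2 * f + 3 * t) for t in range(4)]
      for f in range(3)] for c in range(2)]
a = per_frame_norm(x)
b = transpose_layernorm_transpose(x)
max_diff = max(abs(a[c][f][t] - b[c][f][t])
               for c in range(2) for f in range(3) for t in range(4))
```

Because each frame's statistics depend only on that frame, both forms are causal: changing a later frame cannot alter the output of an earlier one.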

An interesting counterexample: implementing the same LayerNorm directly as a Rust custom op actually regressed mono by 1.3ms. For the small T=48 loop, ORT's MLAS fused implementation beat the loop-transposed Rust. "A hand-written kernel is always faster" was also a false intuition.


7. The Detection Pipeline's Answer — Teacher ONNX

While real-time removal forked into RT/Fast, we also checked whether the detection pipeline itself could move to ORT. The answer was to export the Teacher directly to ONNX.

| Item                  | Result                      |
|-----------------------|-----------------------------|
| ORT 4-thread mono     | 275.8ms (<500ms target met) |
| CNN verdict agreement | 100% (9/9)                  |
| Numerical error       | max 2.56e-06                |
| Parameters            | 3.6M                        |
| ONNX nodes            | 76 (simple UNet structure)  |
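Figures like the verdict agreement and maximum numerical error in the table can be computed along these lines. This is a minimal sketch assuming both runtimes have already produced outputs for the same inputs. The function names, the 0.5 threshold, and the sample values are hypothetical.

```python
def verdict_agreement(ref_probs, test_probs, threshold=0.5):
    """Fraction of tracks where both runtimes land on the same verdict."""
    matches = sum(
        (r >= threshold) == (t >= threshold)
        for r, t in zip(ref_probs, test_probs)
    )
    return matches / len(ref_probs)


def max_abs_error(ref_values, test_values):
    """Largest element-wise absolute difference between two output sets."""
    return max(abs(r - t) for r, t in zip(ref_values, test_values))


# Hypothetical per-track P(AI) from a reference run and an exported run.
reference = [0.97, 0.91, 0.08]
exported = [0.970002, 0.909999, 0.080001]
agreement = verdict_agreement(reference, exported)
worst = max_abs_error(reference, exported)
```

Checking both matters: a tiny numerical error can still flip a verdict that sits near the threshold, and agreement on a handful of tracks says little about the error magnitude.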

Preserving N_FFT=2048 gave 100% verdict agreement with the detection CNN (numerical error 2.56e-06). This is the decisive difference from the RT model:

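| Model                      | ORT speed     | CNN compat | Use                |
|----------------------------|---------------|------------|--------------------|
| RT (N_FFT=1024, 117K)      | 12.3ms stereo | 30%        | VST3 only          |
| Teacher (N_FFT=2048, 3.6M) | 275.8ms mono  | 100%       | detection pipeline |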

For detection, the answer is the large Teacher, not the small RT model — the criterion was compatibility, not speed.


8. The Most Expensive Lesson — A "Real-Time Model" Is Not a "Detection Model"

This is the heart of the article. Throughout May there was a persistent temptation to use the fast models directly in the detection pipeline. It failed three times, each for a different reason.

(1) ArtifactUNetRT — spectral basis mismatch. N_FFT=1024, so CNN compatibility was 30%. (§3)

(2) ArtifactUNetLite — parameter inversion. 1.85M, yet 2.5× slower than the detection UNet on CPU. (§2)

(3) ArtifactUNetFast — a two-layer trap.

Fast preserves N_FFT=2048, so it has none of RT's compatibility problem. That raised the natural question: "What if we retrain the detection CNN on Fast?" A residual comparison revealed two things:

| UNet path                  | TPR (AI) | FPR (Real) |
|----------------------------|----------|------------|
| codec4 (current production)| 98.0%    | 0.0%       |
| phase2 (Fast's teacher)    | 100.0%   | 55.0%      |
| Fast (drop-in)             | 98.0%    | 52.5%      |

The correct order was clear — to detect with Fast, you must first re-distill from a codec-aware teacher (codec4) and then stack the CNN on top. And if real-time streaming is not the goal, detection needs no causal constraint, so a bidirectional lightweight student wins on both accuracy and efficiency.
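For readers less familiar with the two columns in the comparison above, TPR is the share of AI tracks flagged as AI and FPR is the share of real tracks wrongly flagged. A small sketch with toy labels (not the evaluation data):

```python
def tpr_fpr(labels, predictions):
    """labels/predictions: True means "AI". Returns (TPR, FPR)."""
    tp = sum(1 for l, p in zip(labels, predictions) if l and p)
    fn = sum(1 for l, p in zip(labels, predictions) if l and not p)
    fp = sum(1 for l, p in zip(labels, predictions) if not l and p)
    tn = sum(1 for l, p in zip(labels, predictions) if not l and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy example: 4 AI tracks followed by 4 real tracks.
labels = [True, True, True, True, False, False, False, False]
predictions = [True, True, True, False, True, True, False, False]
tpr, fpr = tpr_fpr(labels, predictions)
```

A path can look strong on TPR alone while flagging half of the real tracks, which is why the two numbers have to be read together.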

Third and biggest lesson: real-time constraints (causal, short hop, small N_FFT, 2-channel fusion) collide head-on with detection accuracy and speed. The ambition to use one backbone for two worlds extracted a different cost every time. The conclusion was to split roles: removal uses Fast/RT, detection uses Teacher/codec4.


9. Side Notes — Adversarial Evasion and Lightweight SOTA Comparison

Two checks rounded out the month.

Adversarial evasion test. The purpose of de-artifact is audio-quality improvement, not detection evasion. Still, as due diligence, we self-checked whether our suppression tool could be abused for evasion — passing the original and outputs at suppression strengths α=0.5/1/2/4 through the detection pipeline and measuring the change in P(AI). (Detection robustness is covered in a later part.)
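The shape of such a sweep can be sketched as follows. `suppress` and `detect` are hypothetical callables standing in for the de-artifact tool and the detection pipeline, and the toy versions below exist only to make the sketch runnable. They say nothing about how the real tool behaves.

```python
def evasion_sweep(track, suppress, detect, alphas=(0.5, 1.0, 2.0, 4.0)):
    """Return the change in P(AI) relative to the original, per strength."""
    baseline = detect(track)
    return {alpha: detect(suppress(track, alpha)) - baseline for alpha in alphas}


# Toy stand-ins (assumptions for illustration, not the real components).
def toy_suppress(track, alpha):
    """Attenuate samples more strongly as alpha grows."""
    return [s / (1.0 + alpha) for s in track]


def toy_detect(track):
    """Pretend P(AI) is the mean absolute sample value, capped at 1."""
    return min(1.0, sum(abs(s) for s in track) / len(track))


deltas = evasion_sweep([0.8, 0.6, 0.7], toy_suppress, toy_detect)
```

Reporting the delta per strength, rather than a single pass/fail, shows whether evasion risk grows with suppression strength.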

Lightweight SOTA comparison. Around the same time, on a small subset (SONICS fake 150 + real 150), we matched up against an external baseline (a MERT-based 2-stage model, 174M params). The 4.2M ArtifactNet led with TPR 100%, FPR 10.7%, F1 0.949, beating the 174M model (F1 0.929), and was 4.5× faster at 0.58s/track vs 2.62s/track. Given the subset scale, the point is not the absolute numbers but the edge despite a 40× parameter gap — a reconfirmation of this series' premise that a small model is the starting point for going real-time.
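The reported F1 follows from the TPR, FPR, and subset sizes quoted above. This sketch derives it from those rates, and the helper name `f1_from_rates` is an illustrative choice.

```python
def f1_from_rates(tpr, fpr, n_pos, n_neg):
    """F1 for the positive (AI) class from rates and class counts."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Rates and counts quoted in the text: TPR 100%, FPR 10.7%, 150 fake + 150 real.
f1 = f1_from_rates(tpr=1.0, fpr=0.107, n_pos=150, n_neg=150)
```

With perfect recall, F1 is driven entirely by precision, so on a balanced subset the false positive rate alone determines the score.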


10. Lessons and What's Next

Parameter count is not speed. Two inversions — Lite (2.5×) and Fast (3.44×) — pointed at the same truth: real-time speed comes from hop size, channel taper, activation tensor size, and kernel launch count. Before shrinking a model, profile what is actually slow.

Real-time optimization breaks downstream compatibility. N_FFT reduction (RT) and causal/2-channel fusion (Fast) all collided with the downstream detection CNN. To reuse one backbone for multiple purposes, you must first decide which axes to share and which to fork. We shared the spectral basis (N_FFT=2048) and forked causality and channel fusion.

Runtime exhaustion is not defeat but boundary-setting. Running tract→ORT→TVM→OpenVINO→TensorRT to pin the hardware floor (CPU 11.9ms, GPU 1.44ms) let us reasonably decide, above that floor, what to give up and what to retrain.

In the next part, we take this refined SOTA out to attack its remaining weaknesses — and end up doubting our own evaluation methodology instead.


ArtifactNet Research Team · May 2026
