ArtifactNet research · Part 6 · 3/3 · Jun 2026

Beyond SOTA

Negative results, evaluation leakage, and what we measured wrong

Intrect Research · 2026-06-30 · Research · ArtifactNet

A research log from after ArtifactNet v9.5. This is the story of two months in which we set out to tackle the "remaining weaknesses" after hitting SOTA, only to end up questioning our own evaluation methodology. We've kept the failed experiments and the traps we found alongside the wins — because we believe that's the more valuable record in this field.

TL;DR


1. Starting Point: Where v9.5 Stood

ArtifactNet is a forensic framework for detecting AI-generated music (Suno, Udio, Stable Audio, Riffusion, MusicGen, and so on). The core pipeline looks like this:

audio → STFT → ArtifactUNet (residual extraction) → HPSS (harmonic/percussive separation)
     → 7-channel forensic features → ResidualCNN → per-segment P(AI) → per-song median verdict
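
The last step of the pipeline, turning per-segment P(AI) values into a per-song verdict with the median, can be sketched as follows. This is a minimal illustration: the function name `song_verdict` and the 0.5 decision threshold are assumptions, not values taken from ArtifactNet.

```python
from statistics import median

def song_verdict(segment_probs, threshold=0.5):
    """Aggregate per-segment P(AI) into one per-song score via the median.

    The 0.5 threshold is an illustrative assumption, not ArtifactNet's
    actual operating point.
    """
    if not segment_probs:
        raise ValueError("need at least one segment probability")
    score = median(segment_probs)
    return score, score >= threshold

# One outlier segment (0.12) does not move the median verdict.
score, is_ai = song_verdict([0.97, 0.97, 0.12, 0.99, 0.95])
```

The median is robust to a minority of outlier segments, which a per-song mean would not be.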

At just 4.2M parameters in total, it consistently outperforms large transformer-class baselines.

| System | Params | SONICS F1 | FPR |
| --- | --- | --- | --- |
| ArtifactNet v9.5 | 4.2M | 0.9993 | 0.086% |
| SpecTTTra (α-120s) | 18.7M | 0.8874 | 17.97% |
| CLAM (MoM) | 194M | 0.7652 | 67.16% |

On top of that, we recorded an ArtifactBench 4-way (four-codec evaluation) F1 of 0.9861 and a MoM 40K full F1 of 0.9832.
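
For readers who want the definitions behind the F1 and FPR figures quoted in this post, here is a small sketch that derives them from confusion counts, treating "AI-generated" as the positive class. The counts in the example are made up for illustration and are not taken from any benchmark above.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, and F1 with 'AI-generated' as the positive class."""
    tpr = tp / (tp + fn)            # detection rate on AI tracks
    fpr = fp / (fp + tn)            # genuine recordings flagged as AI
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "fpr": fpr, "f1": f1}

# Illustrative counts only: 1,000 AI tracks and 1,000 genuine recordings.
m = detection_metrics(tp=990, fp=5, tn=995, fn=10)
```

Note that F1 depends on the class balance of the eval set, while TPR and FPR do not, which is why the tables report them separately.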

The heart of the design is residual physics. AI generators internally pass through neural codecs such as RVQ (Residual Vector Quantization), leaving behind characteristic quantization traces ("RVQ ghosting"). ArtifactUNet isolates only these residual components, and the CNN learns the distributional consistency of the residual. AI music has abnormally high consistency here.
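
To make the RVQ idea concrete, here is a toy residual vector quantizer in NumPy. It is a sketch under stated assumptions: the codebooks are random rather than learned, and the sizes are arbitrary. It does not reproduce any generator's actual codec; it only shows that each stage quantizes what the previous stage left over, and that a final quantization error always remains.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = np.asarray(x, dtype=float).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:                     # codebook shape: (K, D)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))                  # nearest codeword
        indices.append(k)
        quantized += codebook[k]
        residual = residual - codebook[k]          # passed on to the next stage
    return indices, quantized, residual

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# Random toy codebooks with shrinking scale; real codecs learn these.
codebooks = [rng.normal(scale=0.5 ** s, size=(16, 8)) for s in range(4)]
indices, quantized, residual = rvq_encode(x, codebooks)
```

Here `residual` is the error left after the last stage, so `quantized + residual` reconstructs `x` exactly.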

The two "remaining weaknesses" we perceived at the time were:

  1. udio detection rate of 85–91% — other generators were at 99%+, but udio alone was low.
  2. hard-real false positives — false positives occurred on difficult genuine recordings (lo-fi, LP, SoundCloud uploads).

The rest of this post is the record of what happened when we set out to attack (1).


2. Attacking the Weakness: Three Axes for Improving Residual Extraction

To solve the udio weakness, we used an external deep-research run (adversarial cross-validation across 105 agents) to derive three hypotheses and implemented all of them.

P1. Real-pass Decoder Augmentation

Hypothesis: If we pass genuine recordings through several neural codecs (Encodec, DAC, etc.) to create training pairs, we cover the traces of diverse decoder families and improve udio generalization.

Result: udio detection rate 85% → 85%, no change. The cause was that transfer between decoder families is essentially zero. Since udio uses an undisclosed proprietary decoder, no amount of training on Encodec/DAC transferred.
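
The data-construction step in P1 can be sketched as below. Everything here is an illustrative assumption: the codecs are passed in as plain callables, a uniform quantizer stands in for a real neural codec round trip (no Encodec or DAC API is called), and the choice of `decoded - clip` as the training target is a guess at how such pairs might be used.

```python
import numpy as np

def uniform_quantizer(bits):
    """Stand-in for a codec round trip: quantize samples in [-1, 1]."""
    levels = 2 ** bits
    def roundtrip(audio):
        clipped = np.clip(audio, -1.0, 1.0)
        scaled = np.round((clipped + 1.0) / 2.0 * (levels - 1))
        return scaled / (levels - 1) * 2.0 - 1.0
    return roundtrip

def make_real_pass_pairs(real_clips, codecs):
    """Pass each genuine clip through each codec and keep the induced residual."""
    pairs = []
    for clip in real_clips:
        for name, roundtrip in codecs.items():
            decoded = roundtrip(clip)
            pairs.append({"codec": name, "input": decoded,
                          "target_residual": decoded - clip})
    return pairs

rng = np.random.default_rng(0)
clips = [rng.uniform(-1, 1, size=16000) for _ in range(3)]
codecs = {"toy_8bit": uniform_quantizer(8), "toy_4bit": uniform_quantizer(4)}
pairs = make_real_pass_pairs(clips, codecs)
```

Each genuine clip yields one pair per codec, so coverage grows with the number of decoder families, which is exactly the axis that failed to transfer to udio.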

P2. Deconvolution Peak Fingerprint

Hypothesis: Borrowing from ISMIR 2025 research showing that the stride structure of transposed convolution leaves deterministic peaks in the spectrum, we use these peaks as an auxiliary feature.

Result: The feature reached 99%+ in-distribution but collapsed to 25–40% on unseen udio. It was weak in exactly the same spot where our main model is weak, so it provided no independent, complementary signal.

P3. Quiet-Segment Channel (9-channel log-ratio)

Hypothesis: The udio segments that the model missed were on average 2.6 dB quieter. In a multiplicative-mask structure, the signal in quiet segments cannot be amplified, so adding a volume-invariant log-ratio channel (mel(residual) − mel(original)) should restore it.

Result: Training validation F1 was 0.9949, which looked all but perfect. Yet on held-out data the detection rate collapsed to 67% overall and 30% on udio. The channel had memorized the training distribution and failed to generalize.
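
The volume invariance that motivated P3 is easy to check numerically. The sketch below assumes power spectrograms as input and uses a random non-negative matrix as a placeholder for a mel filterbank; the function name and shapes are illustrative, not the actual 9-channel implementation.

```python
import numpy as np

def log_ratio_channel(residual_power, original_power, filterbank, eps=1e-10):
    """log-mel(residual) - log-mel(original).

    Shapes: residual/original (F, T), filterbank (M, F) -> output (M, T).
    """
    mel_res = filterbank @ residual_power
    mel_orig = filterbank @ original_power
    return np.log(mel_res + eps) - np.log(mel_orig + eps)

rng = np.random.default_rng(0)
F, T, M = 64, 10, 8
filterbank = rng.uniform(0.0, 1.0, size=(M, F))   # placeholder, not a real mel bank
original = rng.uniform(0.1, 1.0, size=(F, T))
residual = 0.05 * rng.uniform(0.1, 1.0, size=(F, T))

loud = log_ratio_channel(residual, original, filterbank)
gain = 10 ** (-2.6 / 10)                          # 2.6 dB quieter, in power
quiet = log_ratio_channel(gain * residual, gain * original, filterbank)
```

Because the gain multiplies both terms, it cancels in the log difference, so `loud` and `quiet` agree up to the `eps` floor. The invariance itself holds; what failed was generalization.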

All three approaches failed. At this point we stopped and asked: "Is changing the residual representation itself a dead end?"


3. What the Failures Told Us

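A negative result is information in its own right. Taken together, the three failures reveal, in reverse, how ArtifactNet works: P1 showed that nothing transfers between decoder families, P2 that the deconvolution-peak fingerprint is weak exactly where the main model is weak, and P3 that a structural fix built on the quiet-segment diagnosis does not generalize. That left a suspicion: perhaps the diagnosis itself was wrong.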

This suspicion carries into the next section.


4. Questioning the Evaluation: Data Leakage

While evaluating a next-generation CNN candidate, we discovered widespread leakage between the training manifest and the benchmarks.

| Benchmark source | Leakage ratio |
| --- | --- |
| suno CDN eval set | 193 / 200 |
| udio CDN eval set | 193 / 200 |
| real recordings (YouTube hardneg, etc.) | most of them |

In other words, a large part of that "validation F1 of 0.993" was not generalization but memorization. The songs in the eval set had already appeared in training.

To fix this, we reproduced the training pipeline with the same seed to extract the exact set S of files that actually appeared in training, then subtracted S from the live data to construct a leak-free clean held-out set. In the process, we confirmed a 100% match with the training logs.
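
The subtraction step amounts to a set difference keyed on a stable file identifier. A minimal sketch, assuming each file has such an ID (the IDs and the function name below are hypothetical):

```python
def build_clean_heldout(candidate_ids, training_ids):
    """Split candidates into leak-free held-out items and leaked items."""
    training = set(training_ids)                   # the set S seen in training
    clean = [i for i in candidate_ids if i not in training]
    leaked = [i for i in candidate_ids if i in training]
    return clean, leaked

# Hypothetical IDs for illustration.
candidates = ["a1", "b2", "c3", "d4"]
training_set_S = {"b2", "d4", "z9"}
clean, leaked = build_clean_heldout(candidates, training_set_S)
```

Keeping the `leaked` list as well makes it possible to report the leakage ratio per source, as in the table above.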

Lesson: Leaked evaluation doesn't just inflate strengths. As we'll see later, it also distorts weaknesses.


5. The Next-Generation Model and the Hard-Real Benchmark

On the leak-free clean held-out set, we made a fair comparison between the next-generation CNN candidate (fine-tuned on CDN data) and v9.5.

| Model | TPR | FPR | F1 |
| --- | --- | --- | --- |
| Next-gen candidate | 99.39% | 1.36% | 0.9931 |
| v9.5 | 99.06% | 9.29% | 0.9695 |

Interestingly, the key gain was not in AI detection rate but in reducing false positives (FPR) on genuine recordings. In particular, accuracy on LP and vintage-style real recordings rose from 78% to 97%. This appears to be the result of strengthening the "genuine recording" representation by adding CDN's MP3/Opus real recordings to training.

Here we built one more thing — a multi-source hard-real benchmark (9 sources, 3,050 songs: Jamendo, FMA, YouTube, SoundCloud, lo-fi, LP, etc.). It's a stress test that removes single-source bias.

| Model | hard-real FPR |
| --- | --- |
| Next-gen candidate | 15.57% |
| v9.5 | 32.25% |

That cuts the false-positive rate roughly in half, but it is still in the 15% range. SoundCloud amateur uploads, FMA, and lo-fi emerged as the residual weaknesses. This, to give it away early, was the real weakness.
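
A per-source breakdown is what makes a multi-source benchmark useful, since a single aggregate FPR can hide which sources drive the false positives. A small sketch, assuming each record is a `(source, flagged_as_ai)` pair for a genuine recording (the records below are invented for illustration):

```python
from collections import defaultdict

def fpr_by_source(records):
    """records: iterable of (source, flagged_as_ai) for genuine recordings only."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for source, flagged_as_ai in records:
        total[source] += 1
        flagged[source] += int(flagged_as_ai)
    return {source: flagged[source] / total[source] for source in total}

# Invented records, not benchmark data.
records = [
    ("lofi", True), ("lofi", False),
    ("jamendo", False), ("jamendo", False),
    ("fma", True), ("fma", False), ("fma", False), ("fma", False),
]
per_source = fpr_by_source(records)
```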

On top of this, we introduced a codec-TTA (test-time augmentation) operating point. By converting the input to MP3/AAC/Opus and blending the results, we made the verdict consistent even when the same song differs only in format.
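
In outline, codec-TTA scores several re-encoded copies of the same input and blends the scores. The sketch below makes several assumptions: the blend is a plain mean, the original input is included alongside the transcoded variants, and the transcoders are passed in as callables. The toy scorer and transcoders exist only to make the example runnable; they are not MP3/AAC/Opus encoders.

```python
from statistics import fmean

def codec_tta_score(audio, score_fn, transcoders):
    """Average P(AI) over the input and each transcoded variant of it."""
    variants = [audio] + [transcode(audio) for transcode in transcoders]
    return fmean(score_fn(v) for v in variants)

# Toy stand-ins so the sketch runs without any audio library.
def toy_score(samples):
    return min(1.0, sum(abs(s) for s in samples) / len(samples))

def toy_halve(samples):
    return [0.5 * s for s in samples]

def toy_round(samples):
    return [round(s, 1) for s in samples]

blended = codec_tta_score([0.8, -0.4, 0.6, -0.2], toy_score, [toy_halve, toy_round])
```

Averaging over formats trades extra inference cost for a verdict that is less sensitive to how a given file happened to be encoded.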


6. Deployment: Bringing codec-TTA to Production

Carrying the research gains into the production service had traps of its own. The production endpoint had diverged from the main code line: the April image with the batch API and the June code with the latest detection improvements were separate branches. Simply swapping the tag would break the batch API.

The solution was to overlay the June code on top of the April batch handler, which preserved the batch API while bringing in the latest CNN, codec-TTA, and the loudness-weighted verdict. During deployment verification, we also caught and fixed a GPU-only runtime bug (a tensor-handling error that the CPU tests had missed).

As a result, production was updated to the latest detection performance while maintaining batch API compatibility.


7. A Side Branch: A Dedicated de-artifact Network (RVQ Ghosting Audio Restoration)

Separately from detection, we pursued research that uses the same residual physics in the opposite direction — a tool that removes RVQ ghosting from AI recordings to improve audio quality. (To be clear, the goal is not detection evasion but audio quality improvement. The actual motivation was user feedback that the ghosting in the stem multitracks Suno provides is severe.)

The core model is ComplexArtifactUNet. It reuses the backbone of the detection ArtifactUNet but handles phase as well, with complex (real/imaginary) input-output plus a complex ratio mask.
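
Applying a complex ratio mask is an elementwise complex multiplication between the mask and the input spectrogram, which lets the mask change phase as well as magnitude. The sketch below assumes a `(2, F, T)` layout with the real part in channel 0 and the imaginary part in channel 1; the actual tensor layout of ComplexArtifactUNet is not described here.

```python
import numpy as np

def apply_complex_ratio_mask(spec_ri, mask_ri):
    """Multiply spectrogram by mask as complex numbers.

    Both inputs have shape (2, F, T): channel 0 real, channel 1 imaginary.
    """
    xr, xi = spec_ri
    mr, mi = mask_ri
    out_r = mr * xr - mi * xi       # real part of (mr + j*mi) * (xr + j*xi)
    out_i = mr * xi + mi * xr       # imaginary part
    return np.stack([out_r, out_i])

rng = np.random.default_rng(0)
spec = rng.normal(size=(2, 5, 4))   # random stand-in for an STFT
mask = rng.normal(size=(2, 5, 4))   # random stand-in for a predicted mask
out = apply_complex_ratio_mask(spec, mask)
```

A purely real mask (imaginary channel set to zero) reduces this to ordinary magnitude masking with the input phase left untouched.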

Two technically interesting points:

Another lesson: audio quality was driven more by DSP post-processing than by the model. The improvement margin of the discriminative neural network plateaued at a certain level, and the actual perceptible gains came when we combined proven signal processing such as resonance suppression (above 2.5 kHz) and automatic high-frequency correction ("much improved," per listening evaluation).


8. The Twist: Even the Weakness Was an Illusion

We return to the suspicion we left hanging in Section 3. If P3's structural fix failed because the diagnosis was wrong, then we had to re-measure the udio "weakness" itself, leak-free.

The problem was the data. The existing udio eval set overlapped entirely with the training pool (leakage). So, using our in-house crawling infrastructure, we freshly collected 454 new udio songs, cross-checked them against the training database by per-song ID, and excluded 113 leaked songs to build a 228-song leak-free held-out set.

| Model | udio fresh held-out detection rate | Misses |
| --- | --- | --- |
| v9.5 | 97.8% | 5 / 228 |
| Next-gen candidate | 95.6% | 10 / 228 |
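
The detection rates in the table follow directly from the miss counts. A short check (the function name is illustrative):

```python
def detection_rate(misses, total):
    """Percentage of AI tracks detected, given the number of misses."""
    return 100.0 * (total - misses) / total

v95 = detection_rate(5, 228)        # 223 of 228 detected
next_gen = detection_rate(10, 228)  # 218 of 228 detected
```

Rounded to one decimal place, these come to 97.8 and 95.6, matching the table.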

The udio weakness never existed in the first place. On a true leak-free held-out set, v9.5 detects udio well at 97.8%. The previous "85–91%" was a measurement artifact created by a particular leaked sample.

The implication is weighty. The three experiments we wrestled with all through June (P1, P2, P3) were attempts to solve a problem that didn't exist. The alternative hypothesis of "not enough data" was rejected along with it, since just 165 training songs yielded 97.8% on the fresh set.


9. Lessons and What's Next

Leaked evaluation distorts both strengths and weaknesses. It inflates strengths (validation F1 of 0.993) and fabricates weaknesses (udio 85%). The most expensive lesson we earned is methodological — before claiming a new model or a weakness, always re-measure on a leak-free held-out set.

Negative results are not to be thrown away. The three failures proved that the model learns the general properties of RVQ rather than decoder fingerprints, and ultimately led us to the more fundamental problem of evaluation leakage.

The next direction is now clear. The real weakness is not udio but false positives on genuine recordings (hard-real FPR of 15.57%). SoundCloud amateur uploads, FMA, lo-fi — cases of human-made music mistaken for AI. The next quarter focuses on intensively collecting this hard-real distribution to reduce false positives.

SOTA wasn't the end; it was the beginning of having the room to look at what we'd been measuring wrong.


ArtifactNet Research Team · June 2026
