Relying entirely on AI to extract clean vocals is still not realistic with current technology; in most cases, manual correction is needed. However, with enough source data, you can always scrape together enough AI-extracted vocals that are suitable for training.
This article discusses a workflow for filtering training audio using only AI, from the perspective of an outsider. Everything used here is open-source.
Training vs. Inference Audio
My recent research on AI cover audio can be summarized as: “Training is picky about audio quality, inference is picky about pitch.”
In other words, we should treat training audio and inference audio differently.
For inference, the focus is on pitch. It’s fine as long as you handle the harmonies well; whether there is auto-tune/robotic sound is secondary.
But for training, you need almost perfect audio.
Audio Quality
First, try to filter for high-quality audio. Lossless files are best, but choosing audio that is easy to process (where background noise is easy to clean up) might actually yield better results.
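To check actual audio quality, you can inspect the spectrogram with alexkay/spek.
Note: Converting lossy audio to a lossless format won’t restore what the encoder threw away. You can verify this yourself in Spek: a file that came from a lossy source typically still shows a sharp cutoff in the high frequencies.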
Pitch
Training audio should ideally cover a wide pitch range. Generally, for normal speaking:
- Male: 85Hz - 180Hz
- Female: 165Hz - 255Hz
For singing, it’s theoretically recommended to cover E2 (82Hz) to C5 (523Hz). Of course, if you need falsetto, you can push the upper limit to 800Hz.
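To double-check note names against the frequencies quoted above, here is a minimal sketch of the standard equal-temperament conversion. It assumes A4 = 440 Hz tuning; the function names are my own and not taken from any tool mentioned in this article.

```python
import math

# Semitone offsets of the natural notes from C within one octave.
_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_to_hz(note, a4=440.0):
    """Convert a note name such as 'E2' or 'F#4' to Hz (equal temperament)."""
    offset = _SEMITONES[note[0].upper()]
    rest = note[1:]
    if rest.startswith("#"):
        offset, rest = offset + 1, rest[1:]
    elif rest.startswith("b"):
        offset, rest = offset - 1, rest[1:]
    midi = 12 * (int(rest) + 1) + offset   # MIDI note number, where C4 = 60
    return a4 * 2 ** ((midi - 69) / 12)    # A4 is MIDI note 69


def hz_to_note(freq, a4=440.0):
    """Return the name of the note closest to a frequency in Hz."""
    midi = round(69 + 12 * math.log2(freq / a4))
    return f"{_NAMES[midi % 12]}{midi // 12 - 1}"


print(f"E2 = {note_to_hz('E2'):.2f} Hz, C5 = {note_to_hz('C5'):.2f} Hz")
print(f"800 Hz is closest to {hz_to_note(800)}")
```

This gives a quick way to translate the Hz values reported by a pitch tracker into the note range a singer actually covers.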
If possible, include sustained sounds like “ah~” or “oo~”, which give the model the most continuous F0 contours to learn from.
Most mainstream AI cover models currently use the RMVPE algorithm for pitch extraction, which in theory works better on singing. For a quick reference check of the pitch range, however, Praat is enough. First, install the dependencies:
| |
Then create a pitch.py file and add the following code:
| |
This script reads the pitch of all audio files in the audio folder of the current directory, outputs a summary to the terminal, and plots a pitch distribution graph in the same directory.
Place all your audio files in the audio folder, then run the command to check the pitch:
| |
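For reference, here is a minimal sketch of what a pitch.py along these lines could look like. It is an assumption-laden reconstruction, not the exact script: it assumes the praat-parselmouth and matplotlib packages are installed, it only looks at .wav and .flac files, and the output file name pitch_distribution.png is a placeholder of mine.

```python
import statistics
from pathlib import Path

AUDIO_DIR = Path("audio")                    # folder name taken from the text
EXTENSIONS = {".wav", ".flac"}               # assumption: extend to match your files
PLOT_PATH = Path("pitch_distribution.png")   # hypothetical output name


def summarize(f0_values):
    """Return min/median/max of the voiced frames (Praat marks unvoiced frames as 0 Hz)."""
    voiced = [v for v in f0_values if v > 0]
    if not voiced:
        return None
    return {"min": min(voiced), "median": statistics.median(voiced), "max": max(voiced)}


def extract_f0(path):
    """Return the F0 track of one file in Hz, one value per analysis frame."""
    import parselmouth  # third-party, from the praat-parselmouth package

    # Ceiling raised from Praat's 600 Hz default so falsetto up to ~800 Hz is not cut off.
    pitch = parselmouth.Sound(str(path)).to_pitch(pitch_floor=75.0, pitch_ceiling=800.0)
    return pitch.selected_array["frequency"].tolist()


def main():
    if not AUDIO_DIR.is_dir():
        print(f"No '{AUDIO_DIR}' folder found; nothing to analyze.")
        return

    all_f0 = []
    for path in sorted(AUDIO_DIR.iterdir()):
        if path.suffix.lower() not in EXTENSIONS:
            continue
        f0 = extract_f0(path)
        stats = summarize(f0)
        if stats is None:
            print(f"{path.name}: no voiced frames detected")
            continue
        print(f"{path.name}: {stats['min']:.0f}-{stats['max']:.0f} Hz "
              f"(median {stats['median']:.0f} Hz)")
        all_f0.extend(v for v in f0 if v > 0)

    overall = summarize(all_f0)
    if overall is None:
        return
    print(f"Overall: {overall['min']:.0f}-{overall['max']:.0f} Hz "
          f"(median {overall['median']:.0f} Hz)")

    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    plt.hist(all_f0, bins=100)
    plt.xlabel("F0 (Hz)")
    plt.ylabel("Voiced frames")
    plt.title("Pitch distribution")
    plt.savefig(PLOT_PATH)


if __name__ == "__main__":
    main()
```

Praat's tracker can produce octave errors on noisy or polyphonic audio, so treat isolated extreme values in the summary with some suspicion.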
Vocal Extraction
Using nomadkaraoke/python-audio-separator allows you to run quite a few models.
Install the GPU version:
| |
List available models:
| |
Currently, MelBand Roformer models usually yield the best results. In audio-separator, these models use the MDXC architecture. Here are some common parameters you can tweak for this architecture:
- --mdxc_segment_size=512: Segment size. Larger values improve the model’s context understanding, theoretically leading to better results, at the cost of more resources.
- --mdxc_override_model_segment_size: Forcefully overrides the model’s default segment size with the value above.
- --mdxc_overlap=8: Overlap between prediction windows (range 2-50). Theoretically, higher values produce smoother transitions but run slower.
- --mdxc_batch_size=4: Batch size, i.e. how many segments are processed at once. Adjust based on your VRAM.
- --mdxc_pitch_shift=0: Pitch shifting. Usually keep it at 0.
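If you process many files, it can help to assemble these options in a script. The sketch below only builds the command line. The --mdxc_* flags are the ones listed above; the --model_filename flag is how I understand the audio-separator CLI selects a model, and song.flac and your_model.ckpt are hypothetical placeholders.

```python
import shlex


def build_separator_command(audio_path, model_filename, *, segment_size=512,
                            override_segment_size=True, overlap=8,
                            batch_size=4, pitch_shift=0):
    """Build an audio-separator command line using the MDXC options listed above."""
    if not 2 <= overlap <= 50:
        raise ValueError("mdxc_overlap must be between 2 and 50")
    cmd = [
        "audio-separator", str(audio_path),
        "--model_filename", model_filename,
        f"--mdxc_segment_size={segment_size}",
        f"--mdxc_overlap={overlap}",
        f"--mdxc_batch_size={batch_size}",
        f"--mdxc_pitch_shift={pitch_shift}",
    ]
    if override_segment_size:
        # Boolean switch: without it, the model's own segment size is used.
        cmd.append("--mdxc_override_model_segment_size")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_separator_command("song.flac", "your_model.ckpt")
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). If VRAM runs out, lowering batch_size first and segment_size second is a reasonable order to try.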
Also, when dealing with very long audio, chunking it is often faster:
| |
So, how do we choose the right model? Referencing AliceNavigator/Music-Source-Separation-Training-GUI, we can categorize models into four types: instrumental removal (vocal extraction), harmony removal, de-reverb, and others (like denoise, de-breath, etc.).
For specific model selection, look at the SDR (Signal-to-Distortion Ratio) value. Generally, the higher the better. Here are some recommended models:
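To give a feel for what an SDR number means, here is a minimal sketch of the basic definition: the energy of the reference signal divided by the energy of the error, in decibels. Benchmarks may use refined variants, so treat this as an illustration rather than the exact metric behind any published score.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf        # perfect reconstruction
    if signal_energy == 0:
        return -math.inf       # silent reference, nothing to recover
    return 10 * math.log10(signal_energy / error_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]      # every sample is off by 10%
print(f"SDR: {sdr_db(reference, estimate):.1f} dB")
```

Because the scale is logarithmic, each additional 10 dB means the error energy is ten times smaller relative to the signal.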
Vocal Extraction (Instrumental Removal)
For extracting vocals, Roformer Model: MelBand Roformer Kim | FT 3 by unwa is generally a solid choice. Usage:
| |
The output file with the vocals tag is the extracted vocals, which you can pass to the next step.
Harmony Removal
Some songs have multiple vocalists. In this case, you need a harmony removal model to isolate the lead vocal. Typically, you’d use the karaoke model series, such as Roformer Model: MelBand Roformer | Karaoke V2 by Gabox. Usage:
| |
Once processed, the file with the vocals tag is the lead vocal, ready for the next step.
If you have a male/female duet, you can try the Roformer Model: BS Roformer | Chorus Male-Female by Sucial model (file name: model_chorus_bs_roformer_ep_267_sdr_24.1275.ckpt).
De-Reverb
If you’re making a cover, you can use Roformer Model: MelBand Roformer | De-Reverb by anvuew.
If you’re training a model, I highly recommend the mono version: Roformer Model: MelBand Roformer | De-Reverb Mono by anvuew.
This is because current AI models train on mono audio. If you feed them stereo, the phase difference between channels might introduce artifacts/noise during training.
| |
The output with the noreverb tag is your de-reverbed file.
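One way to see why stereo phase differences matter for mono training is to look at what a naive downmix does. The sketch below assumes the stereo file is reduced to mono by averaging the two channels, which is my assumption about a typical preprocessing step and not a statement about any specific trainer.

```python
import math


def downmix_to_mono(left, right):
    """Average two channels into one, as a naive stereo-to-mono downmix."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


sample_rate = 16000
angles = [2 * math.pi * 220 * n / sample_rate for n in range(sample_rate)]
tone = [math.sin(a) for a in angles]

in_phase = downmix_to_mono(tone, tone)                            # identical channels
quadrature = downmix_to_mono(tone, [math.cos(a) for a in angles])  # 90 degrees apart
inverted = downmix_to_mono(tone, [-s for s in tone])              # 180 degrees apart

for label, mono in (("in phase", in_phase), ("90 degrees", quadrature), ("inverted", inverted)):
    print(f"{label:>10}: RMS {rms(mono):.3f}")
```

The further apart the channels are in phase, the more of the signal cancels in the downmix. Real recordings sit somewhere in between, with the cancellation varying by frequency, which is one plausible source of the artifacts mentioned above.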
Other Models
For example, noise reduction: Use Roformer Model: Mel-Roformer-Denoise-Aufr33 to strip out mic hiss and ambient background noise.
| |
There is also a model to remove breath/aspiration noise: Roformer Model: MelBand Roformer | Aspiration by Sucial (file name: aspiration_mel_band_roformer_sdr_18.9845.ckpt).
Other Architectures
For noise reduction, the DeepFilterNet3 model works quite well on non-music audio. The easiest way to use it is via Shuichi346/DeepFilterNet3-VST3. This is a DAW plugin, so you’ll need a DAW installed. Also, the author only provides a macOS build, so you’ll have to compile it yourself on other platforms.
For the DAW, REAPER is a solid choice. Its evaluation version is fully functional and keeps working after the 60-day evaluation period ends, so you can try everything here without buying a license up front.
To compile the VST3 plugin, you need Rust. Download Rustup-init, run it, and press 1 to keep the defaults. It will ask to download Visual Studio during the setup; just go with the defaults and download everything.
Once installed, clone the repo and cd into the directory:
| |
Start compiling:
| |
Once compiled, dig through the nested folders in the target directory to find deepfilter-vst.vst3, and copy it to C:\Program Files\Common Files\VST3.
Open REAPER, drag in an audio file, click the FX button on the left, search for the plugin, and add it.
Personally, I didn’t find the results all that great—it aggressively filters out too much of the actual voice—but feel free to give it a shot.
Audio Normalization
For model training, the peak volume is best kept between -6dB and -3dB. Anything higher leaves too little headroom and might cause clipping.
Since anything below -40dB will be stripped out by the “slicing” step later, you can normalize the peak volume to -3dB at this stage.
Install the dependencies:
| |
The following command processes all .wav files in the current directory and saves the normalized files to the normalized folder. Make sure to create the normalized folder first.
| |
Command options breakdown:
- -nt peak: Normalization Type. Here we specify peak normalization.
- -t -3: Target value set to -3dB.
- -ext wav: Output format set to wav.
- -o: Output directory.
This ensures all training audio peaks at exactly -3dB.
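To make the arithmetic behind peak normalization concrete, here is a minimal sketch. It assumes samples are floats in the range -1.0 to 1.0, where 0 dB is full scale; the function names are my own and this is not how any particular tool implements it.

```python
import math


def peak_db(samples):
    """Peak level in dB relative to full scale (1.0 corresponds to 0 dB)."""
    peak = max(abs(s) for s in samples)
    return -math.inf if peak == 0 else 20 * math.log10(peak)


def normalize_peak(samples, target_db=-3.0):
    """Scale samples by one constant gain so the loudest sample lands on target_db."""
    current = peak_db(samples)
    if current == -math.inf:
        return list(samples)   # pure silence: nothing to scale
    gain = 10 ** ((target_db - current) / 20)
    return [s * gain for s in samples]


samples = [0.1, -0.5, 0.25]
normalized = normalize_peak(samples)
print(f"before: {peak_db(samples):.2f} dB, after: {peak_db(normalized):.2f} dB")
```

Peak normalization applies a single gain to the whole file, so it changes the level but not the dynamics. A file with one loud transient will therefore still sound quieter overall than a file without one.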
Slicing
Use flutydeer/audio-slicer to automatically split the audio, then pick the segments that sound natural.
Filter out the best-sounding clips for training. Make sure each clip is at least 2 seconds long, preferably over 4 seconds.
A total duration of 10-30 minutes is usually enough; don’t go over 2 hours. After sorting, you can check the total duration of all .wav files using this command in PowerShell:
| |
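If you would rather do this check in Python, here is a minimal sketch using only the standard library. It sums the duration of every .wav file in a folder and lists the clips shorter than the 2-second minimum mentioned above. Note that the standard wave module only reads PCM files, so it will raise an error on 32-bit float WAVs.

```python
import wave
from pathlib import Path


def wav_duration(path):
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def summarize_clips(folder, min_seconds=2.0):
    """Return (total duration in seconds, names of clips shorter than min_seconds)."""
    durations = {p.name: wav_duration(p) for p in sorted(Path(folder).glob("*.wav"))}
    too_short = [name for name, seconds in durations.items() if seconds < min_seconds]
    return sum(durations.values()), too_short
```

For example, total, too_short = summarize_clips(".") followed by print(f"{total / 60:.1f} minutes") tells you whether you are inside the 10-30 minute range, and too_short lists the clips to drop or re-slice.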
Filtering (Dataset Cleanup)
From the sliced audio, weed out any robotic/glitched sounds and keep only normal voice clips. Using foobar2000 is great for this because it lets you play back audio without any system-level enhancement or reverb. What you hear is exactly what the model will “hear.”
Download, install, and open it. For the layout, pick one with a large playlist area, like Slim View + Tabs.
- Select Output Device: Press Ctrl+P to open preferences. Go to Playback -> Output, and for Device, choose one with exclusive mode.
- Set a Hotkey for Quick Deletion: In Keyboard Shortcuts, add a new shortcut. Search for delete under Action, select [context] -> File Operations -> Delete file(s), and assign a hotkey like Ctrl+D.
Now, just drag all your audio files into the playlist and start auditing.
Final Thoughts
Looking at what AI can do right now, there’s probably no need to panic about losing your job just yet. At least in the audio domain, it’s still very much in the “utility tool” phase. It boosts productivity, but it still won’t let a complete amateur like me produce a flawless, professional piece of work out of the box.
While AI will keep improving, people with experience in traditional tools still hold irreplaceable value.