Getting Audio
For training, real lossless audio is highly recommended (specifically pcm_f32le WAV files). It’s best to rip them from physical albums. You can refer to this guide (https://www.bilibili.com/opus/925630344961458181), but skip the FLAC conversion and export directly to WAV.
This guide uses Applio (based on the RVC architecture). You’ll need about 10 to 60 minutes of clean vocals (acapella) to get good training results. Don’t exceed 2 hours. Remember, quality over quantity: Garbage In, Garbage Out!
YouTube
You can use yt-dlp to download audio from YouTube. Note that YouTube streams are lossy-encoded, so the downloaded audio is not true lossless even after conversion to WAV.
GitHub: yt-dlp/yt-dlp
Download YouTube audio in the highest quality:
| |
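A minimal sketch of how such a download command could be assembled from Python, assuming yt-dlp and ffmpeg are on PATH. The flag selection, the output template, and the helper name `build_ytdlp_audio_cmd` are assumptions for illustration, not necessarily the exact command used in this guide.

```python
import shlex


def build_ytdlp_audio_cmd(url, out_template="%(title)s.%(ext)s"):
    """Return an argv list that downloads the best audio stream and converts it to WAV."""
    return [
        "yt-dlp",
        "-f", "bestaudio",         # pick the best available audio-only stream
        "-x",                      # extract audio (needs ffmpeg for conversion)
        "--audio-format", "wav",   # convert the extracted audio to WAV
        "-o", out_template,        # output filename template
        url,
    ]


# VIDEO_ID is a placeholder, not a real video.
cmd = build_ytdlp_audio_cmd("https://www.youtube.com/watch?v=VIDEO_ID")
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`.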
For longer audio (over 10 mins), you can split it using ffmpeg:
| |
This splits the audio every 10 minutes. Change -segment_time 600 to adjust the duration.
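As a rough illustration, the split could be built like this with ffmpeg's segment muxer. The input name, the output pattern `part_%03d.wav`, and the helper name are hypothetical.

```python
def build_split_cmd(src, segment_seconds=600, pattern="part_%03d.wav"):
    """Return an ffmpeg argv list that cuts src into fixed-length pieces."""
    return [
        "ffmpeg",
        "-i", src,
        "-f", "segment",                        # use the segment muxer
        "-segment_time", str(segment_seconds),  # length of each piece in seconds
        "-c", "copy",                           # copy samples without re-encoding
        pattern,                                # part_000.wav, part_001.wav, ...
    ]


split_cmd = build_split_cmd("long_audio.wav")
print(" ".join(split_cmd))
```

Passing `segment_seconds=300` would give 5-minute pieces instead.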
To merge them back into a single file, create a filelist.txt listing all segments:
| |
Then run this command to merge:
| |
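The two merge steps can be sketched together as follows, assuming ffmpeg's concat demuxer. The segment names and helper names are hypothetical, and filenames containing single quotes would need extra escaping that this sketch does not handle.

```python
from pathlib import Path


def filelist_lines(segments):
    """One "file '<name>'" line per segment, the format the concat demuxer reads."""
    return [f"file '{name}'" for name in segments]


def write_filelist(segments, list_path="filelist.txt"):
    """Write the list file to disk and return its path."""
    path = Path(list_path)
    path.write_text("\n".join(filelist_lines(segments)) + "\n", encoding="utf-8")
    return path


def build_concat_cmd(list_path, dst):
    """Return an ffmpeg argv list that joins the listed files without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # concat demuxer
        "-safe", "0",     # allow paths the demuxer would otherwise reject
        "-i", str(list_path),
        "-c", "copy",
        dst,
    ]


lines = filelist_lines(["part_000.wav", "part_001.wav"])
print("\n".join(lines))
print(" ".join(build_concat_cmd("filelist.txt", "merged.wav")))
```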
To extract a specific segment (e.g., a 15-minute clip starting at the 10-minute mark):
| |
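A small sketch of the clip extraction, assuming the common `-ss` (start) and `-t` (duration) options. File names and helper names are hypothetical.

```python
def to_timestamp(seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_clip_cmd(src, start_s, duration_s, dst):
    """Return an ffmpeg argv list that copies duration_s seconds starting at start_s."""
    return [
        "ffmpeg",
        "-ss", to_timestamp(start_s),    # seek to the start position
        "-i", src,
        "-t", to_timestamp(duration_s),  # keep this much audio
        "-c", "copy",
        dst,
    ]


# A 15-minute clip starting at the 10-minute mark.
clip_cmd = build_clip_cmd("input.wav", 10 * 60, 15 * 60, "clip.wav")
print(" ".join(clip_cmd))
```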
Other Sources
For FLAC, MP3, or other formats, convert them to WAV for inference using ffmpeg:
| |
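One way to sketch the conversion, assuming the pcm_f32le sample format recommended at the top of this guide. The helper name and file names are hypothetical.

```python
from pathlib import Path


def build_wav_cmd(src, codec="pcm_f32le"):
    """Return an ffmpeg argv list that converts src to a WAV file next to it."""
    dst = str(Path(src).with_suffix(".wav"))  # same name, .wav extension
    return ["ffmpeg", "-i", src, "-c:a", codec, dst]


for name in ("song.flac", "song2.mp3"):
    print(" ".join(build_wav_cmd(name)))
```

If the source already ends in .wav, give the output a different name so ffmpeg does not read and write the same file.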
Even if you convert lossy files to WAV, it doesn’t magically make them lossless. Getting WAVs ripped directly from physical albums is always best.
Vocal Extraction (Acapella)
Setting Up the Environment
First, install Miniconda. Create and activate a new environment:
| |
Install PyTorch:
| |
Install openmirlab/bs-roformer-infer:
| |
Install nomadkaraoke/python-audio-separator:
| |
Downloading Models
Run these commands to list available models:
| |
As of 2026-05-21, the output of bs-roformer-download is:
| |
Here is a quick breakdown of what these models do:
- BS Roformer SW by jarredou: Separates audio into 7 tracks (bass, drums, guitar, instrumental, other, piano, vocals).
- BS Roformer | Chorus Male-Female by Sucial: Separates male vocals, female vocals, and reverb.
- BS Roformer | Instrumental Resurrection by unwa: Instrumental HD reconstruction, great for restoring old backing tracks.
- BS Roformer | Male-Female by aufr33: Separates male and female vocals.
- BS Roformer | Vocals Resurrection by unwa: Vocal HD reconstruction (aggressive version of Revive) to restore old vocals.
- BS Roformer | Vocals Revive Series: Vocal restoration, kind of like image upscaling but for audio.
- BS Roformer | Vocals by Gabox: Standard vocal isolation. Unlike the 7-track SW model, this only separates instrumental and vocals.
- BS-Roformer-De-Reverb: Dereverb.
Download the models you need. We’ll grab two here, but since one of the built-in links is dead, we’ll download it manually.
| |
You can find the dereverb model on Hugging Face: anvuew/dereverb_bs_roformer
Download the model and its .yaml config file, then place them in the correct folder. It’s recommended to rename the .yaml file to match the model filename.
Audio Processing
We’ll do this in three steps: first, extract vocals using BS Roformer SW by jarredou, then isolate the main vocals from backing vocals using Roformer Model: MelBand Roformer | Karaoke V2 by Gabox, and finally strip reverb using Roformer Model: BS-Roformer-De-Reverb to get the final dry vocals (acapella).
You can experiment with swapping steps 2 and 3. Different tracks might require a different order for the best results.
First, create directories for the inputs and outputs of each step:
| |
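A sketch of the folder layout in Python. The input folder names (step1_input, step2_input, step3_input) come from this guide; the matching `stepN_output` names are an assumption.

```python
import tempfile
from pathlib import Path

# Input names follow the guide; the *_output names are assumed for illustration.
STEP_DIRS = [f"step{n}_{kind}" for n in (1, 2, 3) for kind in ("input", "output")]


def create_step_dirs(root):
    """Create all step folders under root and return their paths."""
    created = []
    for name in STEP_DIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory; pass your own project folder instead.
with tempfile.TemporaryDirectory() as tmp:
    made = create_step_dirs(tmp)
    print([p.name for p in made])
```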
Put your music in step1_input (English filenames recommended). Make sure they are in lossless WAV format.
<1> Multi-track Splitting
Use BS Roformer SW by jarredou to split the tracks. Edit the model’s .yaml file to add these settings (located in models/roformer-model-bs-roformer-sw-by-jarredou):
| |
Adjust these settings based on your hardware:
- batch_size: Scale this based on your VRAM. For 16GB VRAM, use 16.
- dim_t: Do not touch. This is the model’s trained time-domain dimension.
- chunk_size: This is missing from the downloaded config. Add it manually and set it to 352768.
- num_overlap: Set between 2 and 10 depending on your VRAM.
- normalize: Keep it false.
Start inference:
| |
Once done, copy the file ending with _vocals.wav into the step 2 input folder step2_input.
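The copy step can be scripted. This is a minimal sketch that assumes the separated stems land in a folder called step1_output (a hypothetical name) and that only files ending in _vocals.wav should move on.

```python
import shutil
import tempfile
from pathlib import Path


def copy_vocal_stems(src_dir, dst_dir, suffix="_vocals.wav"):
    """Copy every file ending in suffix from src_dir to dst_dir; return the names copied."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for stem in sorted(Path(src_dir).glob(f"*{suffix}")):
        shutil.copy2(stem, dst / stem.name)
        copied.append(stem.name)
    return copied


# Demo with empty dummy files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "step1_output"
    out_dir.mkdir()
    for name in ("song_vocals.wav", "song_drums.wav"):
        (out_dir / name).write_bytes(b"")
    copied = copy_vocal_stems(out_dir, Path(tmp) / "step2_input")
    print(copied)
```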
(Optional) Mixing with MSST for Better Results
Usually, using a single model is enough. But if you want the absolute best quality, you can run the audio through MSST-BSRNN to get vocals_msst.wav, then merge it with Roformer’s vocals_roformer.wav using ffmpeg:
| |
Parameter breakdown:
- normalize=0: Disables amix’s automatic level scaling of the inputs, which prevents sudden volume or detail drops.
- -c:a pcm_s16le: Writes uncompressed 16-bit PCM, so no lossy encoding is applied.
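A sketch of how the merge command might be assembled, using only the two options described above plus `inputs=2`. The output name and helper name are hypothetical.

```python
def build_blend_cmd(a="vocals_msst.wav", b="vocals_roformer.wav", dst="vocals_merged.wav"):
    """Return an ffmpeg argv list that mixes two vocal stems into one file."""
    return [
        "ffmpeg",
        "-i", a,
        "-i", b,
        # normalize=0 sums the inputs as-is; check the result for clipping.
        "-filter_complex", "amix=inputs=2:normalize=0",
        "-c:a", "pcm_s16le",  # 16-bit PCM output
        dst,
    ]


blend_cmd = build_blend_cmd()
print(" ".join(blend_cmd))
```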
<2> Vocal Purification
Since the backing vocal extraction model uses a different architecture, we’ll use audio-separator. It will download the model automatically on the first run:
| |
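A hedged sketch of an audio-separator call built from Python. `MODEL_FILENAME` is a placeholder, not the real checkpoint name (running `audio-separator --list_models` should show the available filenames), and the step2_output folder name is an assumption.

```python
MODEL_FILENAME = "REPLACE_WITH_KARAOKE_MODEL.ckpt"  # placeholder, look up the real name


def build_separator_cmd(src, model=MODEL_FILENAME, out_dir="step2_output"):
    """Return an audio-separator argv list for one input file."""
    return [
        "audio-separator", src,
        "--model_filename", model,   # which model to load (downloaded on first use)
        "--output_dir", out_dir,     # where the separated stems are written
        "--output_format", "WAV",    # keep the stems as WAV
    ]


sep_cmd = build_separator_cmd("step2_input/song_vocals.wav")
print(" ".join(sep_cmd))
```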
After processing, Vocals contains the lead singer, and Instrumental contains the backing vocals. Move the Vocals file to the step 3 input folder step3_input.
<3> Room De-reverb
Use BS-Roformer-De-Reverb to remove room reverb from the vocals. You can adjust the .yaml config:
| |
Do not change dim_t. Adjust the other two options based on your VRAM.
Start inference:
| |
The file ending with _noreverb.wav is your final dry vocal track.
(Optional) Separating Male and Female Vocals
For male-female duets, you can split the voices first using BS Roformer | Chorus Male-Female by Sucial (download it from Sucial/Chorus_Male_Female_BS_Roformer):
| |
Voice Conversion
This section uses Applio (RVC architecture) for voice conversion.
Project URL: IAHispano/Applio
Installation
Super easy to set up. Clone the repository:
| |
Go to the root directory and run run-install.bat to install.
Once installed, launch it by running run-applio.bat.
Training a Model
Go to the “Training” tab. Under “Model Settings”, enter a new “Model Name”.
In the “Preprocess” section, create a dataset and upload your clean dry vocals. Click “Preprocess Dataset” once uploaded.
Keep defaults in the “Extract” section and click “Extract Features”.
In the “Training” section, adjust “Batch Size” based on your VRAM:
- 8GB: 4 or 8
- 12-16GB: 12-16
- 24GB+: 24-32
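The table above can be expressed as a small lookup. How to treat cards below 8 GB or between 16 and 24 GB is not stated in the guide, so this sketch assumes the lower tier applies until the next threshold is reached and returns None below 8 GB.

```python
def suggest_batch_size(vram_gb):
    """Return the suggested (low, high) batch size for a given amount of VRAM in GB."""
    if vram_gb >= 24:
        return (24, 32)
    if vram_gb >= 12:
        return (12, 16)  # assumed to also cover 16-24 GB cards
    if vram_gb >= 8:
        return (4, 8)    # the guide lists 4 or 8 for 8 GB
    return None          # below 8 GB: not covered by the guide


for gb in (8, 16, 24):
    print(gb, suggest_batch_size(gb))
```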
Keep “Save Every Epoch” at the default 10.
Set “Total Epoch” to 200-300. This range usually yields the best results (around 220-250 epochs is often the sweet spot).
Agree to the terms, click “Start Training”, and then click “Generate Index”.
Inference
Once trained, go to the “Inference” tab. Choose your model. Start testing with the 200-250 epoch checkpoints, then adjust up or down.
Advanced Settings breakdown:
- Split Audio: Enable for long audio to prevent VRAM overflow. For regular 3-minute songs, leaving it disabled might yield better results.
- Autotune: Turn on for singing, keep off for spoken monologues.
- Clean Audio: Enable if needed.
Faders below:
- Pitch: Male-to-female set to +12, female-to-male set to -12, same-gender set to 0.
- Search Feature Ratio: 0.7-0.8 for singing, 0.6-0.7 for podcasts and spoken audio.
- Protect Voiceless Consonants: Around 0.33 to 0.5 for singing, 0.5 for others.
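To see why ±12 is the usual pitch value: 12 semitones is one octave, which doubles or halves the frequency. A short sketch, with hypothetical helper names:

```python
PITCH_BY_DIRECTION = {("male", "female"): 12, ("female", "male"): -12}


def suggested_pitch(source, target):
    """Pitch setting in semitones for a source voice converted to a target voice."""
    return 0 if source == target else PITCH_BY_DIRECTION[(source, target)]


def semitone_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2 ** (semitones / 12)


print(suggested_pitch("male", "female"), semitone_ratio(12))   # 12 2.0
print(suggested_pitch("female", "male"), semitone_ratio(-12))  # -12 0.5
```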
Agree to the terms and click “Convert”.
(Optional) Stereo Audio Inference
Applio only outputs mono audio, so processing stereo inputs directly sounds weird. To fix this, split the channels first, infer them separately, and then merge.
Split channels using ffmpeg:
| |
After running inference on both files, merge them back into stereo:
| |
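A sketch of both halves of this workflow, assuming ffmpeg's channelsplit filter for the split and the join filter for the merge. File names and helper names are hypothetical.

```python
def build_channel_split_cmd(src, left="left.wav", right="right.wav"):
    """Return an ffmpeg argv list that writes each stereo channel to its own mono file."""
    graph = "channelsplit=channel_layout=stereo[left][right]"
    return [
        "ffmpeg", "-i", src,
        "-filter_complex", graph,
        "-map", "[left]", left,
        "-map", "[right]", right,
    ]


def build_channel_merge_cmd(left, right, dst="stereo.wav"):
    """Return an ffmpeg argv list that joins two mono files into one stereo file."""
    graph = "[0:a][1:a]join=inputs=2:channel_layout=stereo[out]"
    return [
        "ffmpeg", "-i", left, "-i", right,
        "-filter_complex", graph,
        "-map", "[out]", dst,
    ]


split_stereo = build_channel_split_cmd("input_stereo.wav")
merge_stereo = build_channel_merge_cmd("left_converted.wav", "right_converted.wav")
print(" ".join(split_stereo))
print(" ".join(merge_stereo))
```

The first input becomes the left channel and the second the right, so keep the order consistent after inference.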
Troubleshooting: Port Conflict
If you get a port conflict error, open app.py in the root directory and change DEFAULT_PORT = 6969 to another number.
Avoid Windows reserved ports. You can list them in PowerShell using:
| |
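One common way to list reserved ranges on Windows is `netsh interface ipv4 show excludedportrange protocol=tcp`; whether that is the exact command meant here is an assumption. The sketch below parses that style of output to check a candidate port. The sample text is made up for illustration, not captured from a real machine.

```python
import re

# Invented sample in the shape of the netsh output.
SAMPLE_OUTPUT = """
Protocol tcp Port Exclusion Ranges

Start Port    End Port
----------    --------
      5357        5357
     50000       50059     *
"""


def parse_excluded_ranges(text):
    """Return (start, end) tuples for every line that begins with two integers."""
    ranges = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(\d+)", line)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def is_port_excluded(port, ranges):
    """True if port falls inside any reserved range."""
    return any(start <= port <= end for start, end in ranges)


ranges = parse_excluded_ranges(SAMPLE_OUTPUT)
print(ranges)
print(is_port_excluded(6969, ranges), is_port_excluded(50010, ranges))
```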
Mixing
Merge your converted dry vocals with the backing track isolated during the Multi-track Splitting step.
Since we isolated the backing vocals, you can mix them back into the step 1 instrumental track to create a new backing track. Then use ffmpeg to mix the dry vocals with the new backing track:
| |
Completely dry vocals can sound flat. Adding some reverb makes them sound much better:
| |
Parameter breakdown for aecho=0.8:0.88:40:0.4:
- 0.8: In Gain (input volume going into the effect).
- 0.88: Out Gain (total volume after the reverb effect).
- 40: Delays (in milliseconds, simulating sound bouncing off walls).
- 0.4: Decays (decay factor, giving it a subtle trailing tail).
These settings mimic a basic studio. For a stage/hall effect, use aecho=0.8:0.88:80:0.5. For subtle touch-ups, use aecho=0.8:0.88:35:0.25.
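The three presets can be kept in one place with a tiny helper that builds the aecho filter string. The preset names (studio, hall, subtle) and the helper name are labels chosen for this sketch; the numbers are the ones given above.

```python
def aecho_filter(in_gain, out_gain, delay_ms, decay):
    """Build an aecho filter string in the order in_gain:out_gain:delays:decays."""
    return f"aecho={in_gain:g}:{out_gain:g}:{delay_ms:g}:{decay:g}"


REVERB_PRESETS = {
    "studio": aecho_filter(0.8, 0.88, 40, 0.4),
    "hall": aecho_filter(0.8, 0.88, 80, 0.5),
    "subtle": aecho_filter(0.8, 0.88, 35, 0.25),
}

for name, filt in REVERB_PRESETS.items():
    print(name, filt)
```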
The above is an equal mix. You can also mix them with custom volumes:
| |
This sets the first input (vocals) to 100% volume with [0:a]volume=1.0[v] and the second input (backing track) to 40% volume with [1:a]volume=0.4[b].
To apply reverb and adjust the backing track volume at the same time:
| |
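A sketch of how the combined filter graph might be put together: reverb and volume on the vocals, volume on the backing track, then a mix. The `duration=longest` option, the file names, and the helper name are assumptions for illustration.

```python
def build_final_mix_cmd(vocals, backing, dst,
                        vocal_vol=1.0, backing_vol=0.4,
                        reverb="aecho=0.8:0.88:40:0.4"):
    """Return an ffmpeg argv list that adds reverb to the vocals and mixes in the backing track."""
    graph = (
        f"[0:a]{reverb},volume={vocal_vol}[v];"        # vocals: reverb, then volume
        f"[1:a]volume={backing_vol}[b];"               # backing track: volume only
        "[v][b]amix=inputs=2:duration=longest[out]"    # mix until the longer input ends
    )
    return [
        "ffmpeg", "-i", vocals, "-i", backing,
        "-filter_complex", graph,
        "-map", "[out]", dst,
    ]


final_cmd = build_final_mix_cmd("vocals_converted.wav", "backing.wav", "final_mix.wav")
print(final_cmd[final_cmd.index("-filter_complex") + 1])
```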
Conclusion
I tried about three models with this fully automated workflow. The results aren’t perfect. For some tracks, manual tuning and cleanup in a DAW are still needed for top-tier results.
Also, bs-roformer-infer, used here, is a bit outdated: many models are hard to find or have broken links. I’ll look into newer tools in the future and share my findings in a new post.
This guide isn’t flawless, but it’s a solid log of my learning process. After all, that’s what blogging is all about.