SoVITS on Older Machines
DIY stands for “Do It Yourself.” In a world where short videos show you how to do all sorts of wonderful and neat things, they often only touch on these things in the most basic and superficial way, rarely showing the true technical challenges of making them work. Are you tired of all those influencer YouTube channels hyping “free, easy, and even you can do this yourself”? That’s right, people. Just watch this video or that video. Like, comment, and subscribe, and boom, all your problems are solved. While helpful, these channels hardly ever deliver what they advertise. In today’s episode of DIY we are going to tackle voice cloning on your own personal computer, without the help of web-based AI, server farms, or cloud computing.
Is DIY Voice Cloning Really Worth It? Spoiler: Probably Not…
It all sounds so simple and exciting in the YouTube video: clone your voice (or someone else’s) without spending a dime! But as you dive in, reality hits. Those “free” tools like ElevenLabs only give you a few hundred words a month, enough for a fancy grocery list, unless you cough up $5–15 a month for their subscription. And even then, what do you get? Unless you work in business, give PowerPoint presentations two or three times a week, or do video work that requires narration, and your voice sounds like an overweight Jersey woman who smokes KOOL cigarettes and drinks a fifth of Old Grand-Dad daily, you probably do not need to clone your own voice. But hey, you might… Being the computer nerd and tinkerer that I am, I started to wonder: Can I do this myself? Could my personal computer handle it? Simple answer: it can, sort of, but not without a ton of effort, time, and frustration. The quality? Let’s just say it’s more “B-side garage demo” than well-produced professional “studio album.”
The Real Power Behind Pro Voice Cloning:
Let’s get one thing straight here. Companies like ElevenLabs don’t rely on the kind of
computers you or I have at home. They use server farms—huge clusters of
insanely powerful processors and GPUs working together, optimized for machine
learning tasks. My computer? It’s got perhaps 1% of the processing power needed
to compete, if I am lucky. To match ElevenLabs, I’d need 100 brand new systems,
not like mine, networked together, working perfectly in sync on these
processing tasks. Even with a dedicated GPU, my setup took half a day to
process a single voice cloning attempt. And while it worked, the results
lacked emotion and nuance. The clones were flat, monotone, and far from the
expressive quality you’d expect from a professional system.
SoVITS: A Powerful Tool… If You Can Handle It:
For my experiment here I used SoVITS, in the form of the GPT-SoVITS package.
The name combines SoftVC and VITS (Variational Inference with adversarial
learning for end-to-end Text-to-Speech). It is a tool for creating a clone of
a voice and then having that clone read whatever new content you give it. It
allows users to clone and transform voices, making it possible to replicate or
modify a voice for various purposes, from voiceover production to AI voice
assistants. It’s an impressive tool capable of replicating the pitch, tone, and
character of a voice. But, and it’s a big but, it’s not exactly user-friendly.
To get it running, you need a solid understanding of Python and the command
line (CMD), including how to navigate directories, paths, and executable
programs inside CMD, plus the patience of a saint.
Here’s The Catch:
SoVITS is built to leverage CPUs and GPUs for machine learning. If you’re like most
people using a standard PC with integrated graphics, you’re going to hit a
wall. Onboard GPUs just don’t have the muscle for this. My NVIDIA GeForce GTX
750 Ti, while far from top-of-the-line, made a noticeable difference in
processing speed and output quality. Without it? Forget it. The process
would’ve been painfully slow and the results even worse. I know because I
tried it that way first, months ago, before buying an older NVIDIA video card
to see the difference.
What Those YouTubers Won’t Tell You:
Here’s the part that really grinds my gears: those YouTube tutorials that make
it look so easy? Per usual, they are not giving you the full picture. They skip
over the hardware requirements, the endless troubleshooting, and the hours of
trial and error needed to get anything working. And if you’re running an older
computer like mine (a 12-year-old Intel i3 with 16GB of RAM), you’ll need to
make even more adjustments just to get SoVITS to run. I’ll be honest: even with
my experience and a lot of help from ChatGPT, this process was no walk in the
park. I spent days tweaking code, fixing errors, and learning Python on the
fly. If you’re not already familiar with these kinds of tools, the learning
curve can feel insurmountable. Still, I am glad I did it, because it gave me an
understanding of how these things work and what they can do. Even though the
result wasn’t worth all the effort, I am glad I know this, and I may still use
generated voice clones for various projects.
The Bottom Line:
Could you set up SoVITS and clone voices on your own computer? Sure, if you have the right
hardware, the technical know-how, and the patience to troubleshoot every step.
But should you? That’s another question entirely. It isn’t so much a “should”
as a “why?” The quality you’ll get from a DIY setup doesn’t hold a candle to
what professional systems can produce. For most people, the convenience of
cloud-based services like ElevenLabs outweighs the headaches of a DIY setup.
Yes, those services cost money, but they save you time and frustration—and the
results are worlds better. Unless you’re in it for the learning experience or
just love tinkering with tech, trying to do this on your own probably isn’t
worth the effort. If you want a completely free service that actually gives you
a good finished product, you can try new.text-to-speech.online/
It is 100% free, can be used commercially, and is the best one I have found.
It does not offer a way to clone your own voice or upload other voice clones,
but you can pick from their list of voices, which is pretty vast. It gives you
an .mp3 of whatever you type into the input box, up to 10 minutes at a time. It
will even read swears, which I love. You can change the speed, the pitch, and
in some instances the tone, from angry to happy to sad. You can download
directly from the site. There are no sign-ups, and processing takes seconds to
a minute depending on the amount of text you copy/paste into the input field.
I use this on most of my videos now. I am using it for the audio portion of
this essay.
Lite Tutorial:
I am not going to give you a full-blown tutorial on how to use the interface for SoVITS.
There are plenty of good ones on the Tube. That is where I got all this started.
If you want to know how to use it, send me a message and we can go through it.
Some steps I took may work for you. They may also not work for you. It all
depends on the hardware, software, and configuration of your personal computer.
I also do not work with Macs or other Apple computers. I have been and always
will be a Windows guy.
Software You May Need:
• 7-Zip or WinRAR
• Notepad++
• Audacity or other audio editing software
• Python 3.9
• SoVITS software package
• ChatGPT
• Word or any other document writing program for notes.
If you run another version of Python, that is OK. Look up how to use different versions of Python on your system. You can have several versions set up on your machine; you just need to specify which one you are using. We will need to use CMD to create a virtual environment for Python so that SoVITS can run correctly. You will need to run CMD in administrator mode. Once the virtual environment is set up and active, you should see (venv) at the start of your CMD prompt. You can follow along using the CMD commands listed further down.
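If you have several Python versions installed, it helps to confirm which one is actually answering before you install anything. This is only a minimal sketch: the function name `is_expected_python` is made up for illustration, and the 3.9 target simply mirrors the version in the software list above.

```python
import sys


def is_expected_python(version_info=sys.version_info, expected=(3, 9)):
    """Return True when the major.minor version matches the expected pair.

    `expected` defaults to (3, 9) only because that is the version listed
    in this essay; it is an assumption, not a SoVITS requirement check.
    """
    return tuple(version_info[:2]) == tuple(expected)


# Report which interpreter is running this script.
print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Matches 3.9:", is_expected_python())
```

Run it with the same `python` command you plan to use for the install steps, so you are testing the right interpreter.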
Instructions:
• Download GPT-SoVITS
• Unzip folder in a place you can navigate to easily in command prompt (CMD)
• Navigate to the unzipped SoVITS contents in Windows Explorer and find the .bat file called go-webui.bat.
• Right-click on the file and open in either Notepad++, Notepad or whatever
program you use to edit scripting code.
• We need to modify the language because the GUI is in Chinese.
• Change the text from webui.py zh_CN to webui.py en_US and save.
• Make sure it saves as a .bat file and not .txt at the end of the file extension.
• Now when you open the Web GUI most of what you need will be in English.
• Confirm PyTorch supports your GPU.
• Install Python dependencies for SoVITS
• Update Pip
• Install requirements.txt
• Double-check Pip install
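The language change in the instructions above can also be scripted instead of done by hand in Notepad. Treat the following as a rough sketch: the helper name `switch_webui_language` is hypothetical, and it assumes the launcher contains the exact text `webui.py zh_CN` as described above. It writes a `.bak` copy first so you can undo it.

```python
from pathlib import Path


def switch_webui_language(bat_path, old="webui.py zh_CN", new="webui.py en_US"):
    """Swap the language argument inside the launcher; return True if changed.

    Works on raw bytes so the Windows line endings in the .bat file are
    left exactly as they were.
    """
    path = Path(bat_path)
    raw = path.read_bytes()
    old_bytes, new_bytes = old.encode("ascii"), new.encode("ascii")
    if old_bytes not in raw:
        return False  # nothing to change (already edited, or different text)
    # Keep a backup next to the original before overwriting it.
    path.with_name(path.name + ".bak").write_bytes(raw)
    path.write_bytes(raw.replace(old_bytes, new_bytes))
    return True
```

You would call it as `switch_webui_language(r"C:\path\to\GPT-SoVITS\go-webui.bat")`, with the path adjusted to wherever you unzipped the folder. Because it overwrites the file in place, the extension stays .bat.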
Download Links:
• Link to SoVITS: entry.co/GPT-SoVITS
• Alt-Link to SoVITS: huggingface.co/GPT-SoVITS
• Link to Python version 3.9: python.org/downloads/release/python-390/
• Link to your PyTorch settings: pytorch.org/get-started/locally/
CMD Commands:
Install Python dependencies for SoVITS - Update Pip:
python -m pip install --upgrade pip
Install Python dependencies for SoVITS - Confirm PyTorch supports your GPU:
python -m torch.utils.collect_env
Install Python dependencies for SoVITS - Confirm PyTorch supports your GPU (Alternate, installs the CUDA 11.8 build):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
PyTorch Alternative Downloads - If your CUDA version is 11.8:
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
PyTorch Alternative Downloads - If you don’t have CUDA or want CPU-only:
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
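After installing PyTorch with one of the commands above, a quick check from Python shows whether it can actually see your video card. This is a minimal sketch, not part of SoVITS: the function name `describe_torch_device` is made up, and it simply reports "not installed" if PyTorch is missing.

```python
try:
    import torch  # only present after one of the pip installs above
except ImportError:
    torch = None


def describe_torch_device():
    """Summarize whether PyTorch is installed and whether it sees a CUDA GPU."""
    info = {
        "torch_installed": torch is not None,
        "cuda_available": False,
        "device_name": None,
        "compute_capability": None,
    }
    if torch is None:
        return info
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        # Device 0 is the first (usually the only) NVIDIA card in the machine.
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


print(describe_torch_device())
```

If `cuda_available` comes back False on a machine with an NVIDIA card, you most likely have the CPU-only build installed or a driver problem, and SoVITS will fall back to the painfully slow route.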
Install other dependencies for SoVITS - requirements.txt:
cd C:\Users\Your-PC\GPT-SoVITS
pip install -r requirements.txt
Double-check Pip install:
pip list
Creating Virtual Environment (these commands activate it once it exists):
cd C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts\
C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts>activate
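The two commands above activate a virtual environment that already exists in the venv folder. If that folder is not there yet, it has to be created first. The usual way is `python -m venv venv` from inside the GPT-SoVITS folder; the sketch below does the same thing through Python's standard `venv` module. The helper name `create_venv` is hypothetical, and the environment is built for whichever Python version runs the script.

```python
import venv
from pathlib import Path


def create_venv(project_dir, name="venv", with_pip=True):
    """Create a virtual environment folder inside project_dir and return its path.

    with_pip=True also bootstraps pip into the new environment, which is
    what you want before running `pip install -r requirements.txt`.
    """
    env_dir = Path(project_dir) / name
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(str(env_dir))
    return env_dir
```

For example, `create_venv(r"C:\Users\Your-PC\Voice Clone\GPT-SoVITS")` would produce the venv\Scripts folder that the activate command expects on Windows.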
Opening SoVITS Web GUI Interface:
(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts>cd C:\Users\Your-PC\Voice Clone\GPT-SoVITS
(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS>go-webui.bat
You will see this message as go-webui.bat opens. This may take a minute.
(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS>runtime\python.exe webui.py en_US
Running on local URL: http://0.0.0.0:9874
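The 0.0.0.0 in that last line means the web UI is listening on every network interface of your machine; in the browser you would normally reach it at http://localhost:9874. On a slow computer it can be hard to tell whether the server has finished starting, so here is a small sketch that checks whether anything is answering on that port. The function name `webui_is_listening` is made up, and port 9874 is taken from the message above.

```python
import socket


def webui_is_listening(host="127.0.0.1", port=9874, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the port is closed or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True result only tells you that some program is listening on the port, not that the page has fully loaded, so treat it as a rough "is it up yet" check.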
Optional Edits of Key Files:
Edit .json file for fp16_run to be false (Optional):
• If you run into a rendering error you may need to do this. SoVITS uses mixed-precision (fp16) training, which can save memory but is not supported on GPUs with older architectures, like mine.
• For me, I had to go into the file GPT_SoVITS\configs\s2.json and edit it in Notepad.
• Look for "fp16_run": true and change it to false. Then save and try to run your render again.
• Gradio kept reverting to an older version, but that isn’t very important. I tried to update it a few times with no success: pip said it updated, and the new version shows up in the pip list, but when I actually render it reverts to the older version. I was still able to get everything working on the older Gradio. I just wanted the warning message to go away every time it ran, because CMD doesn’t break up text well for my poor vision. Newer rigs won’t have to do this.
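The fp16_run edit described above can also be done with a few lines of Python instead of Notepad. This is a sketch under assumptions: the helper names are made up, and because I am not certain where in s2.json the key sits, the code walks the whole file and flips every "fp16_run" it finds rather than assuming a location. Make a copy of the file before trying it.

```python
import json
from pathlib import Path


def _set_key(node, key, value):
    """Recursively set every occurrence of `key` to `value`; return how many changed."""
    changed = 0
    if isinstance(node, dict):
        for name, child in node.items():
            if name == key and child != value:
                node[name] = value
                changed += 1
            else:
                changed += _set_key(child, key, value)
    elif isinstance(node, list):
        for child in node:
            changed += _set_key(child, key, value)
    return changed


def disable_fp16(config_path):
    """Turn off fp16_run in a JSON config; return the number of values changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    changed = _set_key(config, "fp16_run", False)
    if changed:
        text = json.dumps(config, indent=2, ensure_ascii=False) + "\n"
        path.write_text(text, encoding="utf-8")
    return changed
```

Calling `disable_fp16(r"GPT_SoVITS\configs\s2.json")` from inside the GPT-SoVITS folder would make the same change as the manual edit. Note that it rewrites the file with fresh indentation, so the layout may look different afterward even though the settings are the same.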
Now a new web browser tab should open up, and you will be using the SoVITS interface from here on out. The mission of this essay, more or less, was to get this thing installed, configured, and working properly. It took way longer than those YouTube tutorials will tell you. The process does work. The issue is computer processing power.
At this point you can go and find other tutorials on how to use SoVITS. It’s pretty basic. I will leave a very short guide here, but you should treat this as a great exercise in computer operations. If you have a gaming rig in the $5,000 range, you may want to do this for your audio ventures. But I’d like to mention that if you can afford a 5K computer tower, you probably know how to use that son of a bitch properly and none of this will be an issue for you. I have never personally seen a single computer tower do some of the things I see. I know they are out there, or people are building their own servers out of desktops to help out.
I just do not think this was a great use of the time I put into it. Granted, I just wanted to see if I could get it to work. If it works and the quality still isn’t very great, I will chalk it up to old hardware. I cloned Leonard Nimoy; Patrick Bateman, the character from the 2000 film “American Psycho”; an AI-cloned version of Patrick Bateman; and myself. The quality isn’t terrible, but if I were to get serious about using my own voiceovers for stuff I’d probably go and get an ElevenLabs subscription.
The Step-by-Step Guide below will be divided up and read in these different voices on the audio/video version. The rest of this blog was narrated by the AI voice from new.text-to-speech.online/
Step-by-Step Guide: Cloning Voices Using SoVITS Web UI
- Download SoVITS Web UI:
  - Click on the first link in the video description to download SoVITS.
  - After downloading, extract the files into a new folder on your computer.
- Run SoVITS Web UI:
  - Open the folder where the files were extracted.
  - Run the file named go-webui.bat.
- Launch the Web UI:
  - After running the file, click the Run button.
  - Allow the necessary permissions when prompted.
  - Wait a moment and the SoVITS web interface will open.
- Voice Segmentation:
  - Use an audio file that is at least 1 minute long.
  - Copy the path to your audio file and paste it in the corresponding field in the SoVITS interface.
  - Leave the rest of the settings unchanged.
  - Ensure the voice in the file has a normal speed and natural pauses; otherwise, segmentation may fail.
- Start Segmentation:
  - Click the Start button to begin voice segmentation.
  - A message will confirm that the segmentation process was successful.
- Check Segmentation Output:
  - Open the output folder and then navigate to the slicer_opt folder.
  - Here, you’ll find that the voice has been divided into segments.
  - Copy the path to this folder and paste it in the corresponding field in the web interface.
  - Click Start again to proceed.
- Verify Text Recognition:
  - After processing, a .list file will appear in the output folder under asr_opt.
  - Check the content of this file by opening it in Notepad.
  - Copy the text and paste it into the field in the SoVITS interface.
  - Click Play to listen to the segments and verify that the text was recognized correctly.
- Enter Model Information:
  - Go to the TTS tab in the web UI.
  - Enter the name of your model (e.g., "Name AI") in the first field.
  - The second field will automatically detect your graphics card. Do not change anything here.
  - Paste the .list file path from the asr_opt folder into the left field.
  - Paste the slicer_opt folder path into the right field.
  - Leave all other options as default.
  - Click the Start Formatting button to format the data for training.
- Start Training the Model:
  - Go to the Training tab.
  - Click the Start SoVITS Training button.
  - The training process will begin, and you’ll be able to monitor its progress through the console.
- Start GPT Training:
  - Once SoVITS training is completed, click Start GPT Training to train the model further.
  - Once finished, you'll receive a success message.
- Select the Model:
  - Go to the 1C tab and click the Refresh button to update the model paths.
  - Select the model with the highest value in both fields.
  - Click the Open TTS button to open the TTS web UI.
- Generate Audio:
  - Load an audio file from the slicer_opt folder into the new TTS interface.
  - Open the asr_opt folder and copy the text from the .list file.
  - Paste this text into the input field for the TTS model.
  - Set the language to English and click Start to generate the voice output.
- Final Adjustments:
  - For more natural-sounding speech, adjust settings like Slice by English.
  - Review the generated voice output. The voice should now sound similar to the original with high quality.
- Result:
  - Listen to the final output. The cloned voice should be very similar to the original, with natural prosody and tone.
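For the Verify Text Recognition step in the guide above, the .list file can be read with a short script instead of squinting at it in Notepad. This is only a sketch and rests on an assumption I have not confirmed against every SoVITS version: that each line holds four fields separated by the | character, in the order audio path, speaker name, language, and recognized text. The function name `read_list_file` is made up for illustration.

```python
def read_list_file(list_path):
    """Parse a .list transcript file into a list of dictionaries.

    ASSUMED line format: audio_path|speaker|language|text
    Lines that do not have four fields raise ValueError, so a format
    mismatch shows up right away instead of producing garbage.
    """
    entries = []
    with open(list_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split("|", 3)  # text may itself contain "|"
            if len(parts) != 4:
                raise ValueError(f"line {line_number}: expected 4 fields, got {len(parts)}")
            audio_path, speaker, language, text = parts
            entries.append({
                "audio_path": audio_path,
                "speaker": speaker,
                "language": language,
                "text": text,
            })
    return entries
```

Looping over the result and printing each `text` next to its `audio_path` makes it easy to spot segments where the recognition got the words wrong before you spend hours training on them.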
by David-Angelo Mineo
12/9/2024
2,741 Words