SoVITS on Older Machines

DIY stands for “Do It Yourself.” In a world where short videos show you how to do all sorts of wonderful and neat things, they often touch on those things in only the most basic and superficial way, rarely showing the true technical challenges of making them work. Are you tired of all those influencer YouTube channels hyping “free, easy, and even you can do this yourself”? That’s right, people. Just watch this video or that video. Like, comment, and subscribe, and boom, all your problems are solved. While helpful, these channels hardly ever deliver what they advertise. In today’s episode of DIY we are going to tackle voice cloning on your own personal computer, without the help of web-based AI, server farms, or cloud computing.

Is DIY Voice Cloning Really Worth It?
Spoiler: Probably Not…

It all sounds so simple and exciting in the YouTube video: clone your voice (or someone else’s) without spending a dime! But as you dive in, reality hits. Those “free” tools like ElevenLabs only give you a few hundred words a month, enough for a fancy grocery list, unless you cough up $5–15 a month for their subscription service. And even then, what do you get? Unless you work in business and do PowerPoint presentations two to three times a week, or do video work that requires narration while your own voice sounds like an overweight Jersey woman who smokes KOOL cigarettes and drinks a fifth of Old Grand-Dad daily, you probably do not need to clone your own voice. But hey, you might… Being the computer nerd and tinkerer that I am, I started to wonder: Can I do this myself? Could my personal computer handle it? Simple answer: it can, sort of, but not without a ton of effort, time, and frustration. The quality? Let’s just say it’s more “B-side garage demo” than well-produced professional “studio album.”

The Real Power Behind Pro Voice Cloning:
Let’s get one thing straight here. Companies like ElevenLabs don’t rely on the kind of computers you or I have at home. They use server farms—huge clusters of insanely powerful processors and GPUs working together, optimized for machine learning tasks. My computer? It’s got perhaps 1% of the processing power needed to compete, if I am lucky. To match ElevenLabs, I’d need 100 brand new systems, not like mine, networked together, working perfectly in sync on these processing tasks. Even with a dedicated GPU, my setup took half a day to process a single voice cloning attempt. And while it worked, the results lacked emotion and nuance. The clones were flat, monotone, and far from the expressive quality you’d expect from a professional system.

SoVITS: A Powerful Tool… If You Can Handle It:
For my experiment I used SoVITS. SoVITS builds on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), and the GPT-SoVITS package I used clones a voice so that it can read whatever new content you give it in that voice. It allows users to clone and transform voices, making it possible to replicate or modify a voice for various purposes, from voiceover production to AI voice assistants. It’s an impressive tool capable of replicating the pitch, tone, and character of a voice. But (and it’s a big but) it’s not exactly user-friendly. To get it running, you need a solid understanding of Python and the command line (CMD), including how to navigate directories, paths, and executable programs inside CMD, plus the patience of a saint.

Here’s The Catch:
SoVITS is built to leverage CPUs and GPUs for machine learning. If you’re like most people using a standard PC with integrated graphics, you’re going to hit a wall. Onboard GPUs just don’t have the muscle for this. My NVIDIA GeForce GTX 750 Ti, while far from top-of-the-line, made a noticeable difference in processing speed and output quality. Without it? Forget it. The process would have been painfully slow and the results even worse. I know because I tried it that way first, months ago, before buying an older NVIDIA video card to see the difference.

What Those YouTubers Won’t Tell You:
Here’s the part that really grinds my gears: those YouTube tutorials that make it look so easy? Per usual, they are not giving you the full picture. They skip over the hardware requirements, the endless troubleshooting, and the hours of trial and error needed to get anything working. And if you’re running an older computer like mine, a 12-year-old Intel i3 with 16GB of RAM, you’ll need to make even more adjustments just to get SoVITS to run. I’ll be honest: even with my experience and a lot of help from ChatGPT, this process was no walk in the park. I spent days tweaking code, fixing errors, and learning Python on the fly. If you’re not already familiar with these kinds of tools, the learning curve can feel insurmountable. However, I am glad I did this, because it gave me an understanding of how these things work and can work. Even though the results weren’t worth all my effort, I am glad I know this, and I may still use generated voice clones for various projects.

The Bottom Line:
Could you set up SoVITS and clone voices on your own computer? Sure, if you have the right hardware, the technical know-how, and the patience to troubleshoot every step. But should you? That’s another question entirely. It isn’t so much a “should” as a “why?” The quality you’ll get from a DIY setup doesn’t hold a candle to what professional systems can produce. For most people, the convenience of cloud-based services like ElevenLabs outweighs the headaches of a DIY setup. Yes, those services cost money, but they save you time and frustration, and the results are worlds better. Unless you’re in it for the learning experience or just love tinkering with tech, trying to do this on your own probably isn’t worth the effort.

If you want a completely free service that actually gives you a good finished product, you can try new.text-to-speech.online/ instead. It is 100% free, can be used commercially, and is the best one I have found. It does not offer a way to clone your own voice or upload voice clones, but you can use their list of voices, which is pretty vast. It gives you an .mp3 of whatever you type into the input box, up to 10 minutes at a time. It will even read swears, which I love. You can change the speed, the pitch, and in some instances the tone, from angry to happy to sad. You can download directly from the site. There are no sign-ups, and processing takes seconds to a minute depending on the amount of text you paste into the input field. I use this on most of my videos now, and I am using it for the audio portion of this essay.

Lite Tutorial:
I am not going to give you a full-blown tutorial on how to use the SoVITS interface. There are plenty of good ones on the Tube, and that is where I got all this started. If you want to know how to use it, send me a message and we can go through it. Some steps I took may work for you. They may also not work for you. It all depends on the hardware, software, and configuration of your personal computer. I also do not work with Macs or other Apple computers. I have been and always will be a Windows guy.

Software You May Need:
   • 7-Zip or WinRAR
   • Notepad++
   • Audacity or other audio editing software
   • Python 3.9
   • SoVITS software package
   • ChatGPT
   • Word or any other document writing program for notes.

If you run another version of Python, that is OK. Look up how to use different versions of Python on your system. You can have several versions set up on your machine; you just need to specify which one you are using. We will need to use CMD to create a virtual environment for Python so that SoVITS can run correctly. You will need to run CMD in administrator mode. Once the virtual environment is set up and active, you should see (venv) at the start of your CMD prompt. You can follow along using the commands listed under CMD Commands below:

Instructions:
   • Download GPT-SoVITS
   • Unzip folder in a place you can navigate to easily in command prompt (CMD)
   • Navigate to the unzipped SoVITS contents in Windows Explorer and find the .bat file called go-webui.bat.
   • Right-click on the file and open in either Notepad++, Notepad or whatever program you use to edit scripting code.
   • We need to modify the language because the GUI is in Chinese.
   • Change the text from webui.py zh_CN to webui.py en_US and save.
   • Make sure it saves as a .bat file and not .txt at the end of the file extension.
   • Now when you open the Web GUI most of what you need will be in English.
   • Confirm PyTorch supports your GPU.
   • Install Python dependencies for SoVITS
   • Update Pip
   • Install requirements.txt
   • Double-check Pip install
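
If you would rather not hand-edit the launcher, the language change from the list above can be sketched in a few lines of Python. This is only a rough sketch: the install path is a placeholder of mine, and it assumes the launcher contains the text webui.py zh_CN exactly as described in the steps. It works on raw bytes so the file's encoding and line endings are left alone.

```python
from pathlib import Path


def switch_ui_language(bat: bytes, old: bytes = b"zh_CN", new: bytes = b"en_US") -> bytes:
    """Return the launcher contents with the language code after webui.py swapped."""
    return bat.replace(b"webui.py " + old, b"webui.py " + new)


def patch_launcher(bat_path: Path) -> bool:
    """Rewrite the launcher in place; return True only if something changed."""
    original = bat_path.read_bytes()
    patched = switch_ui_language(original)
    if patched == original:
        return False
    bat_path.write_bytes(patched)
    return True


if __name__ == "__main__":
    # Placeholder path (an assumption): point this at your own unzipped folder.
    launcher = Path(r"C:\Users\Your-PC\GPT-SoVITS\go-webui.bat")
    if launcher.exists():
        print("patched" if patch_launcher(launcher) else "nothing to change")
    else:
        print(f"{launcher} not found; adjust the path first")
```

Because it writes the same file name back, there is no risk of the file ending up as .txt by accident.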

Download Links:
   • Link to SoVITS: entry.co/GPT-SoVITS
   • Alt-Link to SoVITS: huggingface.co/GPT-SoVITS
   • Link to Python version 3.9: python.org/downloads/release/python-390/
   • Link to your PyTorch settings: pytorch.org/get-started/locally/

CMD Commands:
Install Python dependencies for SoVITS — Update Pip:
python -m pip install --upgrade pip

Install Python dependencies for SoVITS — Confirm PyTorch supports your GPU:
python -m torch.utils.collect_env

Install PyTorch (if your CUDA version is 11.8):
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install PyTorch (if you don’t have CUDA or want CPU-only):
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
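
To make the choice between those two downloads a little less of a guess, here is a small Python sketch. It builds the install command from the two index URLs above and reports what an installed PyTorch can see. The function names are mine, and falling back to the CPU build for any CUDA version other than 11.8 is my own assumption; pytorch.org lists the other builds.

```python
from typing import Optional

# Only the two wheel indexes mentioned in this essay.
WHEEL_INDEXES = {
    "11.8": "https://download.pytorch.org/whl/cu118",
    None: "https://download.pytorch.org/whl/cpu",
}


def pip_command(cuda_version: Optional[str]) -> str:
    """Build the pip command; unknown CUDA versions fall back to CPU (an assumption)."""
    index = WHEEL_INDEXES.get(cuda_version, WHEEL_INDEXES[None])
    return f"python -m pip install torch torchvision torchaudio --index-url {index}"


def torch_report() -> dict:
    """Describe the PyTorch install, if any, without crashing when it is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    report = {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    print(pip_command("11.8"))
    print(torch_report())
```

If cuda_available comes back False on a machine with an NVIDIA card, the CPU-only build was probably the one that got installed.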

Install other dependencies for SoVITS — requirements.txt:
cd C:\Users\Your-PC\GPT-SoVITS

pip install -r requirements.txt

Double-check Pip install:
pip list
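
One thing pip list cannot tell you is which Python it is talking about. It only reports packages for the interpreter that pip belongs to. The short sketch below prints the interpreter path, whether a virtual environment is active, and the versions that interpreter actually sees. The package names checked are just examples I picked. Running it with each Python on your machine is one way to spot a mismatch.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if this Python lacks it."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def in_virtualenv() -> bool:
    """True when this interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
    for name in ("torch", "gradio"):  # example packages only
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If two interpreters report different versions of the same package, then an update made with pip in one of them will not show up when the other one runs the program.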

Creating and Activating the Virtual Environment:
cd C:\Users\Your-PC\Voice Clone\GPT-SoVITS
python -m venv venv
cd C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts\
C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts>activate

Opening SoVITS Web GUI Interface:
(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS\venv\Scripts>cd C:\Users\Your-PC\Voice Clone\GPT-SoVITS

(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS>go-webui.bat

You will see this message appear as go-webui.bat starts. This may take a minute.

(venv) C:\Users\Your-PC\Voice Clone\GPT-SoVITS>runtime\python.exe webui.py en_US

Running on local URL: http://0.0.0.0:9874
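
The 0.0.0.0 in that message means the server is listening on every network interface; in a browser on the same machine you would normally open http://localhost:9874. If no tab opens on its own, a quick way to tell whether the web UI is up yet is to test the port. This is a minimal sketch using only Python's standard library, with the port number taken from the message above.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_listening("127.0.0.1", 9874):
        print("Web UI is up: open http://localhost:9874")
    else:
        print("Nothing on port 9874 yet; give it another minute")
```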

Optional Edits of Key files:
Edit .json file for fp16_run to be false (Optional):

• If you run into a rendering error, you may need to do this. SoVITS uses mixed-precision training, which can save memory but is not supported on GPUs with older architectures like mine.

• For me, I had to go into the file GPT_SoVITS\configs\s2.json and edit it in Notepad.

• Look for "fp16_run": true and change it to false. Then save and try to run your render again.

• Gradio kept reverting to an older version, but that isn’t very important. I was still able to get everything working even though Gradio was on an older version, and I tried to update it a few times with no success. It said it updated, and the new version shows up in the pip list, but when I actually render, it reverts to the older version. I just wanted the warning message to go away every time it ran, because CMD doesn’t break up text well for my poor vision. Newer rigs built for this purpose won’t have to do any of this.
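
The fp16_run edit described above can also be scripted. The sketch below is a cautious version: it does not assume where in the JSON the setting lives, so it walks the whole file and flips every "fp16_run": true it finds, and it saves a .bak copy first. The function names are mine, and the relative path is the one from this essay; run it from inside your SoVITS folder or adjust the path.

```python
import json
from pathlib import Path


def disable_fp16(node) -> int:
    """Set every "fp16_run": true to false, at any depth; return how many changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "fp16_run" and value is True:
                node[key] = False
                changed += 1
            else:
                changed += disable_fp16(value)
    elif isinstance(node, list):
        for item in node:
            changed += disable_fp16(item)
    return changed


def patch_config(path: Path) -> int:
    """Back up the config next to itself, then write the patched version."""
    raw = path.read_text(encoding="utf-8")
    config = json.loads(raw)
    changed = disable_fp16(config)
    if changed:
        path.with_suffix(path.suffix + ".bak").write_text(raw, encoding="utf-8")
        path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return changed


if __name__ == "__main__":
    config_path = Path(r"GPT_SoVITS\configs\s2.json")
    if config_path.exists():
        print(f"changed {patch_config(config_path)} setting(s)")
    else:
        print(f"{config_path} not found; run this from the SoVITS folder")
```

If the render still fails afterward, the s2.json.bak copy lets you put the original file back.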

Now a new web browser tab should open, and you will be using the SoVITS interface from here on out. The mission of this essay, more or less, was to get this thing installed, configured, and working properly. That took way longer than the YouTube tutorials will tell you. The process does work. The issue is having enough processing power. At this point you can go and find other tutorials on how to use SoVITS. It’s pretty basic. I will leave a very short guide here, an abbreviated breakdown of the process, but you should treat it as a good exercise in computer operations.

If you have a gaming rig in the $5,000 range, you may want to do this for your audio ventures. But I’d like to mention that if you can afford a 5K computer tower, you probably know how to use that son of a bitch properly, and none of this will be an issue for you. I have never personally seen one single computer tower do some of the things I see in these tutorials. I know they are out there, or people are building their own servers out of desktops to help out. I just do not think this was a great use of the time I put into it. Granted, I just wanted to see if I could get it to work. If it works and the quality still isn’t very great, then I will chalk that up to old hardware.

I cloned Leonard Nimoy; Patrick Bateman, the character from the 2000 film “American Psycho”; an AI-cloned version of Patrick Bateman; and myself. The quality isn’t terrible, but if I were to get serious about using my own voiceovers for stuff, I’d probably go and get an ElevenLabs subscription. The Step-by-Step Guide below will be divided up and read in these different voices on the audio/video version. The rest of this blog was narrated by the AI voice from new.text-to-speech.online/


Step-by-Step Guide: Cloning Voices Using SoVITS Web UI


  1. Download SoVITS Web UI:
    • Download SoVITS using one of the links under Download Links above.
    • After downloading, extract the files into a new folder on your computer.
  2. Run SoVITS Web UI:
    • Open the folder where the files were extracted.
    • Run the file named go-webui.bat.
  3. Launch the Web UI:
    • After running the file, click the Run button.
    • Allow the necessary permissions when prompted.
    • Wait for a moment and the SoVITS web interface will open.
  4. Voice Segmentation:
    • Upload an audio file that is at least 1 minute long.
    • Copy the path to your audio file and paste it in the corresponding field in the SoVITS interface.
    • Leave the rest of the settings unchanged.
    • Ensure the voice in the file has a normal speed and natural pauses; otherwise, segmentation may fail.
  5. Start Segmentation:
    • Click the Start button to begin voice segmentation.
    • A message will confirm that the segmentation process was successful.
  6. Check Segmentation Output:
    • Open the output folder and then navigate to the slicer_opt folder.
    • Here, you’ll find that the voice has been divided into segments.
    • Copy the path to this folder and paste it in the corresponding field in the web interface.
    • Click Start again to proceed.
  7. Verify Text Recognition:
    • After processing, a .list file will appear in the output folder under asr_opt.
    • Check the content of this file by opening it in Notepad.
    • Copy the text and paste it into the field in the SoVITS interface.
    • Click Play to listen to the segments and verify that the text was read correctly.
  8. Enter Model Information:
    • Go to the TTS tab in the web UI.
    • Enter the name of your model (e.g., "Name AI") in the first field.
    • The second field will automatically detect your graphics card. Do not change anything here.
    • Paste the .list file path from the asr_opt folder into the left field.
    • Paste the slicer_opt folder path into the right field.
    • Leave all other options as default.
    • Click the Start Formatting button to format the data for training.
  9. Start Training the Model:
    • Go to the Training tab.
    • Click the Start SoVITS Training button.
    • The training process will begin, and you’ll be able to monitor its progress through the console.
  10. Start GPT Training:
    • Once SoVITS training is completed, click Start GPT Training to train the model further.
    • Once finished, you'll receive a success message.
  11. Select the Model:
    • Go to the 1C tab and click the Refresh button to update the model paths.
    • Select the model with the highest value in both fields.
    • Click the Open TTS button to open the TTS web UI.
  12. Generate Audio:
    • Load an audio file from the slicer_opt folder into the new TTS interface.
    • Open the asr_opt folder and copy the text from the .list file.
    • Paste this text into the input field for the TTS model.
    • Set the language to English and click Start to generate the voice output.
  13. Final Adjustments:
    • For more natural-sounding speech, adjust settings like Slice by English.
    • Review the generated voice output. The voice should now sound similar to the original with high quality.
  14. Result:
    • Listen to the final output. The cloned voice should be very similar to the original, with natural prosody and tone.
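
Step 4 asks for an audio file at least 1 minute long, and it is easy to check that before uploading. The sketch below uses Python's built-in wave module, so it only works on uncompressed .wav files (my assumption; convert an .mp3 in Audacity first). The file name is a made-up example.

```python
import os
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a WAV file in seconds: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def long_enough(path: str, minimum_seconds: float = 60.0) -> bool:
    """True if the recording meets the minimum length from step 4."""
    return wav_duration_seconds(path) >= minimum_seconds


if __name__ == "__main__":
    sample = "my_voice_sample.wav"  # hypothetical file name
    if os.path.exists(sample):
        seconds = wav_duration_seconds(sample)
        verdict = "OK" if long_enough(sample) else "too short"
        print(f"{sample}: {seconds:.1f} s, {verdict}")
    else:
        print(f"{sample} not found; point this at your own recording")
```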


by David-Angelo Mineo
12/9/2024
2,741 Words