I cloned my voice with AI and it was unsettlingly good

Remember the “seeing doubles” scene from Mission: Impossible 3? That’s when Ethan Hunt (Tom Cruise) forces someone to read a peculiar poem. Allegedly, the poem contained all the allophones needed to clone the victim’s voice. At gunpoint, the man reads it, and after a few seconds of compiling, Hunt’s team has a perfect vocal copy.
That kind of technology exists now. You don’t need to be a secret agent or have access to classified government tech to use it. I know because I cloned my own voice. Of course, a voice as ordinary as mine won’t open top-secret doors, but can I use it to unlock my phone through Siri? As ridiculous as it sounds, I tried. What happened next was somewhere between unsettling and impressive—but even getting to that point was a story on its own.
Setting up Chatterbox for voice cloning
Easier said than done
I settled on Chatterbox. It’s a free and open-source TTS model, but the main reason I picked it was that nearly every other good voice cloning tool is hopelessly tied to NVIDIA. They need CUDA, which my RX 6700 XT doesn’t support.
I was almost finished setting up Chatterbox on Windows when I realized it still wouldn't work with my AMD card. My best bet was to set it up in WSL (Windows Subsystem for Linux) so I could use AMD's ROCm stack. Lo and behold, after hours of tinkering, I found out ROCm doesn't even support my card. What a bummer. I'd spent nearly a whole day downloading drivers and battling dependency errors, only to end up regretting buying an AMD GPU two years ago. But after burning an entire weekend on it, there was no way I was walking away empty-handed. I gave up on GPU acceleration and decided to run it on the CPU alone.

Chatterbox isn’t exactly plug-and-play. It’s meant to run locally, but it’s got plenty of moving parts under the hood. It runs on Python — and unfortunately, Python really hates me. I set up a virtual environment, installed every dependency manually, and chased down dozens of build errors that looked like they’d been sent straight from hell. But once everything was finally in place, it ran smoothly. I fired up the server with this command:
python server.py --host 0.0.0.0 --api-port 8000 --ui-port 7860
That spins up both the REST API and a Gradio-based web interface. It’s all running off the CPU, since my RX 6700 XT gets no ROCm love under WSL, but it still works surprisingly well — at the cost of my CPU fans spinning faster than I’ve ever heard them before.
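Since the server exposes a REST API on port 8000, you can script generation requests instead of clicking through the UI. The sketch below only assembles a request body; the endpoint path and field names are my assumptions for illustration, not the server's documented schema, so check the API docs your build exposes before relying on them.

```python
import json
from urllib import request

def build_tts_payload(text, voice="my-clone", seed=1234, temperature=0.8):
    """Assemble a request body for the local Chatterbox server.

    The field names here are illustrative placeholders; consult the
    real API schema exposed by your server build.
    """
    return {
        "text": text,
        "voice": voice,           # a voice previously added via cloning
        "seed": seed,             # pin this to make runs reproducible
        "temperature": temperature,
    }

payload = build_tts_payload("Hey Siri, how's the weather?")

# POSTing it to the server started above (commented out so the
# sketch runs without a live server):
# req = request.Request(
#     "http://localhost:8000/tts",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# audio = request.urlopen(req).read()
```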
Building the voice model
One clip, one seed, and a lot of CPU noise

Chatterbox, as I mentioned earlier, runs on a Gradio-based UI. It's surprisingly well thought out for something still under active development. The main page includes a text box for input, a list of predefined voices, parameters for tweaking generation and server behavior, and a voice cloning section. That last one was what I was most interested in. You upload a short clip (under 30 seconds), and Chatterbox trains a model that adds your voice to the list of selectable options.

The parameters are where the real fun begins. There are plenty to play with, along with a few presets like Standard Narration, Expressive Monologue, and Upbeat Advertisement. These adjust settings such as Temperature, Exaggeration, CFG Weight, and most importantly, the Generation Seed. Like most neural networks, Chatterbox has that familiar cocktail of randomness and temperature. This is something I’ve run into before with AI music generators. Even if you keep every parameter identical, your results will vary, because the seed changes. So, if you find a seed that sounds just right, write it down — you’ll thank yourself later.
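The seed behavior is the same one you see in any seeded sampler, and a toy stand-in makes it concrete. This isn't Chatterbox's code, just a minimal sketch of why pinning the seed reproduces a result while changing it doesn't:

```python
import random

def generate(seed, temperature=0.8, n=5):
    """Toy stand-in for a seeded sampler: same seed, same draws."""
    rng = random.Random(seed)
    # temperature scales the spread of the samples, as in real samplers
    return [round(rng.gauss(0, temperature), 4) for _ in range(n)]

a = generate(seed=42)
b = generate(seed=42)
c = generate(seed=43)

print(a == b)  # True: identical seed reproduces the output exactly
print(a == c)  # False: a new seed gives a new result
```

That's why writing down a good seed matters: it's the one knob that, held fixed alongside the other parameters, makes a generation repeatable.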
For a quick test, I took a four-second clip of Arthur Morgan’s voice from Red Dead Redemption 2 and fed it into Chatterbox, then had it read a short passage of text. You can listen to the result below.
It takes about 50 seconds to generate 160 characters of speech using a cloned voice. I'm sure it'd be much faster with GPU acceleration, but my AMD card rules that out. The CPU hits 100%, temperatures climb, and the fans spin like turbines. For those 50 seconds, my Core i5-13400 genuinely believes I'm playing Cyberpunk. But it's only 50 seconds.
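To put that CPU-only throughput in perspective, a quick back-of-the-envelope calculation from those two numbers:

```python
chars = 160      # characters generated
seconds = 50     # wall-clock time on CPU

rate = chars / seconds
print(f"{rate:.1f} characters/second")  # 3.2 characters/second

# Extrapolating (naively, assuming the rate stays linear) to a
# 1,000-character passage:
print(f"{1000 / rate:.0f} s")  # about 312 s, a bit over five minutes
```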
The Siri test
My clone meets Apple’s assistant

I spent a while cloning the voices of my family and friends, and then freaking them out by sending clips of things they’d never said. I did warn them it was machine-generated, though I kind of regret that now. It would’ve been fun to see whether they could tell. Human judgment is subjective anyway. The next best test was to see what a machine thought of the clone. Specifically, Siri. Would Siri activate on my iPhone if I generated a clip of my cloned voice saying, “Hey Siri, how’s the weather?”
I recorded the short voice memo above on my phone and fed it to Chatterbox. In Mission: Impossible 3, the target reads a linguist-crafted poem containing every allophone needed for a perfect voice match. I didn't bother with that, mostly because I realized the moment I start "narrating," my voice shifts away from how I naturally speak. (If you're curious, the linguist actually wrote about that poem on their blog.)
Long story short, it worked. My voice clone asked Siri about the weather, and Siri answered. When I tried the same command using a different cloned voice, Siri stayed silent. Then I had my clone ask Siri to call the emergency number — and it did. My original goal had been to build a TTS plugin for Obsidian and pair it with my voice-note setup, but without GPU support on AMD, that plan’s shelved. So this is about the extent of what I could get with Chatterbox TTS on my computer. It does make me wonder though. If I had smart locks on my doors, would my voice clone have been able to unlock them?