ByteDance Unveils StreamVoice: AI-Powered Live Voice Conversion Raises Deepfake Concerns and Misinformation Risks

ByteDance, the Chinese technology company behind TikTok, has unveiled StreamVoice, a generative-AI tool that lets users alter their voices to mimic other people.

StreamVoice is not yet available to the general public, but its introduction underscores how far AI development has come in making it easy to create audio and visual impersonations of public figures, commonly referred to as "deepfakes." Notable examples include AI-generated imitations of the voices of President Joe Biden and Taylor Swift, a concern that has grown as the 2024 election approaches.

The tool was developed by technical researchers from ByteDance and Northwestern Polytechnical University in China. Northwestern Polytechnical University, which is known for its collaborations with the Chinese military, should not be confused with Northwestern University in the United States.

In a recently published paper, the researchers highlight StreamVoice's capacity for "real-time conversion" of a user's voice into any desired target voice, requiring only a single speech sample from that target. The output is produced at livestreaming speed, with just 124 milliseconds of latency. That is a significant achievement given that AI voice conversion technologies have traditionally worked well only in offline scenarios.

The researchers attribute StreamVoice's success to recent advances in language models, which made it possible to build a tool that performs live voice conversion with high speaker similarity for both familiar and unfamiliar voices. Experiments described in the paper show that the tool can convert speech in a streaming setting while maintaining performance comparable to that of non-streaming voice conversion systems.

The paper states that StreamVoice was built using the "LLaMA architecture," a reference to Meta's Llama large language model, one of the most prominent models in the AI landscape. The researchers also incorporated open-source code from Meta's AudioDec, which Meta describes as a "plug-and-play benchmark for audio codec applications." The model was trained primarily on Mandarin speech datasets, along with a multilingual set featuring English, Finnish, and German.

The researchers do not prescribe specific use cases for StreamVoice, but they acknowledge potential risks, such as the spread of misinformation and phone fraud. They encourage users to report instances of illegal voice conversion to the appropriate authorities.

AI experts have long warned about the growing prevalence of deepfakes as the technology advances. In one recent incident, a robocall used a deepfake of President Biden's voice to urge people not to vote in the New Hampshire primary. Authorities are investigating the call, which underscores the need for vigilance as AI capabilities evolve.

Content generated using AI and reviewed by humans. Photo: DIW - AIGen
