SecHeadset: A Practical Privacy Protection System for Real-time Voice Communication

SecHeadset is a practical, plug-and-play, and end-to-end privacy protection hardware system for real-time voice communication. It can prevent any entities other than the two communication parties from eavesdropping on the communication. SecHeadset is compatible with various VoIP applications, such as Telegram, Skype, and WhatsApp. It is also compatible with different communication types, such as VoIP and voice message.

Basic Idea

SecHeadset works as an audio relay between users and the communication device. The basic idea is to obfuscate user's voice before sending it to the communication device. In the receiver side, SecHeadset first retrieve the clean audio from the obfuscated audio and then plays it back to users. Specifically, we add phoneme-based noises to the voice for obfuscation. This type of noise has characteristics similar to human voice, which makes it robust to existing deep learning-based denoising techniques.

Illustration of the original, the obfuscated, and the retrieved voices

The following figure illustrates the spectrum and the time domain signal of the original, the obfuscated, and the retrieved voices processed by SecHeadset.

Audio Examples

Obfuscated Audio Examples in VoIP Channels

Applications	260-123440-0002	1188-133604-0015	3570-5696-0000	3575-170457-0024
Skype
Telegram
Viber
Messenger
Line
DingTalk

The original audios are sampled from the test-clean subset of LibriSpeech

Retrieved Clean Audio in VoIP Channels

Applications	260-123440-0002	1188-133604-0015	3570-5696-0000	3575-170457-0024
Skype
Telegram
Viber
Messenger
Line
DingTalk

Retrieved Clean Audio in Voice Message Channels

Applications	260-123440-0002	1188-133604-0015	3570-5696-0000	3575-170457-0024
Skype
Telegram
Viber
Messenger
Line
DingTalk
WhatsApp
WeChat

Retrieved Clean Audio in Telegram VoIP with Different Network Bandwidths

Bandwith	260-123440-0002	1188-133604-0015	3570-5696-0000	3575-170457-0024
1 mbps
500 kbps
100 kbps
60 kbps

Retrieved Clean Audio Under Defferent Noise Energy Levels (Telegram VoIP Channel)

SNR	260-123440-0002	1188-133604-0015	3570-5696-0000	3575-170457-0024
-2
-4
-6
-8
-9
-10
-12

The following figure shows the spectrogram of the first 2 seconds of the retrieved audio (3570-5696-0000) under different obfuscation noise levels. And the red rectangle marks out some typical regions that indicates the effect of noise levels. With the increase of noise levels, the spectrogram becomes more blurred and noising, especially for the harmonic in voices, making the voice harder to be recognized.