
Use PocketSphinx with a streamed microphone over TCP with GStreamer

For a long time, I've been thinking about speech recognition for my home. I've checked the available options:

I chose the second option, because of privacy, and because it is trickier to implement (challenge!). Launching PocketSphinx on a Raspberry Pi is really fun, but slow: there is not enough computing power for that.

For every issue, a solution! Especially when you are working on a Linux distribution.

Stack

Libraries:
GStreamer, for the streaming pipelines.
PocketSphinx, for the speech recognition.

Devices:
A Raspberry Pi with a microphone, as the client.
A more powerful machine on the local network, as the server.

In the examples below, assume the server is 192.168.0.32 and that port 3000 is not already in use (I read somewhere that 3000 is not a conventional port for GStreamer).
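
If you want to verify that nothing already listens on port 3000 on the server, the ss tool from iproute2 is one way to check (3000 is just the port assumed in this article):

# list listening TCP sockets; no line mentioning :3000 means the port is free
ss -tln | grep 3000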

First test

To make sure GStreamer is working, we will simply stream a generated test sound from the client to the ALSA output of the server.

Server configuration

On the server, we open the tunnel:

gst-launch-1.0 tcpserversrc host=192.168.0.32 port=3000 ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! audioconvert ! audioresample ! alsasink

Explanations:
tcpserversrc host=192.168.0.32 port=3000 listens on that address and port, and pushes the bytes it receives into the pipeline.
The audio/x-raw caps tell GStreamer how to interpret those raw bytes: 16-bit signed little-endian samples (format=S16LE), 44100 Hz, one channel. The endianness, signed, width and depth fields come from the old 0.10 caps syntax and are redundant with format=S16LE in GStreamer 1.0.
audioconvert and audioresample adapt the format and sample rate to whatever the sink accepts.
alsasink plays the result on the default ALSA output.

Client configuration

It's time to turn on the sound. On the client:

gst-launch-1.0 audiotestsrc ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! tcpclientsink host=192.168.0.32 port=3000

Explanations:
audiotestsrc generates a test signal (a sine tone by default).
The caps are the same as on the server, so both ends agree on the raw audio format.
tcpclientsink host=192.168.0.32 port=3000 connects to the server and sends the stream over TCP.

You should hear a test tone on the server, even though the sound is not generated there. Creepy and awesome!

If it does not work, check the installed packages. A good way is to use gst-inspect-1.0.
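
For example, gst-inspect-1.0 prints the details of an element when its plugin is installed, and an error otherwise. These are the elements used in this first test:

gst-inspect-1.0 tcpserversrc
gst-inspect-1.0 tcpclientsink
gst-inspect-1.0 audiotestsrc
gst-inspect-1.0 alsasink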

Stream the microphone

To stream the microphone, the only thing to do is to replace audiotestsrc with alsasrc.

gst-launch-1.0 alsasrc ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! tcpclientsink host=192.168.0.32 port=3000
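
By default, alsasrc captures from the default ALSA device. If your microphone is on another card, you can list the capture devices with arecord -l and pass the right one through the device property; hw:1,0 below is only an example, adjust it to your hardware:

arecord -l
gst-launch-1.0 alsasrc device=hw:1,0 ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! tcpclientsink host=192.168.0.32 port=3000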

Get transcription with PocketSphinx

PocketSphinx provides a GStreamer element, so speech recognition can be plugged directly into the pipeline. The client does not change: its only job is to stream the microphone, nothing more (dummy!).

On the server:

gst-launch-1.0 tcpserversrc host=192.168.0.32 port=3000 ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! audioconvert ! audioresample ! pocketsphinx ! fakesink

Explanations:
The beginning of the pipeline is the same as in the first test.
pocketsphinx runs speech recognition on the incoming audio and posts its hypotheses as element messages on the pipeline bus.
fakesink discards the audio: we only care about the recognition messages, not the sound itself.
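
As a side note, the pocketsphinx element can be pointed at custom models through properties such as hmm (acoustic model), lm (language model) and dict (pronunciation dictionary); the exact property names vary between PocketSphinx versions, so check gst-inspect-1.0 pocketsphinx on your system. The paths below are only placeholders:

gst-launch-1.0 tcpserversrc host=192.168.0.32 port=3000 ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! audioconvert ! audioresample ! pocketsphinx hmm=/path/to/acoustic-model lm=/path/to/language-model.lm dict=/path/to/dictionary.dic ! fakesink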

Now I can speak, but I cannot see any transcription. You can use this Python script on the server, in place of the gst-launch-1.0 command above, to print in your terminal what you are saying:

import gi
gi.require_version('Gst', '1.0')
from gi.repository import GLib, Gst

Gst.init(None)

loop = GLib.MainLoop()

def element_message(bus, msg):
    # The pocketsphinx element posts its results as element messages on the bus.
    struct = msg.get_structure()
    if struct is None or not struct.has_field('hypothesis'):
        return
    print("hypothesis= '%s'  confidence=%s\n"
          % (struct.get_value('hypothesis'), struct.get_value('confidence')))

# Same pipeline as the gst-launch-1.0 command, built from Python so we can watch the bus.
pipeline = Gst.parse_launch('tcpserversrc host=192.168.0.32 port=3000 ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! audioconvert ! audioresample ! pocketsphinx ! fakesink')

bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect('message::element', element_message)

pipeline.set_state(Gst.State.PLAYING)

loop.run()
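
Run this script on the server (for example python3 transcribe.py, assuming you saved it under that name and have PyGObject installed), then start the client pipeline: every hypothesis reported by PocketSphinx shows up in the terminal.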

Go further

Encryption should be added. The local network is relatively secure, especially over Ethernet (WiFi is not recommended, as it is easier to spoof), but it is not a fortress.
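
A simple way to get encryption without touching the pipelines is to tunnel the TCP stream through SSH. This is only a sketch: it assumes an SSH server runs on 192.168.0.32, user stands for a real account on it, and the server pipeline listens on 127.0.0.1 instead of its LAN address.

# on the client: forward local port 3000 to port 3000 on the server, through SSH
ssh -N -L 3000:127.0.0.1:3000 user@192.168.0.32

# still on the client, in another terminal: send the stream into the local end of the tunnel
gst-launch-1.0 alsasrc ! audio/x-raw, endianness=1234, signed=true, width=16, depth=16, rate=44100, channels=1, format=S16LE ! tcpclientsink host=127.0.0.1 port=3000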

With GStreamer, you can also stream video... A new road to explore...
