Communication with the ASR server using WebSocket

1. The client connects to the ASR server, which waits for connections at /classify/asr (wss://).
2. When establishing the connection, the client sets and sends the following information in the HTTP headers:
  • The audio format that will be streamed to the server. The client specifies it in the Content-Type header. Supported values:
    – audio/l16;rate=8000: PCM audio samples at 8 kHz;
    – audio/l16;rate=16000: PCM audio samples at 16 kHz;
    – audio/x-alaw-basic: A-law codec;
    – audio/basic: mu-law codec;
    – audio/flac: FLAC codec.
  • The identifier of the project the client connects to: X-Voicelab-Pid: PID, where PID is the appropriate project number (in our case PID = 109).
  • The password of the project the client connects to: X-Voicelab-Password: PASS, where PASS is the appropriate password (in our case PASS = fbcd6fbb37a10a6d44467918a67d6c54).
  • The name of the configuration the client connects to: X-Voicelab-Conf-Name: CONF-NAME, where CONF-NAME is the appropriate configuration name:
    – 8000_pl_PL or 16000_pl_PL for Polish (8 kHz or 16 kHz sample rate),
    – 8000_en_US or 16000_en_US for English (8 kHz or 16 kHz sample rate),
    – 8000_ru_RU or 16000_ru_RU for Russian (8 kHz or 16 kHz sample rate),
    – 16000_de_DE for German (16 kHz sample rate),
    – 8000_it_IT for Italian (8 kHz sample rate).
3. The client sends successive audio packets as WebSocket binary messages. After the audio stream ends, the client sends a four-byte binary message in which all four bytes are zero; this is the transmission termination mark. If the last audio data message would itself consist of exactly four zero bytes, split it into two messages (for example, two bytes each) and then send the four zero bytes that terminate the transmission.
4. The ASR server returns the recognition as a JSON document (WebSocket text message) of the form:

            {
                "status": "string: OK or ERROR",
                "shift": "string: how many words will come in recognition",
                "words": "array: list of recognized words",
                "start": "(**) array: list with words' start times",
                "end": "(**) array: list with words' end times",
                "error": "(*) string: type of error",
                "description": "(*) string: short description of error"
            }
Fields marked with (*) are optional and appear only if the status is other than "OK". If the status is other than "OK", the connection is terminated. The error type is either "BadRequest" or "Forbidden". Fields marked with (**) are optional and appear depending on the server configuration.
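The request side of the protocol (the connection headers from step 2 and the audio framing from step 3) might be sketched in Python as follows. This is only an illustration under stated assumptions, not an official client: the names build_headers and frame_audio and the chunk_size default are hypothetical, and the sketch prepares headers and binary messages without opening a connection, leaving the transport to whichever WebSocket library is used.

```python
TERMINATOR = b"\x00\x00\x00\x00"  # four zero bytes mark the end of the stream


def build_headers(pid, password, conf_name, content_type):
    """Return the HTTP headers sent while opening the WebSocket connection."""
    return {
        "Content-Type": content_type,
        "X-Voicelab-Pid": str(pid),
        "X-Voicelab-Password": password,
        "X-Voicelab-Conf-Name": conf_name,
    }


def frame_audio(audio, chunk_size=3200):
    """Yield the binary messages for `audio`, followed by the terminator.

    chunk_size is an arbitrary value chosen for this sketch.
    """
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        if chunk == TERMINATOR:
            # Audio that happens to be exactly four zero bytes would look
            # like the terminator, so send it as two two-byte messages.
            yield chunk[:2]
            yield chunk[2:]
        else:
            yield chunk
    yield TERMINATOR


# "PASS" stands in for the real project password.
headers = build_headers(109, "PASS", "8000_pl_PL", "audio/l16;rate=8000")
```

Each value yielded by frame_audio would be sent as one WebSocket binary message; the headers would be passed to the library's connect call.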
  • An example of building the recognition from the interim results returned by the ASR server during dictation:
            {"status": "OK","shift": "1", "words": ["a"]} 
            {"status": "OK","shift": "0", "words": ["ala"]}
            {"status": "OK","shift": "1", "words": ["pięknie"]}
            {"status": "OK","shift": "2", "words": ["śpi", "je"]}
            {"status": "OK","shift": "-1", "words": ["śpiewa"]}
Final recognition: „ala pięknie śpiewa”.
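One reading of the example above is that each message changes the number of words in the recognition by shift and supplies the words that form its new tail. That interpretation is inferred from the sample messages only, not from a stated rule, so the following Python sketch should be treated as an assumption; the function name apply_result is hypothetical.

```python
import json


def apply_result(transcript, message):
    """Apply one JSON text message from the server to the running word list."""
    result = json.loads(message)
    if result["status"] != "OK":
        # On any non-OK status the server terminates the connection.
        raise RuntimeError(f'{result.get("error")}: {result.get("description")}')
    words = result["words"]
    # Assumed rule: the word count changes by `shift`, and the returned
    # words become the last len(words) entries of the recognition.
    new_length = len(transcript) + int(result["shift"])
    return transcript[:max(new_length - len(words), 0)] + words


messages = [
    '{"status": "OK", "shift": "1", "words": ["a"]}',
    '{"status": "OK", "shift": "0", "words": ["ala"]}',
    '{"status": "OK", "shift": "1", "words": ["pięknie"]}',
    '{"status": "OK", "shift": "2", "words": ["śpi", "je"]}',
    '{"status": "OK", "shift": "-1", "words": ["śpiewa"]}',
]

transcript = []
for message in messages:
    transcript = apply_result(transcript, message)

print(" ".join(transcript))  # prints: ala pięknie śpiewa
```

Under this reading, a shift of 0 replaces the last word, and a shift of -1 with one word drops one word and rewrites the new last one, which reproduces the final recognition shown above.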