Communication with the ASR server using WebSocket
See our GitLab repository for a working example of how to use our API. For a full description, read below.
1. The client connects to the ASR address /classify/asr (wss://demo.voicelab.ai/classify/asr), which waits for connections.
2. When establishing the connection, set and send the appropriate information in the HTTP headers:
- Information about the audio format that will be streamed to the server. To this end, the client sets content-type. Supported arguments:
– audio/l16;rate=8000 – audio samples in 8 kHz PCM encoding;
– audio/l16;rate=16000 – audio samples in 16 kHz PCM encoding;
– audio/x-alaw-basic – A-law codec;
– audio/basic – mu-law codec;
– audio/flac – FLAC codec.
- Project identifier with which the client will connect: X-Voicelab-Pid: PID, where PID is the appropriate project number (in our case PID = 109).
- Project password with which the client will connect: X-Voicelab-Password: PASS, where PASS is the appropriate password (in our case PASS = fbcd6fbb37a10a6d44467918a67d6c54).
- Configuration name with which the client will connect: X-Voicelab-Conf-Name: CONF-NAME, where CONF-NAME is the appropriate configuration name:
– 8000_pl_PL or 16000_pl_PL for Polish (8 kHz or 16 kHz sample rate),
– 8000_en_US or 16000_en_US for English (8 kHz or 16 kHz sample rate),
– 8000_ru_RU or 16000_ru_RU for Russian (8 kHz or 16 kHz sample rate),
– 16000_de_DE for German (16 kHz sample rate),
– 8000_it_IT for Italian (8 kHz sample rate).
3. The client sends successive audio packets as WebSocket binary messages. After the audio stream is complete, it sends a four-byte binary message in which all four bytes are zeros; this is the end-of-transmission marker. (If the last audio data message would itself consist of exactly four zero bytes, split it into two messages, e.g. two bytes each, and only then send the four zero bytes terminating the transmission.)
4. The ASR returns the recognition as a JSON document (WebSocket text message) of the form:
```
{
  "status": "string: OK or ERROR",
  "shift": "string: how many words will come in the recognition",
  "words": "array: list of recognized words",
  "start": "(**) array: list of the words' start times",
  "end": "(**) array: list of the words' end times",
  "error": "(*) string: type of error",
  "description": "(*) string: short description of the error"
}
```
Fields marked with (*) are optional and appear only if the status is different from "OK". If the status is different from "OK", the connection is terminated; the type of error will be either "BadRequest" or "Forbidden". Fields marked with (**) are optional and appear depending on the server configuration.
- An example of building the recognition from the partial results returned by the ASR server during dictation:
```
{"status": "OK","shift": "1", "words": ["a"]}
{"status": "OK","shift": "0", "words": ["ala"]}
{"status": "OK","shift": "1", "words": ["pięknie"]}
{"status": "OK","shift": "2", "words": ["śpi", "je"]}
{"status": "OK","shift": "-1", "words": ["śpiewa"]}
```
Final recognition: "ala pięknie śpiewa".
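To make the flow above concrete, here is a minimal sketch of such a WebSocket client in Go. It is illustrative only: it assumes the third-party github.com/gorilla/websocket package, the asrUpdate struct name and the 4096-byte chunk size are our own choices, and it does not guard against the edge case from step 3 where the final audio chunk itself consists of four zero bytes.
```
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"

	"github.com/gorilla/websocket"
)

// asrUpdate mirrors the JSON document described in step 4.
type asrUpdate struct {
	Status      string   `json:"status"`
	Shift       string   `json:"shift"`
	Words       []string `json:"words"`
	Error       string   `json:"error"`
	Description string   `json:"description"`
}

func main() {
	// Authorization and audio-format headers from step 2.
	h := http.Header{}
	h.Set("Content-Type", "audio/l16;rate=16000")
	h.Set("X-Voicelab-Pid", "109")
	h.Set("X-Voicelab-Password", "fbcd6fbb37a10a6d44467918a67d6c54")
	h.Set("X-Voicelab-Conf-Name", "16000_pl_PL")

	conn, _, err := websocket.DefaultDialer.Dial("wss://demo.voicelab.ai/classify/asr", h)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Reader goroutine: print every partial update until the server closes.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return // connection closed by the server
			}
			var u asrUpdate
			if err := json.Unmarshal(msg, &u); err != nil {
				log.Println("bad update:", err)
				return
			}
			log.Printf("status=%s shift=%s words=%v", u.Status, u.Shift, u.Words)
		}
	}()

	// Stream audio from stdin as binary messages (step 3).
	buf := make([]byte, 4096)
	for {
		n, err := os.Stdin.Read(buf)
		if n > 0 {
			if werr := conn.WriteMessage(websocket.BinaryMessage, buf[:n]); werr != nil {
				log.Fatal(werr)
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	// Four zero bytes mark the end of the transmission.
	if err := conn.WriteMessage(websocket.BinaryMessage, []byte{0, 0, 0, 0}); err != nil {
		log.Fatal(err)
	}
	<-done // wait for the remaining updates
}
```
A production client would additionally apply the shift/words update rule (described for gRPC in section 1.3) to assemble the final recognition string from the partial updates.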
Communication with the ASR / NLU server using gRPC
The protocol buffer description for the gRPC API is in the file api/vlviapb.proto located in the GitLab repository.
1.1 Establishing communication
In addition, the following metadata is required to establish the connection:
- pid – project identifier (in our case it is 109).
- password – password (in our case it is fbcd6fbb37a10a6d44467918a67d6c54).
- content-type – format of the streamed data. Supported arguments:
– audio/l16;rate=8000 – audio samples in 8 kHz PCM encoding;
– audio/l16;rate=16000 – audio samples in 16 kHz PCM encoding;
– audio/x-alaw-basic – A-law codec;
– audio/flac – FLAC codec;
– audio/basic – mu-law codec.
- conf-name – configuration name; allows the user to choose the language and sample rate. This parameter has to be compatible with the sample rate of the streamed audio:
– 8000_pl_PL – Polish with 8 kHz sample rate;
– 16000_pl_PL – Polish with 16 kHz sample rate;
– 8000_en_US – English with 8 kHz sample rate;
– 16000_en_US – English with 16 kHz sample rate;
– 8000_ru_RU – Russian with 8 kHz sample rate;
– 16000_ru_RU – Russian with 16 kHz sample rate;
– 16000_de_DE – German with 16 kHz sample rate;
– 8000_it_IT – Italian with 8 kHz sample rate.
- no-input-timeout – number of milliseconds of silence after which the server ends the utterance if the user has not said anything (has not yet started speaking). A value of 0 disables the timeout.
- speech-complete-timeout – number of milliseconds of silence after which the server ends the utterance once the user has started speaking and then fallen silent. A value of 0 disables the timeout.

After the connection is established, the client sends audio samples in the format specified earlier by content-type. The metadata values used for configuration and authorization (pid, password) are provided in a separate file: metadata_config.json.
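A plausible shape for this file, assuming its keys mirror the metadata names above (the values are the example credentials used throughout this document):
```
{
  "pid": "109",
  "password": "fbcd6fbb37a10a6d44467918a67d6c54"
}
```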
Establishing communication in the sample application written in Go (grpc.go file):
```
func grpcClient(addr string, r io.Reader, c *config, m map[string]string) (string, time.Duration, error) {
	// ... (dialing conn, a *grpc.ClientConn for addr, omitted here) ...
	client := vlviapb.NewVLVIAClient(conn)
	// Collect the connection metadata described above.
	m["contenttype"] = c.MimeType
	if c.Pid != "" {
		m["pid"] = c.Pid
	}
	if c.Password != "" {
		m["password"] = c.Password
	}
	if c.ConfName != "" {
		m["conf-name"] = c.ConfName
	}
	if c.NoInputTimeout != 0 {
		m["no-input-timeout"] = strconv.Itoa(int(c.NoInputTimeout / time.Millisecond))
	}
	if c.SpeechCompleteTimeout != 0 {
		m["speech-complete-timeout"] = strconv.Itoa(int(c.SpeechCompleteTimeout / time.Millisecond))
	}
	// Attach the metadata to the outgoing context and open the stream.
	ctx := metadata.NewOutgoingContext(context.Background(), metadata.New(m))
	stream, err := client.RecognizeStream(ctx)
	if err != nil {
		return "", 0, err
	}
	// (continued in section 1.2)
```
1.2 Completing communication
Communication continues until the user calls the termination method. Recognition updates are sent by the server on an ongoing basis. If no-input-timeout or speech-complete-timeout have not been set to zero, check the value of the timeout field when receiving updates. As long as it is NO_TIMEOUT, you can continue to send samples. If you receive an update with the timeout value NO_INPUT_TIMEOUT or SPEECH_COMPLETE_TIMEOUT, it is the last update and you should stop sending samples. After sending a timeout different from NO_TIMEOUT, the server still receives samples (so that the client program does not get a write error) but no longer interprets them. After receiving such a timeout, stop sending samples and call the CloseSend method on the connection object; the server will then close the connection (the client program will receive an "error" EOF indicating the correct termination of the connection with the server).

Completion of communication is based on the cooperation of two goroutines: one sending audio data and the other receiving text data and information about the occurrence of a timeout. The receiving goroutine informs the sending goroutine about a timeout (or another end condition) using a channel called done. The goroutine sending audio data has the form (grpc.go file):
```
done := make(chan *output)
go formatResponse(stream, done, c.Verbose)
var out *output
b := make([]byte, 500)
frames := vlviapb.AudioFrames{}
For:
for {
	n, err := r.Read(b)
	if err != nil {
		if err == io.EOF {
			break
		}
		return "", 0, err
	}
	frames.Frames = b[:n]
	if err := stream.Send(&frames); err != nil {
		out = <-done
		return "", 0, err
	}
	// Non-blocking check for a timeout (or other end condition)
	// reported by the receiving goroutine.
	select {
	default:
	case o := <-done:
		if o.timeout != vlviapb.TimeoutType_NO_TIMEOUT {
			log.Println("timeout:", o.timeout)
		} else {
			out = o
		}
		break For
	}
}
if err := stream.CloseSend(); err != nil {
	return "", 0, err
}
t := time.Now()
if out == nil {
	out = <-done
	if out.timeout != vlviapb.TimeoutType_NO_TIMEOUT {
		log.Println("timeout:", out.timeout)
		out = <-done
	}
}
return out.recognition, time.Since(t), out.err
```
1.3 Formatting the response
The server response consists of partial recognition updates. A single update consists of two parameters:
- shift – the change in the total length of the recognition word table after the update. It may be negative. Together with the number of words in the update, it determines the new current recognition string.
- words – the table of words recognized in a given update.

These two parameters allow you to update the table of recognized words according to the following formula (see the sketch below the example):
- remove the last (W - shift) words from the current recognition chain, where W is the length of the words table of the given update;
- add the W words from the words table of the given update.

An example of building the recognition from partial results:
```
{"shift": 1, "words": ["a"], "result": ["a"]}
{"shift": 0, "words": ["ala"], "result": ["ala"]}
{"shift": 1, "words": ["pięknie"], "result": ["ala", "pięknie"]}
{"shift": 2, "words": ["śpi", "je"], "result": ["ala", "pięknie", "śpi", "je"]}
{"shift": -1, "words": ["śpiewa"], "result": ["ala", "pięknie", "śpiewa"]}
```
Final recognition: ala pięknie śpiewa.
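Expressed as code, the update rule is a small helper over the word table; the following is an illustrative sketch (the name applyUpdate is ours, not part of the API):
```
// applyUpdate applies a single partial update to the current word table:
// it drops the last (W - shift) words and appends the W new words.
func applyUpdate(words []string, shift int, update []string) []string {
	keep := len(words) + shift - len(update) // = len(words) - (W - shift)
	if keep < 0 || keep > len(words) {
		return words // malformed update; reported as an error in the real client
	}
	return append(words[:keep], update...)
}
```
Feeding the five updates from the example above through this helper reproduces exactly the result column shown on each line.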
Formatting the response in the example application written in Go – the goroutine receiving data from the ASR (grpc.go file):
```
func formatResponse(stream vlviapb.VLVIA_RecognizeStreamClient, done chan<- *output, verbose bool) {
	var words []string
	timeout := vlviapb.TimeoutType_NO_TIMEOUT
	for {
		update, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Println("recv error: ", err)
			done <- &output{err: err}
			return
		}
		if timeout != vlviapb.TimeoutType_NO_TIMEOUT {
			done <- &output{err: errors.New("expected no updates after timeout")}
			return
		}
		// New length of the kept prefix: len(words) - (W - shift).
		length := len(words) + int(update.Shift) - len(update.Words)
		if length < 0 || length > len(words) {
			err := fmt.Errorf("recv error: length out of range: text=%v, shift=%d, words=%v", words, update.Shift, update.Words)
			log.Println("recv error:", err)
			done <- &output{err: err}
			return
		}
		if verbose {
			fmt.Println(update.Shift, update.Words)
		}
		words = append(words[:length], update.Words...)
		timeout = update.GetTimeout()
		if timeout != vlviapb.TimeoutType_NO_TIMEOUT {
			done <- &output{timeout: timeout}
		}
	}
	done <- &output{recognition: strings.Join(words, " "), err: nil}
}
```
1.4 Example implementation
The GitLab repository contains a sample application written in Go that uses api/vlviapb.proto. To build it, you need at least version 1.12 of Go. The folder example/golang/pkg/vlviapb/ also contains the protocol buffer files generated from vlviapb.proto. The application can be built with the commands:
```
export GO111MODULE=on
go build
```
run from the example/golang folder. The application reads sound samples from standard input and sends them to the given address. The simplest ASR function test looks like this:
```
cat ../data/sample16kHz.wav | \
./example -addr demo.voicelab.pl:7722 \
-pid PID -pass PASSWORD \
-sample_rate 16000 \
-conf_name 16000_pl_PL
```
where PID and PASSWORD are access parameters (located in the example/golang/metadata_config.json file). Use the -c arg option to pass the protocol parameters contained in the metadata_config.json file. The parameters directly represent the metadata variables. If the corresponding command-line arguments are given, they override the values in the .json file (the exception is the contenttype parameter, which is always set by the application).
```
cat ../data/sample16kHz.wav | \
./example -addr demo.voicelab.pl:7722 \
-c metadata_config.json \
-sample_rate 16000 \
-conf_name 16000_pl_PL
```
Use the -verbose option to track all updates:
```
cat ../data/sample16kHz.wav | \
./example -addr demo.voicelab.pl:7722 \
-c metadata_config.json \
-sample_rate 16000 \
-conf_name 16000_pl_PL \
-verbose
```
To use the microphone for testing, do the following:
```
arecord -r 16000 -f S16_LE | \
./example -addr demo.voicelab.pl:7722 \
-c metadata_config.json \
-sample_rate 16000 \
-conf_name 16000_pl_PL \
-speech_complete_timeout=2s
```
For sound recognition over a set period of time:
```
arecord -r 8000 -f S16_LE | \
./example -addr demo.voicelab.pl:7722 \
-c metadata_config.json \
-conf_name 8000_pl_PL \
-sample_rate 8000 & \
sleep 5 && killall arecord
```
Communication with the ASR server using HTTP
See our GitLab repository for a working example of how to use our API. For a full description, read below.
1. The client connects to the ASR address /classify (https://demo.voicelab.ai/classify), which waits for connections.
2. When establishing the connection, set and send the appropriate information in the HTTP headers:
- Information about the audio format that will be streamed to the server. To this end, the client sets content-type. Supported arguments:
– audio/l16;rate=8000 – audio samples in 8 kHz PCM encoding;
– audio/l16;rate=16000 – audio samples in 16 kHz PCM encoding;
– audio/x-alaw-basic – A-law codec;
– audio/basic – mu-law codec;
– audio/flac – FLAC codec.
- Project identifier with which the client will connect: X-Voicelab-Pid: PID, where PID is the appropriate project number (in our case PID = 109).
- Project password with which the client will connect: X-Voicelab-Password: PASS, where PASS is the appropriate password (in our case PASS = fbcd6fbb37a10a6d44467918a67d6c54).
- Configuration name with which the client will connect: X-Voicelab-Conf-Name: CONF-NAME, where CONF-NAME is the appropriate configuration name:
– 8000_pl_PL or 16000_pl_PL for Polish (8 kHz or 16 kHz sample rate),
– 8000_en_US or 16000_en_US for English (8 kHz or 16 kHz sample rate),
– 8000_ru_RU or 16000_ru_RU for Russian (8 kHz or 16 kHz sample rate),
– 16000_de_DE for German (16 kHz sample rate),
– 8000_it_IT for Italian (8 kHz sample rate).
Before sending the actual audio data, send the Content-Length header (alternatively, "Transfer-Encoding: chunked" is also supported).
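To make this concrete, a minimal Go client might look like the sketch below. It is illustrative only: the file name sample16kHz.wav is a placeholder, and the PID, password, and configuration reuse the example values from above.
```
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Open the audio to be recognized (placeholder file name).
	f, err := os.Open("sample16kHz.wav")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	req, err := http.NewRequest(http.MethodPost, "https://demo.voicelab.ai/classify", f)
	if err != nil {
		log.Fatal(err)
	}
	// Headers from step 2.
	req.Header.Set("Content-Type", "audio/l16;rate=16000")
	req.Header.Set("X-Voicelab-Pid", "109")
	req.Header.Set("X-Voicelab-Password", "fbcd6fbb37a10a6d44467918a67d6c54")
	req.Header.Set("X-Voicelab-Conf-Name", "16000_pl_PL")
	// Set Content-Length explicitly; with an *os.File body Go would
	// otherwise fall back to "Transfer-Encoding: chunked", which the
	// server also supports.
	if fi, err := f.Stat(); err == nil {
		req.ContentLength = fi.Size()
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```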