Communication with the ASR / NLU server using gRPC

The protocol buffer description for the gRPC API is in the file:

api/vlviapb.proto

located in the GitLab repository.

1.1 Establishing communication

The following metadata is required to establish the connection:

• pid – project identifier.

• password – password.

• content-type – format of the streamed data. Supported values:

– audio/l16;rate=8000 – audio samples in 8kHz PCM encoding;

– audio/l16;rate=16000 – audio samples in 16kHz PCM encoding;

– audio/x-alaw-basic – A-law codec;

– audio/flac – FLAC codec;

– audio/basic – mu-law codec.

• conf-name – configuration name; selects the language and sample rate. It must
be compatible with the sample rate of the streamed data. Supported values:

– 8000_pl_PL – Polish with 8kHz sample rate;

– 16000_pl_PL – Polish with 16kHz sample rate;

– 8000_en_US – English with 8kHz sample rate;

– 16000_en_US – English with 16kHz sample rate;

– 8000_ru_RU – Russian with 8kHz sample rate;

– 16000_ru_RU – Russian with 16kHz sample rate;

– 16000_de_DE – German with 16kHz sample rate;

– 8000_it_IT – Italian with 8kHz sample rate.

• no-input-timeout – number of milliseconds of silence after which the server
ends the utterance if the user has not started speaking. A value of 0 disables
the timeout.

• speech-complete-timeout – number of milliseconds of silence after which the
server ends the utterance once the user has started speaking and then fallen
silent. A value of 0 disables the timeout.

Sound samples can then be sent in the format specified earlier by
content-type.
The metadata values used for configuration and authorization
(pid, password)
are provided in a separate file:

metadata_config.json

The sample application written in Go establishes communication as follows
(grpc.go file):


func grpcClient(addr string, r io.Reader, c *config, m map[string]string) (string, time.Duration, error) {
    // ... (connection setup omitted in this excerpt; conn is the gRPC client connection to addr)

    client := vlviapb.NewVLVIAClient(conn)
    m["contenttype"] = c.MimeType
    if c.Pid != "" {
        m["pid"] = c.Pid
    }
    if c.Password != "" {
        m["password"] = c.Password
    }
    if c.ConfName != "" {
        m["conf-name"] = c.ConfName
    }
    if c.NoInputTimeout != 0 {
        m["no-input-timeout"] = strconv.Itoa(int(c.NoInputTimeout / time.Millisecond))
    }
    if c.SpeechCompleteTimeout != 0 {
        m["speech-complete-timeout"] = strconv.Itoa(int(c.SpeechCompleteTimeout / time.Millisecond))
    }
    ctx := metadata.NewOutgoingContext(context.Background(), metadata.New(m))

    stream, err := client.RecognizeStream(ctx)
    if err != nil {
        return "", 0, err
    }
        

1.2 Completing communication

Communication continues until the user calls the termination method. The server sends recognition
updates on an ongoing basis. If no-input-timeout or speech-complete-timeout has not been set to zero,
check the value of the timeout field whenever you receive an update. As long as it is NO_TIMEOUT, you
can continue to send samples. An update with the NO_INPUT_TIMEOUT or SPEECH_COMPLETE_TIMEOUT timeout
value is the last update, and you should stop sending samples. After sending a timeout other than
NO_TIMEOUT, the server still accepts samples (so that the client program does not get a write error)
but no longer interprets them. After receiving such a timeout, stop sending samples and call the
CloseSend method on the stream; the server will then close the connection (the client program
will receive an EOF "error", which indicates that the connection with the server ended correctly).

Completion of communication relies on two cooperating goroutines: one sends audio data and the
other receives text data and timeout information. The receiving goroutine notifies the sending
goroutine of a timeout (or another end condition) through a channel called done.
The goroutine that sends audio data looks like this (grpc.go file):


    done := make(chan *output)
    go formatResponse(stream, done, c.Verbose)
    var out *output

    b := make([]byte, 500)
    frames := vlviapb.AudioFrames{}
For:
    for {
        n, err := r.Read(b)
        if err != nil {
            if err == io.EOF {
                break
            }
            return "", 0, err
        }
        frames.Frames = b[:n]
        if err := stream.Send(&frames); err != nil {
            out = <-done
            return "", 0, err
        }
        select {
        default:
        case o := <-done:
            if o.timeout != vlviapb.TimeoutType_NO_TIMEOUT {
                log.Println("timeout:", o.timeout)
            } else {
                out = o
            }
            break For
        }
    }
    if err := stream.CloseSend(); err != nil {
        return "", 0, err
    }
    t := time.Now()
    if out == nil {
        out = <-done
        if out.timeout != vlviapb.TimeoutType_NO_TIMEOUT {
            log.Println("timeout:", out.timeout)
            out = <-done
        }
    }
    return out.recognition, time.Since(t), out.err
        

1.3 Formatting the response

The server response consists of partial recognition updates. A single update consists of two parameters:

• shift – the change in the total length of the list of recognized words after
the update. It may be negative. Together with the number of words in the update, it determines
the new current recognition.
• words – the list of words recognized in this update.

These two parameters let you update the list of recognized words as follows:

  1. remove the last (W - shift) words from the current recognition, where W is
     the length of the words list in the update;
  2. append the W words from the update.

An example of building the recognition from successive updates:
     
    {"shift": 1, "words": ["a"], "result": ["a"]}
    {"shift": 0, "words": ["ala"], "result": ["ala"]}
    {"shift": 1, "words": ["pięknie"], "result": ["ala", "pięknie"]}
    {"shift": 2, "words": ["śpi", "je"], "result": ["ala", "pięknie", "śpi", "je"]}
    {"shift": -1, "words": ["śpiewa"], "result": ["ala", "pięknie", "śpiewa"]}
            
     Final recognition: ala pięknie śpiewa.
            

Response formatting in the sample application written in Go, in the goroutine that receives data from the
ASR server (grpc.go file):


func formatResponse(stream vlviapb.VLVIA_RecognizeStreamClient, done chan<- *output, verbose bool) {
    var words []string
    timeout := vlviapb.TimeoutType_NO_TIMEOUT
    for {
        update, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Println("recv error: ", err)
            done <- &output{err: err}
            return
        }
        if timeout != vlviapb.TimeoutType_NO_TIMEOUT {
            done <- &output{err: errors.New("expected no updates after timeout")}
            return
        }
        // Number of words kept from the current recognition: len(words) - (W - shift).
        length := len(words) + int(update.Shift) - len(update.Words)
        if length < 0 || length > len(words) {
            err := fmt.Errorf("length out of range: text=%v, shift=%d, words=%v", words, update.Shift, update.Words)
            log.Println("recv error:", err)
            done <- &output{err: err}
            return // without this, words[:length] below would panic
        }
        if verbose {
            fmt.Println(update.Shift, update.Words)
        }
        words = append(words[:length], update.Words...)
        timeout = update.GetTimeout()
        if timeout != vlviapb.TimeoutType_NO_TIMEOUT {
            done <- &output{timeout: timeout}
        }
    }
    done <- &output{recognition: strings.Join(words, " "), err: nil}
}
        

1.4 Example implementation

The GitLab repository contains a sample application written in Go that uses:

api/vlviapb.proto

Building it requires Go 1.12 or later. The folder:

example/golang/pkg/vlviapb/

also contains the protocol buffer files generated from
vlviapb.proto.
The application can be built using the command:


export GO111MODULE=on
go build  
        

from the folder example/golang.
The application reads sound samples from standard input and sends them to the given address.

The simplest test of the ASR function looks like this:


cat ../data/sample16kHz.wav | \
    ./example -addr demo.voicelab.pl:7722 \
    -pid PID -pass PASSWORD \
    -sample_rate 16000 \
    -conf_name 16000_pl_PL
        

where PID and PASSWORD are access parameters (located in the
example/golang/metadata_config.json file).

Use the -c option to pass the protocol parameters contained in the
metadata_config.json file. The parameters correspond directly to the metadata
variables. Values in the .json file are overridden by the application's
command-line arguments when those are given (the exception is the contenttype
parameter, which is always set by the application).
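
The precedence described above can be sketched as follows. This is an illustration only, under the assumption that metadata_config.json is a flat JSON object whose keys are the metadata names and whose values are strings; the actual file layout is not shown in this document, and mergeMetadata is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeMetadata starts from the values in the JSON file, lets non-empty
// command-line values override them, and finally sets contenttype, which the
// application always controls itself.
func mergeMetadata(fileJSON []byte, args map[string]string, contentType string) (map[string]string, error) {
	m := map[string]string{}
	// Assumes a flat object with string values; anything else is an error.
	if err := json.Unmarshal(fileJSON, &m); err != nil {
		return nil, fmt.Errorf("parsing metadata file: %v", err)
	}
	for key, value := range args {
		if value != "" { // an empty value means the argument was not given
			m[key] = value
		}
	}
	m["contenttype"] = contentType
	return m, nil
}

func main() {
	file := []byte(`{"pid": "PID", "password": "PASSWORD", "conf-name": "8000_pl_PL", "contenttype": "audio/basic"}`)
	args := map[string]string{"conf-name": "16000_pl_PL", "pid": ""}
	m, err := mergeMetadata(file, args, "audio/l16;rate=16000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m["pid"], m["conf-name"], m["contenttype"])
}
```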


cat ../data/sample16kHz.wav | \
    ./example -addr demo.voicelab.pl:7722 \
    -c metadata_config.json \
    -sample_rate 16000 \
    -conf_name 16000_pl_PL
        

Use the -verbose option to track all updates:


cat ../data/sample16kHz.wav | \
    ./example -addr demo.voicelab.pl:7722 \
    -c metadata_config.json \
    -sample_rate 16000 \
    -conf_name 16000_pl_PL \
    -verbose
        

​ To use the microphone for testing, do the following:


arecord -r 16000 -f S16_LE | \
    ./example -addr demo.voicelab.pl:7722 \
    -c metadata_config.json \
    -sample_rate 16000 \
    -conf_name 16000_pl_PL \
    -speech_complete_timeout=2s
        

To recognize sound for a fixed period of time:


arecord -r 8000 -f S16_LE | \
    ./example -addr demo.voicelab.pl:7722 \
    -c metadata_config.json \
    -conf_name 8000_pl_PL \
    -sample_rate 8000 & \
    sleep 5 && killall arecord