
Transform me!

What do you think of when you hear ‘Transformer’? Is it a robot? A component used in an electrical circuit? Or maybe a neural network? If you want to find out something about the last one, read on!

Transformer

Transformers are a family of neural networks used mainly in Natural Language Processing (NLP). They were introduced relatively recently – in 2017 by Vaswani et al., in the paper “Attention Is All You Need” (link). The fancy title indeed communicates the main idea of the architecture.

Previously, NLP relied mostly on recurrent neural networks (RNNs). Although they achieved satisfying results, they suffered from long computation times, as they step through a sequence word by word. Transformers come to the rescue: instead of walking through the whole sequence, they operate on the attention relationships between embeddings (numerical representations of words).

What actually is an attention mechanism?

Previous architectures were based on the encoder-decoder structure. The encoder had to create a single vector representing the input sentence, which was then passed as input to the decoder.

The attention mechanism instead passes all the encoder states to the decoder, so that the model can choose which information it considers most important.

In figure 1 you can see the idea of the attention mechanism in the translation process. Although the idea might seem simple, it was a breakthrough in the field of natural language processing.

Figure 1: The attention mechanism in the translation process

With a self-attention mechanism, the difference is that the choice of which part of the input is most important is made during the encoding process – the inputs interact with each other and score each other with “attention points”.

The Transformer architecture

As you can see in figure 2, the encoder consists of N identical layers, each containing two sub-layers. The first is a Multi-Head Attention block that learns the self-attention between words; the second is a fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by a normalization block. A minimal sketch of one such layer follows the figure.

Figure 2: The Transformer architecture
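
To make the structure concrete, here is a minimal sketch of one such encoder layer in plain PyTorch (standard library calls only, not the authors’ code; the dimensions are illustrative assumptions):

import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    # One encoder layer: Multi-Head Attention + feed-forward network,
    # each wrapped in a residual connection followed by layer normalization.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention: Q, K and V all come from x
        x = self.norm1(x + attn_out)          # residual connection + normalization
        x = self.norm2(x + self.ff(x))        # feed-forward sub-layer, same wrapping
        return x

x = torch.randn(1, 10, 512)                   # (batch, sequence length, embedding size)
print(EncoderLayerSketch()(x).shape)          # torch.Size([1, 10, 512])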

The decoder has a very similar structure; however, each layer additionally has a third sub-layer, a Multi-Head Attention over the output of the encoder.

The Multi-Head Attention mechanism operates on three vectors – Query, Key and Value. The query can be thought of as the current word (embedding), the value as the information it carries, and the key as a kind of index used to look up the values.
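
Their interaction boils down to scaled dot-product attention. Below is a minimal sketch of the standard formulation (not tied to any particular model) showing how queries, keys and values are combined:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    # "attention points": how strongly each query attends to each key
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    # weighted sum of values: the information each word gathers from the others
    return weights @ value

x = torch.randn(1, 5, 64)                     # 5 token embeddings of size 64
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V
print(out.shape)                              # torch.Size([1, 5, 64])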

Real-life applications

Transformers allow for text analysis from a different perspective – not as a sequence of words, but rather the gist of the message. It’s a huge advantage as normally the order of words in the sentence differs depending on the language, speaker, etc. Although Transformers are more efficient in terms of computation, they need a huge amount of data to be trained which is a common problem regarding machine-learning issues.

Fortunately, there are models that might be used as a base for transfer learning like e.g. HerBERT (link). This model was introduced in 2021 by polish scientists from Allegro.pl. It is based on the BERT – the most known model for NLP created by Google and adjusted to the difficult polish language. In the KLEJ classification, it managed to reach 88,4 out of 100 points which is the state of the art result regarding polish models.

To use the model, you can simply load it like this:


from transformers import AutoTokenizer, AutoModel
 
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")
 
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "Język – ukształtowany społecznie system budowania wypowiedzi, używany w procesie komunikacji.",
                "Język służy do przedstawiania rzeczywistości dotyczącej przedmiotów, czynności czy abstrakcyjnych pojęć za pomocą znaków.",
            )
        ],
    padding='longest',
    add_special_tokens=True,
    return_tensors='pt'
    )
)

In this case, you have to define a tokenizer – a tool that divides the input into separate tokens (words) – and the model that you want to use, i.e. herbert-base-cased. As input to the model, you pass the tokenized text.

Depending on what you want to do, you can use HerBERT with an appropriate Head Model for different purposes, e.g.:

Predicting a masked word in the sentence,

Predicting the next word,

Predicting the next sentence,

Answering questions

As an output, you get the desired word or sequence of words.
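
For example, predicting a masked word can be done with the generic fill-mask pipeline, assuming the checkpoint exposes a masked-language-modelling head (a hedged sketch; the example sentence is made up):

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="allegro/herbert-base-cased",
    tokenizer="allegro/herbert-base-cased",
)

# Insert the tokenizer's own mask token into a sentence and ask for candidates.
masked_sentence = f"Warszawa to {fill_mask.tokenizer.mask_token} Polski."
for candidate in fill_mask(masked_sentence, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))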


Vision Transformer

Computer Vision is one of the fields where machine learning is gaining the most interest. Many machine-learning solutions target vision problems; even signal analysis is sometimes transformed into an ‘image form’ and then processed by a network. If so, why not try the reverse and bring a language architecture to images?

The usage of Transformer in image recognition was introduced in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (link) in 2020 by Dosovitskiy et al.

Architecture

The main idea is to divide an image into smaller patches of the same shape. Imagine that each of these patches is a ‘word’. Next, we have to vectorize them, i.e. put all the values from one patch into a vector in a fixed order. Each vector x is then mapped to an embedding z with parameters W and b, which are parameters of the network learned in the training process. Additionally, information about positional encoding is added, as it is important to keep the correct order of the patches. Then the Multi-Head Self-Attention layer comes and the magic happens, just as in the original Transformer (a toy patch-extraction sketch is shown right below).
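
As a toy illustration of that input preparation (plain PyTorch, not the paper’s code), a 224×224 RGB image can be cut into 16×16 patches and each patch flattened into a vector:

import torch

img = torch.randn(1, 3, 224, 224)                               # (batch, channels, H, W)
patch = 16
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)
print(patches.shape)   # torch.Size([1, 196, 768]) – 196 "words", each a vector of 768 values

# Each 768-dimensional vector is then projected with the learned W and b,
# and a positional encoding is added before the Transformer encoder.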

As you can see in figure 3, the whole model does not vary much. It’s the input preparation that has to be different from that in the original Transformer.

Figure 3: The Vision Transformer architecture

To use a Vision Transformer model, you can use the timm library:


import timm
import PIL
import torchvision.transforms as T
 
model_name = "vit_base_patch16_224"
model = timm.create_model(model_name, pretrained=True)
 
IMG_SIZE = (224, 224)
NORMALIZE_MEAN = (0.5, 0.5, 0.5)
NORMALIZE_STD = (0.5, 0.5, 0.5)
transforms = T.Compose([
              T.Resize(IMG_SIZE),
              T.ToTensor(),
              T.Normalize(NORMALIZE_MEAN, NORMALIZE_STD),
              ])
 
img = PIL.Image.open('samochod.jpg')
img_tensor = transforms(img).unsqueeze(0)
 
output = model(img_tensor)

You define a model from the library – it’s already pre-trained and ready to use. As input, you provide the image as a tensor, a common type of data used in computer vision. As output, you get scores over the classes, from which you can read the class the object belongs to and a confidence score (see the short snippet below). Conveniently, the input preprocessing boils down to a few standard transforms – you don’t have to do much of it manually 😊
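
For completeness, a short follow-up (not part of the original snippet): the model returns raw scores (logits) over the ImageNet classes, so the predicted class index and its confidence can be read out like this:

import torch

probabilities = torch.softmax(output, dim=-1)    # output comes from the snippet above
confidence, class_idx = probabilities.max(dim=-1)
print(class_idx.item(), round(confidence.item(), 3))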

Transformer vs Vision Transformer

Both the Transformer and the Vision Transformer are based on the same idea – attention. While the model engine is the same in both of them, a distinct difference is visible in the input and output. The standard way of feeding text into the model is tokenization; an image has to be transformed into a similar form – cut into small pieces that are put in order. The output also differs, because the aim of the task is different. With text, we often want to “explore the unknown” – produce something we do not know yet – while in image processing we rather want to analyze what is already visible.

Most of the commonly used models in NLP are based on the Transformer. They are mostly used in bots as a form of communication with the customer.

When it comes to Vision Transformers, they are gaining popularity alongside other strong models. They can be used in image classification, detection or segmentation, for example in medical image analysis.

Use it or not?

Like every model, Transformers have their pros and cons. Training a valuable transformer from scratch is extremely hard – huge amounts of data and computational power are needed. However, using already existing ones gives an opportunity to benefit from their ability to “read the context”. For many problems, context matters more than details.

To sum up – Attention is all you need! 😊

Author: Maria Ferlin, Politechnika Gdańska


COVID-19 – the source of heated discussion

MOTIVATION

Without a doubt, COVID-19 has had a great impact on today’s world. It has changed nearly all aspects of daily life and has become part and parcel of social discussion. Due to many COVID-19 restrictions, the majority of conversations have moved to social media, especially to Twitter, one of the biggest social media platforms in the world. Since the opinions posted there highly influence society’s attitude towards COVID-19, it is important to understand Twitter users’ perspectives on the global pandemic. Unfortunately, the number of tweets posted on the Internet is enormous, so an analysis performed by humans is impossible. Thankfully, one of the fundamental tasks in Natural Language Processing – sentiment analysis – can be performed for this purpose, classifying text data into, e.g., three categories (negative, neutral, positive).

Today, we would like to share with you some key insights on how we performed such sentiment analysis based on scraped Polish tweets related to COVID-19.

TWITTER SCRAPING

In the beginning, we had to scrape the relevant tweets from Twitter.

Since we were particularly interested in analyzing the pandemic situation in Poland, we downloaded tweets which:

were written in Polish,

were posted from the beginning of the pandemic (March 2020) until the beginning of the holidays (July 2021),

contained at least one word from an empirically created list of COVID-19-related keywords (such as “covid” or “vaccination”, etc.)

With this approach, we successfully managed to scrape more than 2 million tweets with their contents and a variety of metadata, such as the fields listed below (an illustrative scraping sketch follows the list):

the number of likes under a given tweet,

the number of replies under a given tweet,

the number of shares of a given tweet, etc.
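
The article does not name the scraping tool, so treat the following only as a rough illustration of such a query, here written with the snscrape library; the keyword list, the cap and the stored fields are assumptions:

import snscrape.modules.twitter as sntwitter

# Hypothetical query mirroring the criteria above: Polish tweets, pandemic time
# range, at least one COVID-related keyword.
query = "(covid OR koronawirus OR szczepienie) lang:pl since:2020-03-01 until:2021-07-31"

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 1000:                       # small cap, just for the illustration
        break
    tweets.append({
        "content": tweet.content,
        "likes": tweet.likeCount,
        "replies": tweet.replyCount,
        "retweets": tweet.retweetCount,
    })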

TRAINING DATASET

However, for the purpose of training the model, we also had to obtain data with previously labeled sentiment (negative, neutral, positive).

Unfortunately, there is currently no publicly available dataset containing Polish tweets about COVID-19. That is why we had to extend a publicly available Twitter dataset from the CLARIN.SI repository with 100 manually labeled COVID-19 tweets, in order to train the model on a sample of domain-specific texts.

DATA PREPROCESSING

Before starting the training process, a data preprocessing step was performed. It was especially important in our case, since Twitter data contains a lot of messy text such as reserved Twitter words (e.g. RT for retweets) and user mentions. With the use of the tweet-preprocessor library, we removed all of these unnecessary text parts that could become noise for the model.

Moreover, we replaced links with the special token $URL$ so that each link is perceived by the model as the same one. This was particularly important since, in a later phase, we noticed that the information about link presence is valuable for the model predictions.

At the end of the preprocessing stage, we decided to replace emoticons with their literal meaning (e.g. “:)” was changed to “happy” and “:(” was replaced by “sad”).
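
A rough sketch of that kind of cleaning (the exact options and emoticon map are assumptions, not the authors’ code) could look like this, using the tweet-preprocessor library:

import re
import preprocessor as p   # the tweet-preprocessor package

p.set_options(p.OPT.RESERVED, p.OPT.MENTION)   # drop reserved words (e.g. RT) and @mentions

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "$URL$", text)             # every link becomes the same token
    text = p.clean(text)
    text = text.replace(":)", "happy").replace(":(", "sad")   # emoticons -> their meaning
    return text

print(clean_tweet("RT @user Zostań w domu! :) https://example.com"))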


MODEL ARCHITECTURE

Having cleaned the data, we were able to build a model. We decided to base it on HerBERT (the Polish version of BERT), since it achieves state-of-the-art results in the area of text classification. We followed it with two layers of a bidirectional gated recurrent unit (GRU) and a fully connected layer, as sketched below.
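
A minimal sketch of such a classification head (our own illustration; the exact hyperparameters and pooling are assumptions) could look like this:

import torch.nn as nn
from transformers import AutoModel

class HerbertSentimentSketch(nn.Module):
    # HerBERT encoder followed by a 2-layer bidirectional GRU and a linear classifier.
    def __init__(self, n_classes=3, hidden_dim=1024):
        super().__init__()
        self.herbert = AutoModel.from_pretrained("allegro/herbert-base-cased")
        self.gru = nn.GRU(self.herbert.config.hidden_size, hidden_dim,
                          num_layers=2, bidirectional=True,
                          dropout=0.5, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, input_ids, attention_mask):
        token_states = self.herbert(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        gru_out, _ = self.gru(token_states)
        return self.classifier(gru_out[:, -1, :])   # logits for negative / neutral / positive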

The BPE-Dropout tokenizer from HerBERT (which converts text into tokens before the input is passed to HerBERT) was extended with additional COVID-19 tokens, so that it recognizes (rather than splits apart) the basic coronavirus words, as illustrated below.
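
With the Hugging Face API, adding such domain tokens can be done roughly like this (the token list here is a made-up example, not the authors’ list):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# New words are kept whole instead of being split into sub-word pieces.
tokenizer.add_tokens(["covid", "koronawirus", "lockdown"])
model.resize_token_embeddings(len(tokenizer))   # make room for the new token embeddings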

MODEL PERFORMANCE

Metric      Score
F1-Score    70.51%
Accuracy    72.14%
Precision   70.60%
Recall      70.44%

The model achieves relatively good results, even though Twitter data is often ironic and considered ambiguous when it comes to the sentiment label itself.

DATA ANALYSIS

With the trained model, the sentiment of the previously scraped COVID-19 tweets was predicted. This allowed us to perform a more detailed analysis, based not only on the collected metadata (e.g. likes, replies, retweets) but also on the predicted sentiment.

When the weekly number of tweets for each sentiment is visualized over time, some negative “peaks” can be observed.

In the majority of cases, they are strictly correlated with important events/decisions in Poland such as:

Postponement of the presidential election

Presidential campaign

The beginning of the second wave of COVID-19

The return of hard lockdown restrictions

The beginning of vaccination in Poland

The beginning of the third wave of COVID-19

The daily record of infections (over 35,000 new cases)

When we decided to dive deeper into the content of negative Tweets, these were the most frequently used negative expressions:

Undoubtedly, they are slightly biased towards the keywords we used to scrape the tweets, but overall they give a good sense of the topics covered in negative tweets, i.e. politics and popular tags used in fake-news activity.

Moreover, we were interested in analyzing which accounts are most frequently tagged in these negative tweets. It turned out that:

politicians are most frequently blamed for their pandemic decisions

the accounts of the most popular television programs in Poland are a big part of COVID-19 discussions

Last but not least, we decided to analyze the Twitter accounts that influence society the most with their negative tweets (those that gain relatively the most likes/replies/retweets). Unsurprisingly, politicians from the Polish opposition are at the top of this list.

SUMMARY

To sum up, sentiment analysis accompanied by data analysis can give us interesting insight into the studied community’s characteristics. It brings a lot of benefits and allows us to utilize huge amounts of text data effectively.

Authors: Michał Laskowski, Filip Żarnecki

Explainable Artificial Intelligence (XAI) in Sentiment Analysis

A way to grasp what your NLP model is trying to communicate.

The problem

Sentiment analysis is basically the recognition of an utterance’s emotional tone – in our case, the AI solution tells us whether a given statement is negative, neutral or positive:

We have already described our approach, data used, processing of data and model architecture in more detail in a previous article. In short, we use a solution based on HerBERT and train it using labeled Twitter data.

We have decided to make use of XAI (explainable AI) in order to better understand our model’s decision-making. We wanted to see which features influence the final verdict the most and whether we can identify some room for improvement.

Explainable AI – what and why?

What is XAI?

XAI (eXplainable AI) is an AI concept that enables us, humans, to understand predicted results better. We open the black box, you could say.

Most of us have heard the term ‘black box’ when referring to deep learning models. It is defined as a complex system or device whose internal workings are hidden or not readily understood. The term originates from the fact that usually even model developers cannot say why the AI has come to a specific conclusion.

Why XAI?

On the one hand, some might say it is beneficial to develop deep, multilayered models without needing to understand the whole decision process, and they are not wrong. It takes away worries and makes the process somewhat convenient. However, how can we be so sure that our prediction is based on good premises?

Reliability

Even if we achieve high accuracy scores on the test set, this might not reflect real-world inference accuracy. The results will also be biased if the dataset is biased, and high performance scores can mislead us into believing that the system is just fine. A model’s robustness is especially important in domains like medicine, finance, or law, where it is crucial to understand the reasoning behind a decision whose consequences might be fatal.

It can be highly beneficial to dig deeper into the unknown of the ‘black box’ mentioned above. The inability to fully explain AI decision-making has been quite troublesome over the years.

Trust

There is a phenomenon that can, in short, be described as algorithm aversion – a situation where people lack trust in an algorithmic model’s decisions, even when they know that it outperforms human beings.

People fear what they do not understand, and this is usually the case with black-box models. The inability to grasp why the AI model has decided one thing instead of another causes insecurity and doubt. In turn, that leads to a lack of trust and to rejecting decisions that could be better than our own judgment.

It is logical to assume that people would be more likely to listen to AI if they better understood the decision-making process.

How does it work?

We can distinguish two main types of approaches to explaining the model predictions:

building a simpler, transparent model

using a black-box model and explaining it afterwards

Transparent ML models, such as simple decision trees, Bayesian models or linear regression, even though easily interpretable, have often proven to be outperformed by their black-box friends. Therefore, since we want the best performance possible, we need an additional post-hoc layer on top of our solution.

Post-Hoc Approach

Post hoc means ‘after the event’; in our case, it means an explanation produced after making a prediction (or a series of predictions).

Explanations can be Model Agnostic (they work with any model) or Model Specific. The one that interests us the most is the Model Agnostic kind, since it will explain any model regardless of the architecture.

We also want to focus on the idea of Feature Importance – measuring how an individual feature (input variable) contributes to the predicted output.

Such an approach very neatly suits the Natural Language Processing (NLP) domain.

XAI in NLP

In NLP classification problems, we are often supposed to analyze a sequence of words provided as an input and get the probability of each class affiliation as an output.

Basically, when explaining a single prediction, we want to measure how each element of an input contributes to each class probability. It’s pretty straightforward. The difficulty now is how to do it.

Over the past years, many algorithms have been developed. Here, we chose two Explainers based on their simplicity and understandability: LIME and SHAP. The solutions we use are rather intuitive and easy to grasp.

LIME – Local Interpretable Model-Agnostic Explanations

LIME is one of the first perturbation-based approaches to explaining a single prediction. As we can read on the project’s GitHub page:

Lime can explain any black-box classifier with two or more classes. All we require is that the classifier implements a function that takes in raw text or a NumPy array and outputs a probability for each class.

As a result, we can apply LIME to any model that takes in raw text or a NumPy array and provides a probability for each class as an output.

But what does it do exactly?

The main idea is that it is easier to approximate the whole model with a simpler local model. How can that be achieved?

We perturb the input sequence (e.g. we hide one word of the sentence, marking it with a MASK token) and check how the predictions differ. The perturbed outputs are then weighted by their similarity to the example being explained.

Intuitively, an explanation is a local linear approximation of the model’s behavior. While the model may be very complex globally, it is easier to approximate it around the vicinity of a particular instance. While treating the model as a black box, we perturb the sample we want to explain and learn a sparse linear model around it as an explanation. The figure below illustrates the intuition for this procedure. The model’s decision function is represented by the blue/pink background and is clearly nonlinear. The bright red cross is the instance being explained (let’s call it X). We sample instances around X and weigh them according to their proximity to X (weight here is indicated by size). We then learn a linear model (dashed line) that approximates the model well in the vicinity of X, but not necessarily globally.

Since the input is in the form of understandable words, we can then easily interpret the provided results. On the output, we receive a list of all features and their contribution to each of the classes.

Exemplary visualisation:

We can see in the image above how different features (words) contributed to each class. E.g. ‘boli’ (“it hurts”) caused the probability of the ‘negative’ class to rise and lowered the likelihood of the ‘neutral’ class.

SHAP – SHapley Additive exPlanations

This algorithm is somewhat similar to LIME, as it also perturbs the input to produce an explanation – however, in a slightly different way.

If you would like to understand all the details behind SHAP methodology, here is a great article that talks about the subject in a really understandable manner.

In short, SHAP calculates the contribution that each feature brings to the model prediction. The contribution is based on the idea that, in order to determine the importance of a single feature, all possible combinations of features should be considered (a power set in maths). Therefore, SHAP approximates all the possible models from the provided dataset (i.e., the input we have provided in a given explanation). Having all possible combinations, all of the marginal contributions are taken into account for each feature.

The marginal contribution of a certain feature is measured by comparing two almost identical scenarios whose only difference is the presence (or absence) of the feature of interest.

All the marginal contributions are then aggregated through a weighted average, and the outcome represents the contribution of a given feature to a given class. The summary can then be visualised, e.g. with a force plot, such as the one below:

Here, the features that increase the probability for a given class are highlighted with red and those that lower it with blue. The base value is the average model output over the training dataset we have passed.
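
To make the “weighted average of marginal contributions” concrete, here is a toy, brute-force computation of Shapley values for a made-up three-token ‘model’ (purely illustrative, not the SHAP library itself; tokens and effects are assumptions):

from itertools import combinations
from math import factorial

features = ["boli", "bardzo", "dzisiaj"]                      # hypothetical tokens
effects = {"boli": -0.6, "bardzo": -0.2, "dzisiaj": 0.1}      # assumed per-token effects

def model_output(subset):
    # A toy additive model: base value plus the effect of each present token.
    return 0.5 + sum(effects[f] for f in subset)

def shapley(feature):
    others = [f for f in features if f != feature]
    n = len(features)
    value = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            marginal = model_output(set(subset) | {feature}) - model_output(set(subset))
            value += weight * marginal                        # weighted marginal contribution
    return value

for f in features:
    print(f, round(shapley(f), 3))   # for this additive toy model, equal to the raw effects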


LIME and SHAP in Sentiment Analysis

Having selected and explained our algorithms, we can see how they perform in practice – in our case, on the analysis of the emotional tone of Polish utterances.

To recap, we are using a model based on HerBERT (a Polish BERT), and we have distinguished three classes – negative, neutral, and positive.

Exemplary utterance to analyse:

Jutro pierwszy dzień w mojej nowej pracy. Cieszyłabym się, ale jestem bardzo zestresowana i spięta (“Tomorrow is the first day at my new job. I would be happy, but I am very stressed and tense”)

True sentiment: negative
Predicted sentiment: negative

As we can see, the model has correctly predicted negative sentiment within the provided sentence, even though the example is a little tricky. Now, let’s move on to the explaining process and check which features influenced this decision.

LIME

First, let LIME show what it can do. We load our model from a checkpoint:

device = 'cpu'
model = HerbertSentiment.load_from_checkpoint(
    model_path,
    test_dataloader=None,
    output_size=3,
    hidden_dim=1024,
    n_layers=2,
    bidirectional=True,
    dropout=0.5,
    herbert_training=True,
    lr=1e-5,
    training_step_size=2,
    gamma=0.9,
    device=device,
    logger=None,
    explaining=True)

Then we set up LIME:

from lime.lime_text import LimeTextExplainer
from transformers import AutoTokenizer

tokenizer_path = "allegro/herbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
target_names = ['negative', 'neutral', 'positive']

explainer = LimeTextExplainer(class_names=target_names)

Finally, we create the explanation…

input = "Jutro pierwszy dzień w mojej nowej pracy. Cieszyłabym się, ale jestem bardzo zestresowana i spięta"

explanation = explainer.explain_instance(input, model.forward, num_features=6, labels=[0, 1, 2])

… and print the results:

print('Explanation for class %s' % target_names[0])
print('\n'.join(map(str, explanation.as_list(label=0))))
print()
print('Explanation for class %s' % target_names[1])
print('\n'.join(map(str, explanation.as_list(label=1))))
print()
print('Explanation for class %s' % target_names[2])
print('\n'.join(map(str, explanation.as_list(label=2))))

The numbers in the printout show how a certain word influenced each of the classes: if a number is >0, it increased the probability of that class; on the contrary, if it is <0, it decreased it.

explanation.show_in_notebook()

In the dedicated LIME visualisation above, it is a bit easier to interpret the results. We can quickly tell which tokens were the most influential for which class. For example, the word ‘zestresowana’ (“stressed”) significantly impacted all categories, greatly increasing the chance of the negative class and reducing both the neutral and positive ones.

As you can see, it can provide some helpful insight.

Now, let’s compare all this to SHAP…

SHAP

We load the model in the same manner:

device = 'cpu'
model = HerbertSentiment.load_from_checkpoint(
    model_path,
    test_dataloader=None,
    output_size=3,
    hidden_dim=1024,
    n_layers=2,
    bidirectional=True,
    dropout=0.5,
    herbert_training=True,
    lr=1e-5,
    training_step_size=2,
    gamma=0.9,
    device=device,
    logger=None,
    explaining=True)

Then we set up SHAP:

import shap
from transformers import AutoTokenizer

tokenizer_path = "allegro/herbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
target_names = ['negative', 'neutral', 'positive']

explainer = shap.Explainer(model, tokenizer)

We prepare methods for visualizing the results:

import numpy as np

def print_explanation(shap_values):
    values = shap_values.values
    data = shap_values.data
    for word, shap_value in zip(data[0], values[0]):
        print(word, shap_value, '---', target_names[np.argmax(shap_value)])

def print_contribution(shap_values):
    print('\n')
    print("Contribution to negative class")
    shap.plots.text(shap_values[0, :, 0])
    print('\n')
    print("Contribution to neutral class")
    shap.plots.text(shap_values[0, :, 1])
    print('\n')
    print("Contribution to positive class")
    shap.plots.text(shap_values[0, :, 2])

Finally, we make the explanation…

input_to_explain = ["Jutro pierwszy dzień w mojej nowej pracy. Cieszyłabym się, ale jestem bardzo zestresowana i spięta"]

shap_values = explainer(input_to_explain)

… and print the results:

print_explanation(shap_values)

We can see numbers that show how the sentence tokens contribute to each class, much as in LIME. Additionally, the dominating sentiment is shown on the right.

You may notice that some words are divided and analyzed strangely, since SHAP uses the HerBERT tokenizer for the explanation process. The tokenizer analyses the input sequence and looks up each token (word) in its dictionary. If a particular word does not exist there, it is divided into parts that do. E.g. ‘Uwielbiam’ (“I love”) will be divided into ‘Uwielbi’ and ‘am’. This process helps to ignore possible inflection noise resulting from Polish grammar rules (the male variant of a word differs from the female variant of the same word).
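
As an aside, you can inspect this sub-word splitting directly by calling the tokenizer (a quick, hypothetical check; the exact split depends on the trained vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
print(tokenizer.tokenize("Uwielbiam"))   # e.g. something like ['Uwielbi', 'am</w>']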

print_contribution(shap_values)

In this more colorful visualization, the same data is presented in a more comprehensible manner. There is a force plot for each of the classes, along with all of the contributing words. Those marked in red increased the probability of a given class, and those in blue decreased it. We can easily see which tokens were the most influential, both on the force plot and in the highlights in the sentence below the plot.

For instance, we observe that the words (or their parts) ‘zestresowana i spięta’ (“stressed and tense”) boost the negative sentiment quite a bit while lowering the positive one significantly.

Proper interpretation of such visualisations can give us a good idea of how our model thinks and of what influences the result the most.

Summing up

As you can see, XAI provides us with some interesting and valuable insight into our model. The explanation process is not as complex as it might seem, and it is worth the burden considering the benefits. Proper interpretation of the results, including the understandable visualizations, can give us a good idea of how our model ‘thinks’ and what influences the predictions the most.

If your machine learning problem lies in one of the domains where trust in the algorithm is crucial and where algorithm aversion might cause you a headache, explainable AI comes to the rescue. Keep this in mind the next time you want to dig deeper into your algorithm.

Authors: Filip Żarnecki, Michał Laskowski