Taking HuggingFace DistilBERT for a ride

Image from https://www.pexels.com/@kindelmedia/

I have never dealt with BERT (Bidirectional Encoder Representations from Transformers, a transformer-based machine learning technique for natural language processing), so I am trying it out. This simple experiment is easy for two reasons:

  1. The Hugging Face ecosystem makes it easy to get started (see the short pipeline sketch below).
  2. There are already pre-trained models that I can use, so I do not need to do any ML training myself.
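In fact, for point 1, the ecosystem offers a pipeline API that reduces the whole experiment to a few lines. Here is a minimal sketch, using the same question-answering checkpoint as the full code below:

from transformers import pipeline

# Build a ready-made question-answering pipeline from a pre-trained checkpoint.
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

result = qa(question="What is the dosage?", context="Ibuprofen is given with 100mg for 2 times a day.")
print(result["answer"], result["score"])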
Below is the full code, working with the tokenizer and model classes directly:

from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

# The tokenizer vocabulary is shared with the base checkpoint, so it
# also works for the SQuAD-fine-tuned model loaded below.
tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased", return_token_type_ids=True
)
# return_dict=False makes the model return a plain (start_logits, end_logits) tuple.
model = DistilBertForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased-distilled-squad", return_dict=False
)

print('Enter your statement:')
context = input()
print()

print('Enter your question:')
question = input()

while question:
    # Encode the question and the context together into one input sequence.
    encoding = tokenizer.encode_plus(question, context)

    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    # No gradients are needed for inference.
    with torch.no_grad():
        start_scores, end_scores = model(
            torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask])
        )

    # The answer span runs from the highest-scoring start token
    # to the highest-scoring end token (inclusive).
    ans_tokens = input_ids[torch.argmax(start_scores) : torch.argmax(end_scores) + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens, skip_special_tokens=True)

    answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)

    print("\nAnswer : ", answer_tokens_to_string)

    print('Enter your question:')
    question = input()

There are two Python packages that we need:

transformers==4.25.1
torch==1.13.1
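They can be installed with pip:

pip install transformers==4.25.1 torch==1.13.1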

We are using the distilbert-base-uncased-distilled-squad model. This is a generic model; there are many domain-specific BERT models that can give better results for a particular domain, such as FinBERT (financial services corpus), BioBERT (biomedical literature corpus), and ClinicalBERT (clinical notes corpus).
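Swapping in such a model is only a matter of changing the checkpoint name, provided the checkpoint has been fine-tuned for question answering. A minimal sketch, where the checkpoint name is a hypothetical placeholder rather than a real Hub model:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Hypothetical checkpoint name for illustration; substitute any
# QA-fine-tuned domain-specific model from the Hugging Face Hub.
checkpoint = "some-org/clinicalbert-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint, return_dict=False)

The rest of the interactive loop stays exactly the same.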

When we run the Python code above, we are prompted for a statement and can then ask questions against it. Here is an example:

Enter your statement:
Ibuprofen is given with 100mg for 2 times a day.

And here is the series of questions that I tried.

Enter your question:
Is Ibuprofen given?      

Answer :  100mg for 2 times a day

Enter your question:
Is Pembrolizumab given?

Answer :  ibuprofen is given with 100mg for 2 times a day

Enter your question:
is Ibuprofn given?             

Answer :  100mg for 2 times a day

Enter your question:
What is the dosage of Ibuprofen given?

Answer :  100mg

Enter your question:
How many times is Ibuprofen given a day?

Answer :  2

Enter your question:
How many times is Ibuprofn given a year?   

Answer :  2

Observations

When I asked "Is Ibuprofen given?", I wanted a Yes or No answer, but I got "100mg for 2 times a day".

When I asked "Is Pembrolizumab given?", I wanted a Yes or No answer, but I got "ibuprofen is given with 100mg for 2 times a day", even though Pembrolizumab is never mentioned in the statement.

When I had a typo in my question, "is Ibuprofn given?", the model was able to deal with it.

When I asked more specific questions like "What is the dosage of Ibuprofen given?" and "How many times is Ibuprofen given a day?", I got very good answers.

When I asked "How many times is Ibuprofn given a year?", I got the wrong answer. I guess it got confused between "a day" and "a year".
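These misses make sense once you remember that DistilBERT fine-tuned on SQuAD is an extractive model: it always returns a span from the statement, even when the honest answer would be "No" or "It is not mentioned". One way to catch such cases is to check how confident the model is in its span. A minimal sketch, reusing the start_scores and end_scores tensors from the loop above:

# Softmax turns the raw logits into probabilities over token positions.
start_probs = torch.softmax(start_scores, dim=1)
end_probs = torch.softmax(end_scores, dim=1)

# A low combined probability suggests the model found no good span,
# e.g. for questions about drugs that never appear in the statement.
confidence = (start_probs.max() * end_probs.max()).item()
print(f"Span confidence: {confidence:.3f}")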

Conclusion

My first impression of BERT is good. With a tiny code snippet, I can already get reasonably good answers.

