Hello world Transformers 🤗¶
In this notebook we will explore the basics of the Hugging Face library by using a pre-trained model to classify text.
⚠️ Do not forget to install the `transformers` library to run this notebook.
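If it is not installed yet, it can be installed directly from a notebook cell:
# Install the Hugging Face Transformers library (run once per environment)
!pip install transformers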
Quick overview of Transformer applications¶
Let's start by defining a text that we will use to test the model.
For testing purposes, we will use a text that is a complaint about a product. You can generate your own text or change the text to test the model with different inputs 🤓
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""
Text Classification¶
📚 Question 1: Understanding Pipelines¶
Before we start using the models, let's understand what we're working with:
- What is a `pipeline` in Hugging Face Transformers? What does it abstract away from the user?
- Visit the pipeline documentation and list at least 3 other tasks (besides text-classification) that are available.
- What happens when you don't specify a model in the pipeline? How can you specify a specific model?
💡 Hint: Check the official documentation to answer these questions!
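To illustrate the last question: when no model is given, the pipeline falls back to a task default, but you can pin a specific checkpoint yourself. A minimal sketch, reusing the SST-2 checkpoint named in the warning further below:
from transformers import pipeline

# Pin an explicit checkpoint instead of relying on the task default
explicit_classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
print(explicit_classifier("I love this library!"))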
The first thing we will do is classify the text into two categories: positive or negative.
To do this, we will use a pre-trained model from the Hugging Face library.
We will use the pipeline function to load the model and the text-classification task.
See the documentation for more details: https://huggingface.co/docs/transformers/main/en/pipeline_tutorial
from transformers import pipeline
classifier = pipeline("text-classification")
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
📚 Question 2: Text Classification Deep Dive¶
Now that you've seen text classification in action, explore further:
- What is the default model used for text-classification? Look at the output above to find its name, then search for it on the Hugging Face Model Hub.
- What dataset was this model fine-tuned on? What kind of text does it work best with?
- The output includes a `score` field. What does this score represent? What range of values can it have?
- Challenge: Find a different text-classification model on the Hub that classifies emotions (not just positive/negative). What is its name?
💡 Click on the model card in the Hub to see detailed information about training data and performance!
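One way to demystify the `score` field: recent versions of transformers let the classifier return the score for every label rather than only the top one via `top_k=None`. For this binary SST-2 model the scores are softmax probabilities and sum to 1:
# Return the score for every label, not just the best one
all_scores = classifier(text, top_k=None)
print(all_scores)  # e.g. NEGATIVE ≈ 0.90 and POSITIVE ≈ 0.10 for our complaint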
import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)
|   | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |
Named Entity Recognition¶
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.879009 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.556567 | Mega | 208 | 212 |
| 4 | PER | 0.590258 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775361 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |
📚 Question 3: Named Entity Recognition (NER)¶
Let's understand NER better:
- What does the `aggregation_strategy="simple"` parameter do in the NER pipeline? Check the token classification documentation.
- Looking at the output above, what do the entity types mean? (ORG, MISC, LOC, PER)
- Why do some words appear with a `##` prefix (like `##tron` and `##icons`)? What does this indicate about tokenization?
- The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?
- Challenge: Find the model card for `dbmdz/bert-large-cased-finetuned-conll03-english`. What is the CoNLL-2003 dataset?
🤔 How might the choice of tokenizer affect NER performance?
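The `##` prefixes in the table come from the tokenizer rather than from the NER head itself. A small sketch to see the WordPiece splits directly, loading the tokenizer of the same default checkpoint:
from transformers import AutoTokenizer

# Inspect how the NER model's tokenizer splits rare names into subwords
ner_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
print(ner_tokenizer.tokenize("Megatron and the Decepticons"))
# '##' marks a continuation piece, e.g. 'Mega', '##tron' — the same splits visible above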
Question Answering¶
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad). Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |
📚 Question 4: Question Answering Systems¶
Explore how question answering works:
- What type of question answering is this? (Extractive vs. Generative) Check the question answering documentation.
- The model outputs `start` and `end` indices. What do these represent? Why are they important?
- What is the SQuAD dataset? (Look up the model `distilbert-base-cased-distilled-squad` on the Hub)
- Try to think of a question this model CANNOT answer based on the text. Why would it fail?
- Challenge: What's the difference between extractive and generative question answering? Find an example of a generative QA model on the Hub.
💡 Try asking questions that require reasoning or information not in the text. What happens?
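Since this pipeline does extractive QA, `start` and `end` are character offsets into the context, so the answer can be recovered by slicing the original text:
# The answer is a literal span of the context: slicing with start/end reproduces it
span = text[outputs["start"]:outputs["end"]]
print(span)               # 'an exchange of Megatron'
print(outputs["answer"])  # identical to the slice above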
Summarization¶
📚 Question 5: Text Summarization¶
Before running the summarization code, let's understand how it works:
What is the difference between extractive and abstractive summarization? Check the summarization documentation.
Looking at the code in the next cell, what is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:
- Is it an extractive or abstractive model?
- What architecture does it use? (Hint: look at the model name)
- What dataset was it trained on?
What do the `max_length` and `min_length` parameters control? What happens if `min_length` > `max_length`?
The parameter `clean_up_tokenization_spaces=True` is used. What does this parameter do? Why might it be useful for summarization?
Challenge: Find two different summarization models on the Hub:
- One optimized for short texts (like news articles)
- One that can handle longer documents
Compare their architectures and training data.
💡 Why might summarization be more challenging than text classification? What linguistic capabilities does the model need?
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6). Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Your min_length=56 must be inferior than your max_length=45.
UserWarning: Unfeasible length constraints: `min_length` (56) is larger than the maximum possible length (45). Generation will stop at the defined maximum length. You should decrease the minimum length and/or increase the maximum length.
Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.
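The length warning shown above is exactly the situation Question 5 asks about: the checkpoint's default `min_length` (56) is larger than the requested `max_length=45`. Passing both bounds explicitly avoids the conflict; the values below are only illustrative:
# Keep min_length below max_length so the length constraints are feasible
outputs = summarizer(text, min_length=20, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]["summary_text"])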
Translation¶
📚 Question 6: Machine Translation¶
Let's explore how translation models work:
- What is the architecture behind the `Helsinki-NLP/opus-mt-en-de` model? Look it up on the Model Hub.
  - What does "OPUS" stand for?
  - What does "MT" stand for?
- How would you find a model to translate from English to French? Visit the translation documentation and the Model Hub to find at least 2 different models.
- What is the difference between bilingual and multilingual translation models? What are the advantages and disadvantages of each?
- In the code, we specify the task as `"translation_en_to_de"`. How does this relate to the model we're loading?
- The output shows a warning about `sacremoses`. What is this library used for in NLP? Check the MarianMT documentation.
- Challenge: Find a multilingual model (like mBART or M2M100) that can translate between multiple language pairs. How many language pairs does it support?
🌍 What challenges exist for low-resource languages?
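As a sketch for the English-to-French question: the Marian checkpoints follow a `Helsinki-NLP/opus-mt-<src>-<tgt>` naming convention, so swapping the checkpoint (here `opus-mt-en-fr`, used as an example of that convention) is all that changes:
from transformers import pipeline

# Same pipeline pattern, different Marian checkpoint for English -> French
translator_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator_fr("I hope you can understand my dilemma.")[0]["translation_text"])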
translator = pipeline("translation_en_to_de",
model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])
UserWarning: Recommended: pip install sacremoses.
Device set to use mps:0
Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.
Text Generation¶
📚 Question 7: Text Generation¶
Understand how language models generate text:
- What is the default model used for text generation in the code below? Look it up on the Hub and answer:
- What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)
- How many parameters does the base GPT-2 model have?
- What type of generation does it perform? (autoregressive, non-autoregressive, etc.)
- Why do we use `set_seed(42)` before generation? What would happen without it? Check the generation documentation.
- The code uses `max_length=200`. What other parameters can control text generation? Research and explain:
  - `temperature`
  - `top_k`
  - `do_sample`
- Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?
- What does `pad_token_id` being set to `eos_token_id` mean? Why is this necessary for GPT-2?
- What are the trade-offs between model size and generation quality?
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2). Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee. Customer service response: Dear Bumblebee, I am sorry to hear that your order was mixed up. I did purchase the right set of Optimus sets for my daughter who was a little older than me. She is so excited for her new toy, and I received the Transformers: Decepticon toy of my daughter's birthday gift. As always, please visit with us and let us know what you think for the best gift you can offer
Change the model inside the pipeline to explore other models. Also try other languages 🌍