## Homework 4: Using BERT for Text Classification

For [UMass CS485, Fall 2023](https://people.cs.umass.edu/~brenocon/cs485_f23/)

### Submit via Gradescope as a PDF (File>Print>Save as PDF) and as a Jupyter Notebook (.ipynb). 50 points total (plus extra credit).

Due Sunday Dec 3. Please finish ahead of time so you have time to prepare your presentations!

---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say `YOUR CODE HERE!`.

- For text-based answers, you should replace the text that says "WRITE YOUR ANSWER HERE" with your actual answer.

---

##### *How to submit this problem set:*
- Write all the answers in this CoLab notebook, and submit both as PDF and a Jupyter Notebook.

 1. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope's "HW4 PDF Submission" entry.

 2. Also generate a Jupyter Notebook (.ipynb) via (File -> Download -> Download .ipynb) and upload it to Gradescope's "HW4 Code Submission" entry.

- **Important:** Check your PDF before you submit to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.

- **Important:** On Gradescope, please make sure that you tag each page with the corresponding question(s). This makes it significantly easier for our graders to grade submissions, especially with the long outputs of many of these cells. We will take off points for submissions that are not tagged.

- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. One handy way to do this is by clicking `Runtime -> Run All` in the Notebook menu. *Make sure to attach a GPU.*
---
##### *Computing Resources*
- Google CoLab provides free access to a GPU for up to 12 hours of continuous use. If you exceed this limit, you will not be able to access a GPU for some time. There's no guarantee on when you'll regain access, but generally it will take several hours.
- *This assignment needs nowhere near 12 hours of GPU computing.*
- Avoid leaving your notebook idling with a GPU attached; this is an easy way to rack up GPU usage without meaning to.
---

# Part 0: Setup


## Adding a hardware accelerator
The purpose of this homework is for you to become familiar with using large-scale pretrained language models such as BERT. Since models such as BERT are large neural networks, we will need to attach a GPU for this assignment; otherwise, training and extracting features will take a very long time.

To attach and use a GPU in this CoLab notebook, complete the following steps:

1. First, attach a GPU by navigating the CoLab menu as follows: 
`Edit > Notebook Settings > Hardware accelerator > (GPU)`

2. Then, set the `use_gpu` flag in the following code cell to `True`

3. Finally, confirm that a GPU is detected (or *not* detected) by running the following code cell.

In [None]:
import torch

use_gpu = True # Change this flag as needed

if use_gpu:
    # Check the GPU is detected
    if not torch.cuda.is_available():
        print("ERROR: No GPU detected. Please add a GPU; if you're using Colab, use their UI.")
        assert False
    # Get the GPU device name.
    device_name = torch.cuda.get_device_name()
    n_gpu = torch.cuda.device_count()
    print("Found device: {}, n_gpu: {}".format(device_name, n_gpu))
else:
    # Check that no GPU is detected
    if torch.cuda.is_available():
        print("ERROR: GPU detected.")
        print("Remove the GPU or set the use_gpu flag to True.")
        assert False
    print("No GPU found. Using CPU.")
    print("WARNING: Without a GPU, your code will be extremely slow.")

Note that attaching a GPU to an active notebook (or detaching one) will reset the notebook's runtime.

## Installing 🤗 Hugging Face packages

In [None]:
!pip install transformers==4.24.0
!pip install datasets==2.7.1
!pip install evaluate==0.3.0

## Import numpy
We will be using numpy arrays in part of this assignment. Feel free to use the numpy package anywhere within the assignment.

In [None]:
import numpy

## Define pretrained BERT model
Throughout this assignment, we'll use the `bert-base-uncased` pretrained model from 🤗 Hugging Face. This pretrained model uses the "base" (12-layer) architecture for BERT and preprocesses texts such that they are lowercased (and accent marks are stripped). See the model [documentation](https://huggingface.co/bert-base-uncased) for more details.
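To make the "uncased" preprocessing concrete, here is a minimal sketch of lowercasing and accent stripping using only the standard library. The helper name `uncased_normalize` is hypothetical, and this only approximates the normalization step; the real tokenizer also splits text into WordPiece tokens.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase text and strip accent marks (hypothetical helper, for illustration)."""
    lowered = text.lower()
    # NFD decomposition splits an accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Amélie is a CHARMING film"))
```

One consequence of this design is that the model cannot distinguish words that differ only in case or accents.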

In [None]:
pretrained_bert = 'bert-base-uncased'

## Load our working corpus, a movie review dataset
For this assignment, we'll use another subsample of the Large Movie Review Dataset (Maas et al. ACL 2011); we used some of it in HW1. Note that this time we will load the dataset using the HuggingFace datasets package. Additionally, in this version, positive reviews are labeled as `1` and negative reviews as `0`.

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

In [None]:
NUM_TRAIN = 750
NUM_DEV = 250
NUM_TEST = 250

def build_split(dataset, n_samples, offset=0):
    class_size = n_samples // 2
    # Get negative samples
    texts = dataset['text'][offset:class_size+offset]
    labels = dataset['label'][offset:class_size+offset]
    # Get positive samples
    texts += dataset['text'][-offset-class_size:]
    labels += dataset['label'][-offset-class_size:]
    if offset:
        texts = texts[:-offset]
        labels = labels[:-offset]
    return texts, labels


# Training data
train_texts, train_labels = build_split(dataset['train'], NUM_TRAIN)
test_texts, test_labels = build_split(dataset['test'], NUM_TEST)
dev_texts, dev_labels = build_split(dataset['test'], NUM_DEV, offset=NUM_TEST)

print("train split: {} reviews".format(len(train_labels)))
print("dev split: {} reviews".format(len(dev_labels)))
print("test split: {} reviews".format(len(test_labels)))
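The `offset` argument is what keeps the dev and test splits from overlapping, since both are drawn from the same source split. The sketch below reruns the same slicing logic on a tiny made-up dataset. It assumes, as the comments in `build_split` imply, that negatives come first and positives last; the `toy` dictionary is purely illustrative.

```python
def build_split(dataset, n_samples, offset=0):
    class_size = n_samples // 2
    # Negatives are taken from the front of the split.
    texts = dataset['text'][offset:class_size + offset]
    labels = dataset['label'][offset:class_size + offset]
    # Positives are taken from the back of the split.
    texts += dataset['text'][-offset - class_size:]
    labels += dataset['label'][-offset - class_size:]
    if offset:
        # Drop the trailing positives that belong to the earlier split.
        texts = texts[:-offset]
        labels = labels[:-offset]
    return texts, labels

# Toy stand-in: 20 negatives followed by 20 positives.
toy = {
    'text': ["neg{}".format(i) for i in range(20)] + ["pos{}".format(i) for i in range(20)],
    'label': [0] * 20 + [1] * 20,
}

toy_test_texts, toy_test_labels = build_split(toy, 4)
toy_dev_texts, toy_dev_labels = build_split(toy, 4, offset=4)
print(toy_test_texts)
print(toy_dev_texts)
```

Because `offset` is passed as the full size of the earlier split rather than its per-class size, a few examples between the two splits are simply skipped.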

## Define confidence interval method
For this assignment we will compute confidence intervals for accuracy measurements using a normal approximation. If you used the bootstrap, it would calculate a very similar CI.

In [None]:
import scipy.stats  # import the submodule explicitly so scipy.stats is always available

def get_confidence_intervals(accuracy, sample_size, confidence_level):
    """Calling this with arguments (0.8, 100, .95) returns
    the lower and upper bounds of a 95% confidence interval
    around the accuracy of 0.8 on a test set of size 100."""
    z_score = -1 * scipy.stats.norm.ppf((1-confidence_level)/2)
    standard_error = numpy.sqrt(accuracy * (1-accuracy) / sample_size)
    lower_ci = accuracy - standard_error*z_score
    upper_ci = accuracy + standard_error*z_score
    return lower_ci, upper_ci

In [None]:
# Example: if you had 80% accuracy on an N=250 sized test set, your CI is [75.0%...85.0%]
get_confidence_intervals(0.8, 250, .95)

In [None]:
# Example: For a much larger test set, your CI is much smaller
get_confidence_intervals(0.8, 10000, .95)
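To illustrate the claim that the bootstrap gives a very similar interval, the sketch below compares the normal approximation with a percentile bootstrap using only the standard library. The helper names, the number of resamples, and the synthetic list of per-example outcomes are all assumptions made for illustration; they are not part of the assignment code.

```python
import random
from statistics import NormalDist

def normal_ci(accuracy, n, level=0.95):
    # Same normal approximation as above, written with the standard library.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = (accuracy * (1 - accuracy) / n) ** 0.5
    return accuracy - z * se, accuracy + z * se

def bootstrap_ci(correct, level=0.95, n_resamples=2000, seed=0):
    # Percentile bootstrap: resample the per-example outcomes with replacement,
    # recompute accuracy each time, and read off the empirical quantiles.
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return accs[lo_idx], accs[hi_idx]

# Synthetic outcomes: 200 correct and 50 incorrect predictions (80% of 250).
correct = [1] * 200 + [0] * 50

print("normal:   ", normal_ci(0.8, len(correct)))
print("bootstrap:", bootstrap_ci(correct))
```

The two intervals should agree to within a point or so here; the normal approximation becomes less reliable for very small test sets or accuracies near 0 or 1.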

# Part 1: Using BERT features for Text Classification (25 points)
In this part, we'll use extracted BERT features for text classification. We will extract these features from the raw hidden states of different layers.

## Checking for a GPU
While this part of the homework can be run without a GPU, it will take much longer. Specifically, extracting the hidden states from each layer in our pretrained BERT model in Question 1.1 will take over 30 minutes with a CPU, but only a few minutes with a GPU.

Refer back to the section "Adding a hardware accelerator" in Part 0.



In [None]:
if not torch.cuda.is_available():
    print("WARNING: No GPU detected. Add a GPU.")
else:
    print("GPU detected.")

## Loading BERT model
For this part, we'll use a pretrained BERT model, specifically the 🤗 `BertModel` that outputs the raw hidden states of BERT without any specific head on top. Refer to the 🤗 [documentation](https://huggingface.co/transformers/model_doc/bert.html#bertmodel) for more detail.

In [None]:
from transformers import AutoTokenizer, BertModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(pretrained_bert)
model = BertModel.from_pretrained(pretrained_bert,
 output_hidden_states=True).to(device)

## Question 1.1 (5 points)
First, we need to extract BERT features for each document in our dataset (i.e., movie review). For each document, we'll feed its (truncated) text into BERT and extract the raw hidden states of the [CLS] token for all 12 layers of the model to use as our features. We'll use the function `extract_bert_features` to extract these features for a collection of texts. The function `extract_bert_features` takes a list of texts `input_text` as input and outputs a numpy array corresponding to the extracted features of these texts.

In the following code cell, complete the implementation of `extract_bert_features`. More specifically, your code must extract the raw hidden states of the [CLS] token for each layer from `hidden_states` and arrange these into the `feature` variable such that it is a numpy array with shape (# layers = 12, hidden_size = 768).

HINTS
- The `hidden_states` of a `BertModel` is a tuple of length 13 rather than 12 because it also contains the embedding layer of BERT. The hidden states for the embedding layer are the *first* element in `hidden_states`, followed by the hidden states of the following layers (from 1 to 12).
- The hidden states for each layer within `hidden_states` (i.e. an element of `hidden_states`) are represented as an array with the following shape (# batches, # tokens, hidden_size = 768). We are only running a single batch through BERT, so each layer's hidden state array will have a shape of (1, # tokens, hidden_size = 768).
- To convert a PyTorch tensor to a numpy array, use the following command: `[tensor].detach().cpu().numpy()`
- Use `torch.stack` and `numpy.stack` to "stack" PyTorch tensors and NumPy arrays along a new dimension. By default this will be the first dimension of the resulting array. (See documentation: [PyTorch](https://pytorch.org/docs/stable/generated/torch.stack.html), [numpy](https://numpy.org/doc/stable/reference/generated/numpy.stack.html))
- It will take several minutes to extract the features for our training and test sets. Generally, it should take under 3 minutes using a GPU. (It will take *much* longer using a CPU: over 30 minutes.)
- Consider using a CPU while writing/debugging your code; just make sure to quit early (e.g., after extracting the features for a single document or the hidden states for a single layer).
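As a small illustration of the stacking hint, the sketch below shows how `numpy.stack` adds a new axis when combining equally shaped arrays. The toy shapes are arbitrary stand-ins and are much smaller than real BERT hidden states; `torch.stack` behaves analogously with its `dim` argument.

```python
import numpy

# Three toy arrays, each with shape (1, 5, 4), standing in for
# (batch, tokens, hidden_size).
layers = [numpy.full((1, 5, 4), fill_value=k, dtype=float) for k in range(3)]

# By default the new axis is the first one.
stacked = numpy.stack(layers)
# The axis argument places the new axis elsewhere, here at the end.
stacked_last = numpy.stack(layers, axis=-1)

print(stacked.shape)
print(stacked_last.shape)
```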

In [None]:
def extract_bert_features(input_texts):
    features = []
    for i, text in enumerate(input_texts):
        input = tokenizer.encode(text, truncation=True,
                                 return_tensors="pt").to(device)
        hidden_states = model(input).hidden_states
        feature = None
        # YOUR CODE HERE!


        assert feature.shape == (12, 768)
        features.append(feature)

    return numpy.stack(features)

In [None]:
# Extract features for the training and test sets
from timeit import default_timer as timer

start = timer()
train_features = extract_bert_features(train_texts)
test_features = extract_bert_features(test_texts)
end = timer()
print("Extracted features in {:.1f} minutes".format((end-start)/60))

assert train_features.shape == (NUM_TRAIN, 12, 768)
assert test_features.shape == (NUM_TEST, 12, 768)

## Question 1.2 (5 points)
BERT accepts token sequences up to 512 tokens in length (including special tokens). In order to handle longer movie reviews, we *truncated* these reviews to 510 tokens.
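The arithmetic behind the 510-token figure can be sketched as follows: two of the 512 positions are reserved for the special tokens `[CLS]` and `[SEP]`. The helper `truncate_for_bert` is hypothetical, and whitespace splitting is only a stand-in for BERT's WordPiece tokenization, which usually produces more tokens per review.

```python
MAX_LEN = 512      # maximum sequence length accepted by BERT
NUM_SPECIAL = 2    # [CLS] at the start and [SEP] at the end

def truncate_for_bert(tokens, max_len=MAX_LEN):
    """Keep at most max_len - 2 tokens and wrap them in special tokens."""
    budget = max_len - NUM_SPECIAL
    was_truncated = len(tokens) > budget
    kept = tokens[:budget]
    return ["[CLS]"] + kept + ["[SEP]"], was_truncated

short_review = "a fine film".split()   # whitespace split: illustration only
long_review = ["word"] * 600

print(len(truncate_for_bert(short_review)[0]), truncate_for_bert(short_review)[1])
print(len(truncate_for_bert(long_review)[0]), truncate_for_bert(long_review)[1])
```

Everything past the budget is discarded, so the model never sees the end of a long review.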

### Question 1.2.1 (2 points)
How often are reviews in our dataset truncated? In the following code cell, write code that calculates the number of reviews truncated in the training and test splits (i.e., `train_texts`, `test_texts`)

HINT: Use the tokenizer's [`tokenize`](https://huggingface.co/docs/transformers/v4.24.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.tokenize) method.

In [None]:
train_truncated = 0
test_truncated = 0

# YOUR CODE HERE!


print("train: {} reviews truncated".format(train_truncated))
print("test: {} reviews truncated".format(test_truncated))

### Question 1.2.2 (3 points)
Why might truncation be problematic for our classification task? Explain your reasoning.

**WRITE YOUR ANSWER HERE**

## Question 1.3 (5 points)
Now, let's compare the performance of the extracted features from different layers. For each layer, use the layer's hidden states stored in `train_features` to train a [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model, then predict the labels of the test reviews using the layer's extracted features in `test_features`. Store these predictions in `y_pred`, so the code cell will print the resulting classification accuracy for each layer's features.

HINT: For all the layers, you should get accuracies in the 60s and 70s.

In [None]:
from sklearn.linear_model import LogisticRegression

for i in range(12):
    y_pred = None

    lr_model = LogisticRegression(max_iter=1000)
    # YOUR CODE HERE!


    acc = (y_pred == test_labels).sum()/len(test_labels)
    print("Layer {}: {:.3f} accuracy, 95% CI [{:.3f}, {:.3f}]".format(i+1, acc, *get_confidence_intervals(acc, NUM_TEST, 0.95)))

## Question 1.4 (5 points)
According to your results from Question 1.3, which layers perform best and which perform worst? Taking into consideration the 95% confidence intervals of the test accuracy results, do the performance differences between layers appear significant / meaningful? Explain your reasoning.

**WRITE YOUR ANSWER HERE**

## Question 1.5 (5 points)
In this problem, we represented a text by the extracted BERT features of the [CLS] token. However, there are other strategies. A popular option is to *average* the embeddings of all tokens of the input sequence. Do you think these alternative features will be more suitable for our classification task than using the [CLS] features? Why or why not?
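For reference, the two pooling strategies can be sketched mechanically on a toy array. The function names and the tiny 4-token, 2-dimensional `hidden` array are illustrative assumptions; the attention mask only matters when a sequence contains padding, which is not the case when texts are encoded one at a time.

```python
import numpy

def cls_pool(hidden):
    # The [CLS] token is always the first position of the sequence.
    return hidden[0]

def mean_pool(hidden, attention_mask):
    # Average only over real tokens; padded positions have mask 0.
    mask = numpy.asarray(attention_mask, dtype=float)[:, None]
    return (hidden * mask).sum(axis=0) / mask.sum()

# Toy hidden states: 4 token positions, hidden size 2.
hidden = numpy.array([[1.0, 0.0],
                      [3.0, 2.0],
                      [5.0, 4.0],
                      [9.0, 9.0]])
mask = [1, 1, 1, 0]   # the last position is padding

print(cls_pool(hidden))
print(mean_pool(hidden, mask))
```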

**WRITE YOUR ANSWER HERE**

## Question 1.6 (Extra Credit: 5 points)
How could we construct a 768-dimensional embedding for a long movie review without truncating the review? Design and describe a method for doing so.

**WRITE YOUR ANSWER HERE**

# Part 2: Fine-Tuning BERT for Text Classification (25 points)
In this part, we'll perform the same text classification task as Part 1, but this time we'll fine-tune BERT rather than using extracted BERT features.

**Be sure to use a GPU for this portion of the homework.**

## Checking for a GPU
In this part of the homework we will need a GPU; otherwise, fine-tuning will take a really long time. Refer back to the section "Adding a hardware accelerator" in Part 0.

In [None]:
if not torch.cuda.is_available():
    print("ERROR: No GPU detected. Add a GPU.")
    assert torch.cuda.is_available()

## Question 2.1 (5 points)
When fine-tuning BERT, we need to choose our hyperparameters carefully. In order to perform a proper hyperparameter search, we need a validation set.
Why is it important to have a distinct validation set in addition to our training and test sets?

**WRITE YOUR ANSWER HERE**

## Setup: Preparing our dataset for fine-tuning BERT


### Preparing our datasets for fine-tuning BERT

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_bert)

In [None]:
from torch.utils.data import Dataset, DataLoader

class MovieReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_encodings = tokenizer(train_texts, truncation=True)
dev_encodings = tokenizer(dev_texts, truncation=True)
test_encodings = tokenizer(test_texts, truncation=True)

train_dataset = MovieReviewDataset(train_encodings, train_labels)
dev_dataset = MovieReviewDataset(dev_encodings, dev_labels)
test_dataset = MovieReviewDataset(test_encodings, test_labels)
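Note that the texts above are tokenized with truncation but without padding, so examples have different lengths. Padding is applied later, per batch, which is what "dynamic padding" refers to in the `Trainer` setup. A minimal sketch of the idea, using a hypothetical `pad_batch` helper and made-up token IDs (the default `pad_id=0` is assumed to match the `[PAD]` token of `bert-base-uncased`):

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Toy batch of two token-ID sequences with different lengths.
batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding to the longest sequence in each batch, rather than always to 512, avoids wasted computation on batches of short reviews.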

### Defining a method to support computing and reporting metrics

In [None]:
# Source: https://huggingface.co/transformers/training.html
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = numpy.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
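To see what the metric function does, the sketch below applies the same argmax step to a handful of made-up logits and computes accuracy directly with numpy instead of the `evaluate` package. The values are illustrative only.

```python
import numpy

# Toy logits for 4 examples and 2 classes (column 0 = negative, column 1 = positive).
logits = numpy.array([[2.0, -1.0],
                      [0.1, 0.3],
                      [-0.5, 1.5],
                      [1.2, 0.4]])
labels = numpy.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row.
predictions = numpy.argmax(logits, axis=-1)
accuracy = float((predictions == labels).mean())

print(predictions, accuracy)
```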

### Defining a method for instantiating BERT Model for fine-tuning procedure

In [None]:
from transformers import AutoModelForSequenceClassification

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        pretrained_bert, num_labels=2)

## Fine-tuning BERT
We can fine-tune our BERT model using a `Trainer` object ([documentation](https://huggingface.co/transformers/main_classes/trainer.html)). To build a `Trainer` object, we need to provide a `TrainingArguments` object ([documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)), which is where we can specify hyperparameter settings and other training details.

Running the following code cell should take around 3 minutes.

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
 output_dir='./results', # output directory
 num_train_epochs=2, # total number of training epochs
 per_device_train_batch_size=8, # batch size per device during training
 per_device_eval_batch_size=64, # batch size for evaluation
 evaluation_strategy="epoch", # evaluation occurs after each epoch
 logging_dir='./logs', # directory for storing logs
 logging_strategy="epoch", # logging occurs after each epoch
 log_level="error", # set logging level
 optim="adamw_torch", # use pytorch's adamw implementation
 # YOUR CODE HERE!

)

trainer = Trainer(
 model_init=model_init, # method instantiates model to be trained
 args=training_args, # training arguments, defined above
 train_dataset=train_dataset, # training dataset
 eval_dataset=dev_dataset, # evaluation dataset
 compute_metrics=compute_metrics, # function to be used in evaluation
 tokenizer=tokenizer, # enable dynamic padding
)

trainer.train()
val_accuracy = trainer.evaluate()['eval_accuracy']

print()
print()
print("FINAL: Validation Accuracy {:.3f}, 95% CI [{:.3f}, {:.3f}]".format(val_accuracy, *get_confidence_intervals(val_accuracy, NUM_DEV, 0.95)))

## Question 2.2 (5 points)
Selecting a good learning rate is very important for fine-tuning. Among all the many hyperparameters, you should at least try varying this one. Let's try adjusting the learning rate to see which setting performs best. **Fine-tune BERT with the following learning rates: 2e-5, 3e-5, 4e-5, 5e-5.** To adjust the learning rate, set the `learning_rate` parameter of your `TrainingArguments` object. By default, it's `5e-5`.
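One way to organize such a sweep, sketched here without actually training anything, is to keep the shared settings in a dictionary and derive one configuration per learning rate. The names `base_args` and `sweep` are hypothetical; in the notebook each configuration would be unpacked into `TrainingArguments(**cfg)` and given to a fresh `Trainer`.

```python
# Shared settings (a subset of those used in the cell above).
base_args = dict(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=8,
)

learning_rates = [2e-5, 3e-5, 4e-5, 5e-5]

# One configuration per learning rate; base_args itself is left unchanged.
sweep = [dict(base_args, learning_rate=lr) for lr in learning_rates]

for cfg in sweep:
    print(cfg)
```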

Report the resulting validation accuracy (i.e., after last epoch of training) for each learning rate in the table below.

| Learning Rate | Validation Accuracy (%) | 95% Confidence Interval (%) |
| :-: | :-: | :-: |
| 2e-5 | ??.? | \[ ??.?, ??.? \] |
| 3e-5 | ??.? | \[ ??.?, ??.? \] |
| 4e-5 | ??.? | \[ ??.?, ??.? \] |
| 5e-5 | ??.? | \[ ??.?, ??.? \] |

Which of these learning rates performs the best with respect to validation accuracy? Taking into consideration the 95% confidence intervals of the validation accuracy results, how meaningful / significant are these differences in performance? Explain your reasoning.

**WRITE YOUR ANSWER HERE**

## Question 2.3 (5 points)
Random initializations can also affect fine-tuning performance. By default, the random seed of the Trainer is set to `42`. Using the *best* performing learning rate from Question 2.2, **fine-tune BERT with three additional random seeds of your choice.** To adjust the random seed, set the `seed` parameter of your `TrainingArguments` object.

Report the resulting validation accuracy (i.e., after last epoch of training) and 95% confidence interval for each random seed in the table below.

| Random Seed | Validation Accuracy (%) | 95% Confidence Interval (%) |
| :-: | :-: | :-: |
| 42 | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |

Which of these random seeds performs the best with respect to validation accuracy? How do these differences compare with the variation seen in Question 2.2? Explain your reasoning.

**WRITE YOUR ANSWER HERE**

## Question 2.4 (5 points)
In Questions 2.2 and 2.3 we changed two different hyperparameters that can impact the performance of our models. However, we changed each one individually while keeping the other fixed. Let's see how the random seeds from Question 2.3 affect your *worst* performing learning rate from Question 2.2.

Report the resulting validation accuracy (i.e., after last epoch of training) and 95% confidence interval for each random seed in the table below.

| Random Seed | Validation Accuracy (%) | 95% Confidence Interval (%) |
| :-: | :-: | :-: |
| 42 | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |
| ? | ??.? | \[ ??.?, ??.? \] |

Given these results and those from Question 2.3, can the random seed of the Trainer affect which learning rate seems best? Explain your reasoning. 

**WRITE YOUR ANSWER HERE**

## Question 2.5 (5 points)
Now that we've performed our hyperparameter search, let's see how well your fine-tuned model performs on our test set. Add your best hyperparameter settings (determined by Questions 2.2-2.4) to the code cell below to fine-tune BERT and then compute the test accuracy for your fine-tuned model.

In [None]:
best_training_args = TrainingArguments(
 output_dir='./results', # output directory
 num_train_epochs=2, # total number of training epochs
 per_device_train_batch_size=8, # batch size per device during training
 per_device_eval_batch_size=64, # batch size for evaluation
 evaluation_strategy="epoch", # evaluation occurs after each epoch
 logging_dir='./logs', # directory for storing logs
 logging_strategy="epoch", # logging occurs after each epoch
 log_level="error", # set logging level
 optim="adamw_torch", # use pytorch's adamw implementation
 # YOUR CODE HERE!

)

best_trainer = Trainer(
 model_init=model_init, # method instantiates model to be trained
 args=best_training_args, # training arguments, defined above
 train_dataset=train_dataset, # training dataset
 eval_dataset=dev_dataset, # evaluation dataset
 compute_metrics=compute_metrics, # function to be used in evaluation
 tokenizer=tokenizer, # enable dynamic padding
)

best_trainer.train()

# Print test accuracy
print()
print()
test_accuracy = best_trainer.evaluate(test_dataset)['eval_accuracy']
print("Test Accuracy {:.3f}, 95% CI [{:.3f}, {:.3f}]".format(test_accuracy, *get_confidence_intervals(test_accuracy, NUM_TEST, 0.95)))

Although both your fine-tuned BERT classifier and the one you built in Part 1 rely on the [CLS] token, they have radically different performance. Explain why fine-tuning BERT greatly outperforms the results from Question 1.3.

**WRITE YOUR ANSWER HERE**


# Part 3, Extra credit: Generative LLMs

This section is extra credit, to explore large language models.

## Question 3.1 (up to 10 points EC)

Choose a generative language model that has an API (e.g. ChatGPT), set the temperature to 0, and come up with two questions that it answers incorrectly (the questions cannot be related to facts after the pre-training date for the model; e.g. 2021 for GPT4). Then, use one of the prompt engineering strategies linked from the schedule page to get the language model to output the correct answer. In your writeup, for each question, list the original question and answer outputted by the model through the API, describe the prompt engineering strategy, and list the new inputs to and outputs from the model which describe a correct answer.
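As background on why temperature 0 is requested: dividing the logits by a temperature before the softmax sharpens the distribution as the temperature shrinks, so sampling approaches always picking the most likely token. The sketch below uses made-up logits and a hypothetical helper. A temperature of exactly 0 would divide by zero in this formula; the assumption here is that an API treats that setting as plain argmax (greedy) decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # toy scores for three candidate tokens
for t in (1.0, 0.5, 0.05):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```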

**WRITE YOUR CODE/ANSWER HERE**


## Question 3.2 (up to 10 points EC)

Choose a generative language model you can run yourself (not a remote API), where you can access the probability distribution for the next word $p(w_t \mid w_{t-d}, \ldots, w_{t-1})$. We suggest a model from the [Pythia](https://github.com/EleutherAI/pythia) project; on Hugging Face, you could try, for example, [EleutherAI/pythia-70m-deduped](https://huggingface.co/EleutherAI/pythia-70m-deduped) or a Llama model.

Implement greedy decoding, top-k sampling, nucleus (top-p) sampling, and possibly beam search (trickier). Come up with 3 different sequences of text. For each sequence, investigate the generated outputs from greedy decoding, from beam search while varying the number of beams from 1 to a large number, from top-k sampling while varying k, and from nucleus (top-p) sampling while varying p. For each sequence, discuss how the quality of the generated text compares across greedy decoding, beam search, top-k sampling, and nucleus sampling (which lead to better generated text?). For each of beam search, top-k sampling, and nucleus sampling, discuss how the quality of the generated text changes as you vary the hyperparameter (number of beams, k, p). What patterns can you find? What explains these observations (e.g., why do some settings perform better or worse)?
What results would you have expected before running these experiments, and do your observations match those expectations?
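As a starting point for thinking about the sampling strategies, the sketch below applies greedy selection, top-k filtering, and nucleus (top-p) filtering to a single made-up next-word distribution. All names and probabilities are illustrative assumptions; a real implementation would obtain the distribution from the model at every step and repeat the procedure token by token. Beam search is not covered here.

```python
import random

def greedy(probs):
    # Pick the single most probable word.
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    # Keep the k most probable words and renormalize.
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {w: p / total for w, p in kept}

def top_p_filter(probs, p):
    # Keep the smallest set of top words whose cumulative mass reaches p.
    kept, cumulative = [], 0.0
    for w, q in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((w, q))
        cumulative += q
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return {w: q / total for w, q in kept}

def sample(probs, rng):
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

# Toy next-word distribution (made up for illustration).
next_word = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
rng = random.Random(0)

print(greedy(next_word))
print(top_k_filter(next_word, 2))
print(top_p_filter(next_word, 0.9))
print(sample(top_p_filter(next_word, 0.9), rng))
```

With `k=1`, top-k sampling reduces to greedy decoding; larger `k` or `p` admit more of the low-probability tail.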


**WRITE YOUR CODE/ANSWER HERE**

