https://arxiv.org/pdf/2109.01652.pdf

FLAN: INSTRUCTION TUNING IMPROVES ZERO-SHOT LEARNING

2.1 TASKS & TEMPLATES

2.2 EVALUATION SPLITS

2.3 CLASSIFICATION WITH OPTIONS

2.4 TRAINING DETAILS

  1. Tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018). - The pretraining text was broken into subword "tokens" using Byte Pair Encoding (BPE), which represents each word as a sequence of subword units. This keeps the vocabulary small and lets the model handle words it never saw during training. SentencePiece is a language-agnostic tokenization library that implements BPE (see the tokenization sketch after this list).
    1. Unknown words: if a word never appears during training but shows up at test time, it can still be represented by piecing together known subwords, so the model gets a meaningful representation for it.
    2. Reduced vocabulary size: instead of storing "running", "runs", and "runner" as separate entries, the vocabulary can hold "run", "ing", "s", and "ner".
  2. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k. - The datasets being mixed vary widely in size, so each one is capped at 30,000 training examples. During training, each dataset is sampled with probability proportional to its size, but for computing those proportions no dataset counts as more than 3,000 examples (the "mixing rate maximum"). This prevents the largest datasets from dominating the mixture (see the mixing sketch after this list).
  3. We fine-tune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. - This is the instruction-tuning setup: the pretrained model is trained further on the mixture above for 30,000 gradient steps. Adafactor is a memory-efficient adaptive optimizer, and the learning rate of 3e-5 controls how large each parameter update is in response to the error the model sees (see the optimizer sketch after this list).
  4. The input and target sequence lengths used in fine-tuning are 1024 and 256, respectively. - Each input (the instruction plus its context) is truncated or padded to 1,024 tokens, and each target output is truncated or padded to 256 tokens, so every example has a fixed shape (see the length-handling sketch after this list).
  5. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token. - Rather than padding every short sequence up to the maximum length, which wastes computation on pad tokens, packing concatenates several examples into one long sequence, with the end-of-sequence (EOS) token marking where one example ends and the next begins (see the packing sketch after this list).
  6. This instruction tuning takes around 60 hours on a TPUv3 with 128 cores. - This describes the computational resources and time taken for the fine-tuning process.
  7. For all evaluations, we report results on the final checkpoint trained for 30k steps. - This means that they evaluate the model after it has been fine-tuned for 30,000 steps, and this is the version of the model that the results in the paper are based on.
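
Below is a minimal sketch of BPE subword tokenization with the SentencePiece library mentioned in point 1. The corpus file name, vocabulary size, and example sentence are made up for illustration; the paper itself builds a 32k-token vocabulary over the full 2.49T-token pretraining corpus.

```python
import sentencepiece as spm

# Train a small BPE model on a toy corpus (assumes a plain-text file
# "corpus.txt" exists; file name and vocab size are illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_bpe",
    vocab_size=1000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")

# A word like "runner", even if unseen as a whole word during training,
# is split into known subwords (e.g. ["▁run", "ner"]), so it still gets
# a representation built from pieces the model knows.
print(sp.encode("the runner keeps running", out_type=str))
```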
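
This sketch shows how examples-proportional mixing with a mixing-rate maximum (point 2) keeps large datasets from dominating. The dataset names and sizes are invented for illustration; only the 30k per-dataset cap and the 3,000 mixing-rate maximum come from the paper.

```python
MIX_RATE_MAX = 3_000   # mixing rate maximum from the paper
EXAMPLE_CAP = 30_000   # per-dataset cap on training examples

# Hypothetical dataset sizes (after the 30k cap has been applied).
dataset_sizes = {
    "nli_large": 30_000,
    "qa_medium": 10_000,
    "summ_small": 1_200,
}

# Each dataset's sampling weight is proportional to min(size, MIX_RATE_MAX),
# so anything larger than 3,000 examples gets the same weight.
capped = {name: min(n, MIX_RATE_MAX) for name, n in dataset_sizes.items()}
total = sum(capped.values())
sampling_probs = {name: c / total for name, c in capped.items()}

print(sampling_probs)
# nli_large: 3000/7200 ≈ 0.417, qa_medium: 3000/7200 ≈ 0.417,
# summ_small: 1200/7200 ≈ 0.167
```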
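
The following is a hedged sketch of the optimizer setup from point 3: Adafactor with a constant learning rate of 3e-5. The paper fine-tunes LaMDA-PT and its training code is not public, so a small T5 checkpoint from Hugging Face Transformers stands in purely for illustration, and the dummy batch below is nowhere near the real 8,192-token batches.

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Constant learning rate of 3e-5 (disabling Adafactor's default
# relative-step schedule and parameter scaling).
optimizer = Adafactor(
    model.parameters(),
    lr=3e-5,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

# One illustrative gradient step on random token ids; the paper runs
# 30k such steps on real instruction-tuning data.
input_ids = torch.randint(0, 32_000, (2, 16))
labels = torch.randint(0, 32_000, (2, 8))
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```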
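
A tiny illustration of the fixed sequence lengths from point 4: anything longer than the limit is truncated, anything shorter is padded. Lengths of 6 replace the paper's 1,024 (inputs) and 256 (targets) to keep the output readable, and PAD_ID = 0 is an assumed pad-token id.

```python
PAD_ID = 0  # assumed pad-token id

def fix_length(token_ids, max_len):
    """Truncate to max_len, then pad with PAD_ID up to max_len."""
    token_ids = token_ids[:max_len]
    return token_ids + [PAD_ID] * (max_len - len(token_ids))

print(fix_length([11, 12, 13, 14, 15, 16, 17, 18], 6))  # -> [11, 12, 13, 14, 15, 16]
print(fix_length([21, 22, 23], 6))                      # -> [21, 22, 23, 0, 0, 0]
```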
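
Finally, a minimal sketch of packing (point 5): several (input, target) examples are concatenated into one sequence, with an EOS token marking boundaries, until adding another example would exceed the maximum length. The token ids, EOS_ID = 1, and the max length of 20 are all illustrative; real implementations additionally mask attention so tokens cannot attend across example boundaries.

```python
EOS_ID = 1    # assumed end-of-sequence token id
MAX_LEN = 20  # illustrative; the paper uses 1024-token inputs

def pack(examples, max_len=MAX_LEN):
    """Greedily pack (input_ids, target_ids) pairs into sequences of at most max_len tokens."""
    packed, current = [], []
    for inputs, targets in examples:
        piece = inputs + [EOS_ID] + targets + [EOS_ID]
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current += piece
    if current:
        packed.append(current)
    return packed

examples = [
    ([11, 12, 13], [41, 42]),
    ([14, 15], [43]),
    ([16, 17, 18, 19], [44, 45]),
]
print(pack(examples))
# -> [[11, 12, 13, 1, 41, 42, 1, 14, 15, 1, 43, 1, 16, 17, 18, 19, 1, 44, 45, 1]]
```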