Glue Model Evaluation

GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks:

ax

A manually-curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena. This dataset evaluates sentence understanding through Natural Language Inference (NLI) problems. Use a model trained on MulitNLI to produce predictions for this dataset.

cola

The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

mnli

The Multi-Genre Natural Language Inference Corpus is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The authors of the benchmark use the standard test set, for which they obtained private labels from the RTE authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) section. They also uses and recommend the SNLI corpus as 550k examples of auxiliary training data.

mnli_matched

The matched validation and test splits from MNLI. See the “mnli” BuilderConfig for additional information.

mnli_mismatched

The mismatched validation and test splits from MNLI. See the “mnli” BuilderConfig for additional information.

mrpc

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

qnli

The Stanford Question Answering Dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The authors of the benchmark convert the task into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.

qqp

The Quora Question Pairs2 dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

rte

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. The authors of the benchmark combined the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Examples are constructed based on news and Wikipedia text. The authors of the benchmark convert all datasets to a two-class split, where for three-class datasets they collapse neutral and contradiction into not entailment, for consistency.

sst2

The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels.

stsb

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5.

Note:

  • [WNLI] (Winograd is not included in our evaluation code)

How to run Evaluation using Tensorflow Transformers.

Evaluation GLUE using Tensorflow Transformers is fairly easy. The default code is written for AlbertModel. All the configuration are managed using Hydra.

Evaluate using Joint Loss

python run_glue.py optimizer.loss_type=joint

Evaluate without Joint Loss

python run_glue.py

To run a task individually (eg: MRPC)

python run_glue.py +glue=mrpc glue.data.max_seq_length=128

After the code got executed there will be an output folder generated as per Hydra configuration. Output folder looks like this outputs/2021-10-17/13-54-47/

Upon succesful execution of code, output looks like this .

GLUE SCORE calculated

cola

mnli

mrpc

qnli

qqp

rte

sst2

stsb

glue_score

0

0.574227

0.8481

0.890

0.916

0.889

0.725

0.919

0.901

0.833384

GLUE SCORE calculated

cola

mnli

mrpc

qnli

qqp

rte

sst2

stsb

glue_score

layer_1

0

0.581514

0.748025

0.612484

0.730661

0.552347

0.807339

0.0593384

0.511464

layer_2

0.0181483

0.737464

0.777894

0.822259

0.834173

0.570397

0.869266

0.8295

0.682388

layer_3

0.253889

0.78251

0.809552

0.859418

0.863381

0.588448

0.881881

0.862159

0.737655

layer_4

0.378279

0.810607

0.845078

0.883397

0.874679

0.631769

0.905963

0.877547

0.775915

layer_5

0.478266

0.82725

0.867434

0.896394

0.882526

0.642599

0.916284

0.889308

0.800008

layer_6

0.518539

0.835905

0.879525

0.909024

0.886847

0.67509

0.918578

0.894333

0.81473

layer_7

0.561713

0.842418

0.890978

0.913051

0.888815

0.700361

0.918578

0.898398

0.826789

layer_8

0.573798

0.845268

0.892842

0.915431

0.889218

0.689531

0.917431

0.898625

0.827768

layer_9

0.571642

0.846082

0.897044

0.914882

0.889671

0.700361

0.918578

0.901202

0.829933

layer_10

0.566903

0.847608

0.895647

0.915614

0.889495

0.714801

0.919725

0.902506

0.831537

layer_11

0.574227

0.848117

0.890978

0.916712

0.889708

0.725632

0.919725

0.901971

0.833384

layer_12

0.573001

0.848117

0.887111

0.91598

0.889409

0.722022

0.920872

0.899414

0.831991

How to change the base model?

Base model and tokenizer can be changed in model.py . But based on the special tokens, you might need to modify some part of the code.