Introduction to FewNLU

Few-shot natural language understanding (NLU) has attracted much recent attention. However, prior methods have been evaluated under a diverse set of protocols, which hinders fair comparison and obscures the measurement of progress in the field. The community needs a unified evaluation protocol as well as a general toolkit for few-shot NLU. FewNLU addresses this need in the following ways.

      • FewNLU introduces an evaluation framework for few-shot NLU and uses comprehensive experiments to justify its choices of data-split construction and hyper-parameter search space formulation. The framework can be viewed as a correction, improvement, and unification of previous evaluation protocols.
      • Under this evaluation framework, we re-evaluate a number of recently proposed state-of-the-art methods. Through these experiments, we benchmark both the performance of individual prior methods and the best performance achieved by a combined approach.
      • Throughout our exploration, we arrive at several key findings, summarized in the FewNLU paper.
      • We open-source the FewNLU toolkit to facilitate future research based on our evaluation framework.
Please refer to our Paper for more details. We also release a new version of the FewGLUE dataset (with 64 labeled samples) for Download; this is the version used in the experiments in our paper.

FewNLU Toolkit

We open-source FewNLU, an integrated toolkit designed for few-shot natural language understanding. It contains implementations of several state-of-the-art methods, data-processing utilities, a standardized few-shot training framework, and, most importantly, the proposed evaluation framework. FewNLU also allows users to define new tasks and methods and to run training and evaluation over them. The goal of FewNLU is to facilitate benchmarking of few-shot NLU methods and future research in the field. Key features and capabilities of FewNLU include:

  • An Evaluation Framework with Recommended Data-split Strategy

    We propose an evaluation framework for few-shot NLU. The newly formulated framework consists of a repeated procedure: selecting a hyper-parameter configuration, selecting a data split, then training and evaluating the model.

  • A Collection of State-of-the-Art Methods for Few-Shot NLU

    The FewNLU toolkit contains a number of state-of-the-art few-shot methods. We take a further step and re-evaluate them under the newly proposed evaluation framework, reporting the results on the Leaderboard.

  • Easy-to-Use Customization of Tasks and Methods

    FewNLU allows customizing new tasks and methods through easy-to-use interfaces. Customization enables FewNLU to easily scale to a diverse range of future work.
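The data-split procedure behind the evaluation framework above can be sketched in plain Python. This is a hypothetical illustration, not the FewNLU API: `train` and `evaluate` are toy stand-ins for a real fine-tuning routine and metric. The loop re-splits the small labeled set into train/dev halves several times, trains on one half, scores on the other, and ranks hyper-parameter configurations by their mean dev score.

```python
import random
import statistics

# Toy stand-ins for a real trainer and metric; in practice these would
# fine-tune a pretrained model on the task and score it on the dev split.
def train(config, train_set):
    return {"lr": config}

def evaluate(model, dev_set):
    # Deterministic toy score so the sketch runs end to end:
    # configurations closer to 1e-5 score higher.
    return 1.0 / (1.0 + abs(model["lr"] - 1e-5) * 1e5)

def evaluate_config(config, labeled, num_splits=4, seed=0):
    """Score one hyper-parameter configuration by repeatedly re-splitting
    the small labeled set into train/dev halves and averaging dev scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_splits):
        examples = labeled[:]
        rng.shuffle(examples)
        mid = len(examples) // 2
        model = train(config, examples[:mid])
        scores.append(evaluate(model, examples[mid:]))
    return statistics.mean(scores)

def search(space, labeled):
    """Return the configuration with the best mean dev score."""
    return max(space, key=lambda c: evaluate_config(c, labeled))

best = search([1e-5, 5e-5, 1e-4], list(range(64)))  # 64 labeled samples
```

Averaging over several random splits, rather than trusting a single tiny dev set, is what makes model selection under 64 labeled examples less noisy.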

Leaderboard

Report your results: If you have new results obtained with FewNLU, please send an email to fewnlu@gmail.com, zyanan93@gmail.com, or zhouj18@mails.tsinghua.edu.cn. The goal of this leaderboard is to collect research works under the evaluation framework and to measure the true progress of the field, so we encourage you to attach a link to reproducible source code. Thank you!

| # | Method | URL | Base Model | BoolQ (Acc) | RTE (Acc) | WiC (Acc) | CB (Acc/F1) | MultiRC (F1a/EM) | WSC (Acc) | COPA (Acc) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 🌟 | Combined¹²³ | (code) (desc) | DeBERTa (xxlarge) | 84.00 ±0.55 | 85.70 ±0.63 | 69.60 ±2.15 | 95.10/93.60 ±2.68/±2.62 | 81.50/48.00 ±0.76/±0.99 | 88.40 ±2.82 | 93.80 ±2.99 | 85.44 |
| 2 | iPET (c)¹²³ | (code) (desc) | DeBERTa (xxlarge) | 83.45 ±0.90 | 83.12 ±1.04 | 69.63 ±2.15 | 91.52/90.72 ±3.05/±2.68 | 79.92/44.96 ±1.11/±3.13 | 86.30 ±1.64 | 93.75 ±2.99 | 81.40 |
| 3 | noisy (c)¹²³ | (code) (desc) | DeBERTa (xxlarge) | 82.19 ±0.65 | 81.95 ±0.51 | 68.26 ±1.12 | 90.18/86.74 ±2.31/±3.00 | 79.48/44.20 ±2.53/±4.14 | 83.41 ±4.18 | 93.75 ±3.30 | 79.98 |
| 4 | noisy (s)¹²³ | (code) (desc) | DeBERTa (xxlarge) | 81.60 ±1.54 | 81.95 ±2.01 | 65.97 ±2.44 | 91.67/89.17 ±2.33/±2.95 | 79.85/45.10 ±1.22/±2.58 | 84.46 ±2.49 | 90.67 ±2.53 | 79.65 |
| 5 | P-Tuning³ | (code) (desc) | DeBERTa (xxlarge) | 82.25 ±0.85 | 82.22 ±1.23 | 66.22 ±1.18 | 94.20/91.76 ±2.25/±3.30 | 78.45/43.78 ±1.46/±3.93 | 85.10 ±4.87 | 86.50 ±3.70 | 79.48 |
| 6 | ADAPET³ | (code) (desc) | DeBERTa (xxlarge) | 81.28 ±1.26 | 82.58 ±2.44 | 66.50 ±2.11 | 89.73/86.63 ±6.08/±7.29 | 77.88/43.05 ±2.55/±3.60 | 85.34 ±2.13 | 88.75 ±4.43 | 79.01 |
| 7 | PET³ | (code) (desc) | DeBERTa (xxlarge) | 82.67 ±0.78 | 79.42 ±2.41 | 67.20 ±1.34 | 91.96/88.63 ±3.72/±4.91 | 78.20/42.42 ±1.86/±3.04 | 84.13 ±4.87 | 89.00 ±2.94 | 79.00 |
| 8 | iPET (s)¹²³ | (code) (desc) | DeBERTa (xxlarge) | 81.27 ±1.61 | 81.11 ±1.89 | 64.75 ±4.27 | 89.88/87.70 ±5.01/±6.52 | 79.99/45.23 ±1.94/±2.19 | 82.93 ±3.76 | 90.83 ±2.79 | 78.90 |
| 9 | PET+MLM¹³ | (code) (desc) | DeBERTa (xxlarge) | 82.80 ±0.97 | 83.30 ±2.40 | 58.23 ±4.98 | 90.18/87.18 ±3.09/±6.17 | 77.05/40.63 ±1.80/±1.64 | 81.73 ±5.77 | 85.75 ±3.40 | 77.05 |
| 11 | ADAPET³ | (code) (desc) | ALBERT (xxlarge) | 79.24 ±1.42 | 74.28 ±3.57 | 58.07 ±2.96 | 92.86/89.99 ±1.46/±3.91 | 77.24/37.17 ±1.99/±2.64 | 78.85 ±4.51 | 81.75 ±3.95 | 74.40 |
| 12 | Noisy (c)¹²³ | (code) (desc) | ALBERT (xxlarge) | 75.64 ±1.82 | 75.27 ±1.97 | 56.43 ±2.67 | 84.82/77.79 ±4.49/±8.46 | 77.11/38.25 ±1.49/±0.92 | 80.53 ±7.17 | 83.00 ±4.76 | 72.56 |
| 13 | iPET (c)¹²³ | (code) (desc) | ALBERT (xxlarge) | 76.83 ±1.39 | 74.28 ±4.31 | 58.35 ±2.42 | 83.48/73.86 ±2.68/±2.48 | 75.71/37.30 ±2.14/±2.71 | 76.44 ±2.78 | 83.25 ±4.19 | 72.05 |
| 14 | P-tuning³ | (code) (desc) | ALBERT (xxlarge) | 76.55 ±2.68 | 63.27 ±3.63 | 55.49 ±1.21 | 88.39/84.24 ±3.72/±5.15 | 75.91/38.01 ±1.74/±0.78 | 78.85 ±1.76 | 85.25 ±3.30 | 71.81 |
| 15 | iPET (s)¹²³ | (code) (desc) | ALBERT (xxlarge) | 74.29 ±4.10 | 72.35 ±3.71 | 54.78 ±3.93 | 84.67/76.92 ±3.18/±5.44 | 76.33/37.72 ±1.18/±2.58 | 77.80 ±2.79 | 84.00 ±6.02 | 71.58 |
| 16 | Noisy (s)¹²³ | (code) (desc) | ALBERT (xxlarge) | 76.11 ±2.16 | 72.62 ±2.80 | 54.11 ±1.98 | 84.38/72.57 ±5.60/±11.84 | 76.59/37.00 ±1.40/±2.34 | 79.17 ±3.31 | 83.50 ±3.34 | 71.54 |
| 17 | PET+MLM¹³ | (code) (desc) | ALBERT (xxlarge) | 76.83 ±1.18 | 71.48 ±1.64 | 52.39 ±1.44 | 83.93/67.37 ±5.05/±8.31 | 75.15/35.68 ±0.34/±1.10 | 81.97 ±1.82 | 85.75 ±3.40 | 71.36 |
| 18 | PET³ | (code) (desc) | ALBERT (xxlarge) | 76.70 ±1.85 | 72.83 ±1.30 | 53.87 ±4.47 | 84.38/62.56 ±4.47/±7.66 | 76.51/36.46 ±1.52/±2.13 | 80.05 ±2.53 | 81.75 ±4.03 | 70.74 |
| 19 | CLS³ | (code) (desc) | DeBERTa (xxlarge) | 59.49 ±1.74 | 49.55 ±2.23 | 54.08 ±2.15 | 68.30/60.10 ±3.96/±10.14 | 75.42/34.23 ±2.39/±5.02 | 53.13 ±5.17 | 85.25 ±2.22 | 60.07 |
| 20 | CLS³ | (code) (desc) | ALBERT (xxlarge) | 55.01 ±2.95 | 53.97 ±5.49 | 50.82 ±3.02 | 67.97/52.18 ±18.29/±10.30 | 59.95/18.86 ±10.69/±9.80 | 52.64 ±10.25 | 64.25 ±9.36 | 53.74 |

Notes:
1. Unlabeled data are used.
2. The ensemble technique is used.
3. Using the data setting with 64 training examples. (Results under the 32-labeled-sample setting will be released soon.)