Исследования

Открытая таблица лидеров LLM

Источник https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard 

With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.

We evaluate models on 4 key benchmarks from the Eleuther AI Language Model Evaluation Harness , a unified framework to test generative language models on a large number of different evaluation tasks.

  • AI2 Reasoning Challenge (25-shot) – a set of grade-school science questions.
  • HellaSwag (10-shot) – a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  • MMLU (5-shot) – a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
  • TruthfulQA (0-shot) – a test to measure a model’s propensity to reproduce falsehoods commonly found online.

We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.

Model

Average 

ARC (25-s) 

HellaSwag (10-s) 

MMLU (5-s) 

TruthfulQA (MC) (0-s) 

63.2
61.6
84.4
54.1
52.5
62.2
60.2
84.6
52.7
51.3
60.7
57.6
81.7
45.2
58.3
60.4
61.9
85.3
52.7
41.7
60
58.2
83.5
48.5
50
59.8
58.5
82.9
44.3
53.6
58.3
57.8
84.2
48.8
42.3
58.3
57.8
84.2
48.8
42.3
58.3
57.8
84.2
48.8
42.3
57.9
56.7
81.4
43.6
49.7
57.6
55
80.8
44
50.4
57.4
57.1
82.6
46.1
43.8
57.2
56.1
79.8
44
49.1
57
57.8
80.8
50.8
38.8
57
53.6
79.6
42.7
52
56.9
57.1
82.6
45.7
42.3
56.9
57.1
82.6
45.7
42.3
56.9
57.1
82.6
45.7
42.3
56.8
54.4
79.7
41.8
51.3
56.4
52.5
80
41.8
51.1
56.4
54.9
81.3
43.8
45.8
56.2
52
80.6
44.7
47.3
55.7
52.5
78.6
41
50.6
55.6
55.7
80.2
42.1
44.5
55.6
52.3
79.1
40.1
51.1
55.2
51.6
77.9
39.9
51.5
55.2
53.1
79.4
39.7
48.6
54.6
53.2
79.5
41.7
43.9
54.4
50.2
77
40.4
49.8
54
50.7
77.1
39.3
49
53.8
50.3
77.1
39.4
48.3
53.8
51.9
79.4
36.4
47.6
53.7
52.4
79.4
38.4
44.7
53.7
47.4
78
39.6
49.8
53.7
47.4
78
39.6
49.8
53.6
47.8
77.7
39.1
49.7
53.4
53.3
79.8
39.2
41.4
53.2
50.8
76.6
38.3
46.9
53.1
45.1
77.9
38.1
51.3
52.7
49.8
79.4
38.9
42.8
52.6
48
78.6
37.2
46.8
52.4
48.1
76.4
38.8
46.5
52.3
48.9
77.3
45.1
37.9
52.2
47
75.2
37.5
48.9
51.9
50.4
79
37.8
40.5
51.8
50.8
78.9
37.7
39.9
51.7
51.9
77.6
37.6
39.6
51.2
46.8
66.4
50.4
41.3
51.2
47.4
74.7
37.5
45.4
51.2
49.1
76.4
37.9
41.3
50.8
48
75.6
36.3
43.3
50.7
45.3
75.5
36.5
45.5
50.6
48.8
76.6
35.9
41.2
50.1
44.7
73.4
36.9
45.4
49.7
48
77.1
36.1
37.7
49.6
48.9
76.1
36.3
37.2
49.6
45.9
75.3
34.6
42.8
49.1
46.8
76
35.1
38.6
48.8
47.9
78.1
35
34.3
48.6
44.5
73.3
35.6
40.8
48.6
47.7
77.7
35.6
33.4
48.4
45.5
75.2
34.4
38.7
48.4
45.9
70.8
32.8
44.1
47.8
48.2
74.6
32.4
35.8
47.6
46.6
75.6
34.2
34.1
47.6
46.7
76.2
32.3
35.3
47.2
44.2
70.4
33
41.1
47.2
46.9
72.7
31.9
37.2
47.1
45.5
71.5
33.1
38.2
46.4
46.8
71.9
32.8
34
46.2
41.2
64.5
33.3
45.6
45.9
45.2
73.4
33.3
31.7
45.7
44.4
71.3
34
33.2
45.6
45.6
68.5
30.6
37.8
45.4
43.2
69.7
30.7
38
45.2
42.6
70.5
31.5
36.1
44.9
41.2
72.3
31.7
34.3
44.6
42.6
68.8
31.6
35.5
44.4
37.9
63.3
33.7
42.6
44.4
43.7
69.3
30.2
34.5
44.3
41.4
67.6
32.3
36
44.2
41.7
68.1
32.7
34.4
44
40.5
71.3
30.4
34
44
38.8
64.8
32
40.4
43.8
40.2
70.7
30.1
34.4
43.2
39.8
65
30.5
37.6
42.9
39.9
63.8
31.2
36.7
42.6
40.5
67.5
30.1
32.5
42.2
40.2
64.7
30.6
33.2
42.1
39.8
65.2
29.7
33.7
42
42.6
49.3
34.1
42.1
41.1
35.2
57.6
30.8
40.7
40.8
38.4
57.6
27.2
40
40.4
36.9
57.5
29.3
38
39.8
31.7
49.4
34.4
43.7
39.2
33.6
51.2
28.9
43.3
38.9
34.3
59.9
27.6
33.9
38.9
33.8
59.1
28
34.6
38.3
31.9
53.6
27.4
40.2
38.3
32.1
55.5
28
37.6
38
32.6
54
26.4
38.8
37.8
30.9
52.7
27.5
40.1
37.7
29.6
54.6
27.7
38.7
37.2
32.8
46
26.6
43.2
36.8
30.3
51.4
26.9
38.5
35.9
30
47.7
25.9
40
35
28
42.3
27.7
41.8
34.6
27
43.8
28.3
39.4
34
25.9
45.6
25.6
38.7
33.8
27.2
40.2
27
40.7
33.4
26.1
38.5
26.2
42.7
33.2
25.5
37.6
26.6
43
32.8
27.1
39.1
27.8
37.1
32.6
27.7
25.2
26.1
51.4
32.4
28.9
26.1
25.4
49.3
32.3
29.5
26.2
24.9
48.6
32.3
29.1
26.1
25.4
48.6
32.3
27.6
35.6
26.3
39.7
32.3
29.1
26.1
25.4
48.6
32.3
29.5
26.2
24.9
48.6
32.2
23.6
36.7
27.3
41
32.2
28.9
26.1
25.5
48.5
32
26.3
25.7
24.8
51.1
32
22.6
27.2
27.1
51.2
31.6
24.7
25
25.6
51.3
31.6
24.3
26.2
26.5
49.5
31.6
24.7
30.2
28.9
42.8
31.2
23.4
30.4
28
42.8
31.2
23.5
25.8
25.2
50.3
31.2
23.1
31.5
27.4
42.9
31.2
26.4
25.8
25
47.4
31.2
22.6
32.8
26.1
43.4
31
21.7
30.9
27.5
44
30.9
21.3
27.6
27.5
47.3
30.8
22.4
26.2
25.3
49.1
30.4
21.9
31.6
27.5
40.7
30.3
22.7
25.8
24.7
48.1
30.2
22.2
27.5
26.8
44.5
29.9
20
26.7
26.7
46.3
29.9
23.3
25.9
24.4
46
29.8
22.7
31.1
27.3
38
29.2
24.6
30.9
26.3
34.8

Baseline

25
25
25
25
25
admin
Author: admin

Hi, I’m admin