llama.cpp/examples/perplexity
Kawrakow 682986a08e
Add Winogrande evaluation (#5015)
* winogrande: simple implementation

It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.

* winogrande: somewhat better

Score for Mistrali7-B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.

* winogrande: improving

Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 13:46:27 +02:00
..
CMakeLists.txt build : link against build info instead of compiling against it (#3879) 2023-11-02 08:50:16 +02:00
perplexity.cpp Add Winogrande evaluation (#5015) 2024-01-18 13:46:27 +02:00
README.md readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340) 2023-09-27 18:30:36 +03:00

perplexity

TODO

Llama 2 70B Scorechart

Quantization Model size (GiB) Perplexity Delta to fp16
Q4_0 36.20 3.5550 3.61%
Q4_1 40.20 3.5125 2.37%
Q5_0 44.20 3.4744 1.26%
Q2_K 27.27 3.7339 8.82%
Q3_K_S 27.86 3.7019 7.89%
Q3_K_M 30.83 3.5932 4.72%
Q3_K_L 33.67 3.5617 3.80%
Q4_K_S 36.39 3.4852 1.57%
Q4_K_M 38.54 3.4725 1.20%
Q5_K_S 44.20 3.4483 0.50%
Q5_K_M 45.41 3.4451 0.40%
Q6_K 52.70 3.4367 0.16%
fp16 128.5 3.4313 -