It’s really hard to get rid of things caused by systematic bias in the training data.
After inhaling the entire internet, LLMs started being trained on publicly available books.
And due to copyright, those were older ones from a time when em-dashes were used more.
The training results were tested by humans, who needed to be cheap but also native English speakers.
So they used workers in English-speaking African countries, where the English taught in school is also more traditional, with a focus on older literature. As a result, answers that echoed the older literature were rated higher by the testers.
“Due to copyright”? Did they not all illegally download every book they could, copyrighted or not, to train their LLMs?