
adding new games to clembench#239

Open
bakuzen wants to merge 7 commits into clp-research:main from bsu-slim:main

Conversation

@bakuzen

@bakuzen bakuzen commented Oct 9, 2025

Adding 5 games:

  1. memory - a game where the LM is given contact info for 50 people (first name, last name, work, email, and note, where the note is a unique attribute such as a hobby) and is then asked about a contact's info based on the note.
  2. memory_narrative - same game as memory, but all information about contacts is given in the pre-prompt.
  3. memory_turns - information about new contacts is given at each dialogue turn. At each turn, one question is also asked about a contact given at a prior turn.
  4. memory_narrative_turns - a question about a contact is asked at each turn, but all information about contacts is given in the pre-prompt.

The four memory games show how well an LM can fish out information about a contact given a complete pre-prompt, but also whether an LM can "remember" information during the dialogue outside of the prompt and game history.
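A minimal sketch of what a single memory instance could look like, to make the setup concrete. This is not the code from the PR: the `Contact` fields follow the description above, but `make_contacts`, `render_prompt`, `ask_by_note`, `score`, the placeholder names, and the substring-based scoring are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    # Fields follow the PR description; the class itself is hypothetical.
    first_name: str
    last_name: str
    work: str
    email: str
    note: str


def make_contacts(n, seed=0):
    """Build n placeholder contacts, each with a unique note."""
    rng = random.Random(seed)
    contacts = []
    for i in range(n):
        first, last = f"First{i}", f"Last{i}"
        contacts.append(
            Contact(
                first_name=first,
                last_name=last,
                work=f"Company{rng.randint(1, 9)}",
                email=f"{first.lower()}.{last.lower()}@example.com",
                note=f"hobby-{i}",  # unique by construction
            )
        )
    return contacts


def render_prompt(contacts):
    """Render all contacts as one block of text for the prompt."""
    return "\n".join(
        f"{c.first_name} {c.last_name}, {c.work}, {c.email}, note: {c.note}"
        for c in contacts
    )


def ask_by_note(contacts, rng):
    """Pick a target contact and ask for its email via the note."""
    target = rng.choice(contacts)
    question = f"What is the email of the contact whose note is '{target.note}'?"
    return question, target.email


def score(answer, expected):
    """Assumed scoring rule: the expected value appears in the answer."""
    return expected.lower() in answer.lower()
```

With 50 contacts, `render_prompt(make_contacts(50))` gives the full contact list for the prompt, and `ask_by_note` returns a question together with the expected answer for scoring.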

  5. simplesnake - a game for evaluating LLMs on 2D text-based spatial reasoning tasks.

@phisad
Collaborator

phisad commented Oct 13, 2025

Hi @bakuzen, do you have baseline results for memory? I cannot imagine that it is not played perfectly by an LLM, since all information is given in the prompt. Is this like a needle-in-a-haystack task, or do the contacts include some controllable level of ambiguity?

@bakuzen
Author

bakuzen commented Oct 13, 2025

Big models have no problem, but once it gets up to, say, 50 contacts, medium and smaller models struggle. This is likely due to limited context window size, but there are limits even if one uses a good-sized model to fish through information like this. As for controllable ambiguity, we try to set it up so that every individual has a unique note, but we don't explicitly check.
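An explicit uniqueness check on the notes could be sketched as follows. It assumes each contact is a dict with a `note` key, which may not match the actual instance format, and `find_duplicate_notes` is a hypothetical helper, not part of clembench.

```python
from collections import Counter


def find_duplicate_notes(contacts):
    """Return the notes shared by more than one contact.

    Notes are compared case-insensitively and with surrounding
    whitespace stripped, so "Chess" and "chess " count as the same note.
    """
    counts = Counter(c["note"].strip().lower() for c in contacts)
    return sorted(note for note, n in counts.items() if n > 1)
```

An empty result means every question that refers to a contact by note has exactly one correct answer.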

@phisad
Collaborator

phisad commented Nov 14, 2025

OK, sorry for the late response, but I only just realized that this is the wrong repository. You should merge into clembench.
