HW3: Building and Evaluating Human-AI Interaction

Name: [Your Name]

SUNet ID: [Your ID]

CoGym (25 points)

Question 1.1 (15 points)

Produce 3 travel plans with cogym! Please be creative and create travel plans that are actually interesting to you or have interesting constraints (budget, activities, multi-city, etc.)

Travel Plan One

[PASTE YOUR FIRST COGYM TRAVEL PLAN HERE]

Travel Plan Two

[PASTE YOUR SECOND COGYM TRAVEL PLAN HERE]

Travel Plan Three

[PASTE YOUR THIRD COGYM TRAVEL PLAN HERE]

Question 1.2 (5 points)

Evaluate these travel plans. Which one is the best? Which was the worst? Score each on quality from (1-5) using your own criteria. (We are looking for a score of 1-5 for each travel plan).

Travel Plan One Score: X/5

Travel Plan Two Score: X/5

Travel Plan Three Score: X/5

Question 1.3 (5 points)

Explain what criteria you were using when doing this evaluation. What mattered? (2-3 sentences).

[WRITE YOUR 2-3 sentence explanation of what evaluation criteria mattered to you here.]

Building Automatic Evaluators to Approximate Human Judgement (50 points + 10 extra credit)

Question 2.1 (10 points)

Copy and paste your prompt and Pearson correlation here.

[PASTE YOUR PROMPT HERE]

Pearson correlation: X.XXX ± X.XXX

Question 2.2 (10 points)

Paste your implmentation of generate_criteria here.

# Generate criteria from non-descriptive human feedback
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   human_criteria: a string of the human criteria for the task
#   seed: an integer for the random seed
# Outputs:
#   A list of strings of the criteria.
def generate_criteria(train_df, human_criteria="Unknown", seed=42) -> list[str]:
    # TODO: PASTE YOUR IMPLEMENTATION HERE

Paste the criteria you generated for each dataset here.

[CoGym Evaluation Criteria]

[HelpSteer2 Evaluation Criteria]

[SimpEval Evaluation Criteria]

Question 2.3 (10 points)

Paste your implementation of regress_criteria here.

# Run the LLM judges on the training dataset and return the regression coefficients and intercept
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   criteria_list: a list of strings of the LLM as a Judge criteria to regress on
# Outputs:
#   A tuple of the regression coefficients (list[float]) and intercept (float)
def regress_criteria(train_df, criteria_list) -> Tuple[List[float], float]:
    # TODO: PASTE YOUR IMPLEMENTATION HERE

And your Pearson Correlation on CoGym

Pearson correlation: X.XXX ± X.XXX

Question 2.4 (20 points + 10 extra credit for top 3 submissions)

Paste your implementation of generate_evaluator here.

# Given a train_df and human_criteria, generate an automatic evaluator that can be used to score a model's output
# Inputs:
#   train_df: a dataframe of training examples with input, output, and score columns
#   human_criteria: a string of the human criteria for the task (e.g. "travel plan quality", "helpfulness", "simplification quality")
# Outputs:
#   A callable function that takes an input and output and returns a score
def generate_evaluator(train_df, human_criteria="Unknown") -> Callable[[str, str], float]:
    # TODO: Replace this implementation with your code here!  This implements the generation + regression method from above

Paste your results on CoGym, Helpsteer2, and SimpEval below.

[Cogym] Pearson correlation:  X.XXX ± X.XXX
[HelpSteer2] Pearson correlation:  X.XXX ± X.XXX
[SimpEval] Pearson correlation:  X.XXX ± X.XXX
Average Pearson correlation:  X.XXX

Finally please write up a 1-2 paragraph explanation of your approach. Be sure to cite any papers that introduce similar methods if you took inspiration from the literature.

[WRITE YOUR 1-2 PARAGRAPH EXPLANATION OF YOUR APPROACH HERE]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HW3: Building and Evaluating Human-AI Interaction

CoGym (25 points)

Question 1.1 (15 points)

Travel Plan One

Travel Plan Two

Travel Plan Three

Question 1.2 (5 points)

Question 1.3 (5 points)

Building Automatic Evaluators to Approximate Human Judgement (50 points + 10 extra credit)

Question 2.1 (10 points)

Question 2.2 (10 points)

Question 2.3 (10 points)

Question 2.4 (20 points + 10 extra credit for top 3 submissions)

FilesExpand file tree

writeup.md

Latest commit

History

writeup.md

File metadata and controls

HW3: Building and Evaluating Human-AI Interaction

CoGym (25 points)

Question 1.1 (15 points)

Travel Plan One

Travel Plan Two

Travel Plan Three

Question 1.2 (5 points)

Question 1.3 (5 points)

Building Automatic Evaluators to Approximate Human Judgement (50 points + 10 extra credit)

Question 2.1 (10 points)

Question 2.2 (10 points)

Question 2.3 (10 points)

Question 2.4 (20 points + 10 extra credit for top 3 submissions)