
Testing Llama 3 8B for best prompt and best STAR Bullets #33

@chinaexpert1

Description


Overview

Testing various LLMs to find the right prompt and the right LLM to use for the project. The LLM must first create the best prompt for use with itself, based on the winning Claude prompt linked below. That new prompt should then be given to the same LLM along with the issue comments for issue #1407 and for #1445 (two separate tests), and the resulting STAR bullets should be collected.
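The two-step workflow above could be sketched roughly as below. This is a hypothetical illustration, not project code: the function names, prompt wording, and placeholders are all assumptions, and the actual API call to the model is left out.

```python
# Hypothetical sketch of the two-step test described above.
# Names and prompt wording are illustrative, not from the project.

def build_meta_prompt(claude_prompt: str) -> str:
    """Step 1: ask the target LLM to rewrite the winning Claude prompt for itself."""
    return (
        "The following prompt was tuned for Claude. Rewrite it so it works best "
        "with you, preserving the goal of producing STAR "
        "(Situation, Task, Action, Result) bullets:\n\n" + claude_prompt
    )

def build_star_request(model_prompt: str, issue_comments: str) -> str:
    """Step 2: combine the model-generated prompt with one issue's comments."""
    return model_prompt + "\n\n--- ISSUE COMMENTS ---\n" + issue_comments

# One request per test case, e.g. issues #1407 and #1445:
meta = build_meta_prompt("...winning Claude prompt text...")
request_1407 = build_star_request("...model-generated prompt...", "...#1407 comments...")
```

Each built string would then be sent to the model under test, and the returned STAR bullets recorded per prompt for the report.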

Action Items

(Both of these should be docx reports saved here.)

  • Prompt research showing the winning prompt (ask the LLM for one based on the Claude prompt linked below)
  • The STAR bullets it created, divided by which prompt produced them

Resources/Instructions

... Here is the second set of prompt experiments for website issues #1407 and #1445, covering all current Claude models

  • you can find a txt file for both issue comments to test and the winning Claude prompt here
LLMs comparable to Claude Sonnet 4.5 in text understanding and summarization, with reasonably priced APIs

Here are seven leading LLMs (or model families) that are broadly comparable to Claude Sonnet 4.5 in text understanding, summarization, and reasoning, and that offer API access at “reasonable” pricing.
I’ll include a summary of each model’s strengths plus public API pricing (or pricing indications) so you can compare. If you want a longer “top 20” list, I can pull that too.


What we’re comparing against

Claude Sonnet 4.5 is described by Anthropic as their most capable model to date: improved domain-specific reasoning, long-horizon tasks, and tool & memory support. ([Anthropic]1) Its price (according to one source) remains ~$3 per 1M input tokens and ~$15 per 1M output tokens, which serves as our comparison baseline. ([The New Stack]2)
So we’ll look for models with comparable reasoning/summarization strength and API pricing that isn’t dramatically higher (or is at least transparent and usable for production).


Comparable models

  1. **GPT‑4 Turbo / GPT‑4o (by OpenAI)**

    • Capabilities: Very strong general-purpose text understanding, summarization, reasoning.
    • Pricing: According to sources, GPT-4 Turbo ran ~$10 per 1M input tokens and ~$30 per 1M output tokens. ([Medium]3)
      Another official page shows “Input: $0.250 /1M tokens; Output: $2.000 /1M tokens”, though that figure may refer to a different model tier, so its reliability is questionable. ([OpenAI]4)
    • Summary: Very capable, widely used, API access is mature. Good candidate.
  2. **GPT‑4o Mini (OpenAI)**

    • Capabilities: Slightly lower tier than flagship GPT-4o but still strong for many summarization & understanding tasks.
    • Pricing: One source lists ~$0.15 per 1M input tokens and ~$0.60 per 1M output tokens for this smaller tier. ([Nebuly]5)
    • Summary: Good cost-effective option if you don’t need absolutely top-tier reasoning but still want strong results.
  3. **Gemini 1.5 Pro (by Google DeepMind / Google)**

    • Capabilities: General-purpose, strong reasoning + summarization; supports long context windows. ([Google Developers Blog]6)
    • Pricing: According to a recent breakdown: ≤128K token prompts: Input ~$1.25 per M tokens, Output ~$5 per M. >128K tokens: Input ~$2.50/M, Output ~$10/M. ([CloudZero]7)
    • Summary: Very competitive price and modern capability. Good alternative.
  4. **Mistral Large (Latest) (by Mistral AI)**

    • Capabilities: High-performance open-model family; strong for summarization & reasoning according to benchmarks. ([custom.typingmind.com]8)
    • Pricing: A pricing calculator shows for Mistral Large: ~$0.002 input per 1K tokens, ~$0.006 output per 1K tokens (i.e. ~$2 per 1M input and ~$6 per 1M output) in a sample estimate. ([custom.typingmind.com]8)
    • Summary: Very cost-efficient, good for high-volume workloads where you still need strong understanding/summarization.
  5. **Llama 3 8B (by Meta AI)**

    • Capabilities: While somewhat smaller in scale, Llama models have improved a lot, with decent summarization/understanding performance. ([TechRadar]9)
    • Pricing: Via Amazon Bedrock reference: Llama 3 8B pricing ~ $0.0004 per 1,000 input tokens and ~$0.0006 per 1,000 output tokens (so ~$0.4 per M / $0.6 per M) for that variant. ([Vantage]10)
    • Summary: Very cost-effective, good if you can sacrifice a little performance for major savings.
  6. **Claude Opus 4.1 (by Anthropic)**

    • Capabilities: One tier above Sonnet 4.5 (or at least comparable) for some tasks (especially coding/agents) according to Anthropic. ([About Amazon]11)
    • Pricing: Sonnet 4.5, by comparison, is ~$3 input / ~$15 output per million tokens; Opus is likely in a similar ballpark or higher. ([The New Stack]2)
    • Summary: If you are comfortable staying within the same ecosystem (Anthropic), this is a viable option.
  7. **Claude Haiku 4.5 (by Anthropic)**

    • Capabilities: Described as a “lighter” model that still reaches near-Sonnet performance in some cases, optimized for scale and cost. ([Reuters]12)
    • Pricing: Not always publicly detailed, but indicated as much cheaper (one article puts it at one-third the cost of Sonnet). ([Reuters]12)
    • Summary: Good budget alternative from the same vendor, if you’re comfortable with slightly reduced capability.
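The per-million-token prices quoted above can be turned into a quick per-job cost comparison. This is a rough sketch only: the figures are the approximate numbers cited in the list (single sources, single tiers), not authoritative pricing.

```python
# Quick cost comparison using the approximate per-1M-token prices quoted above:
# (input_usd, output_usd). Figures are illustrative, from the sources cited in the list.
PRICES = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-4 Turbo":       (10.00, 30.00),
    "GPT-4o Mini":       (0.15, 0.60),
    "Gemini 1.5 Pro":    (1.25, 5.00),   # <=128K-token prompt tier
    "Mistral Large":     (2.00, 6.00),
    "Llama 3 8B":        (0.40, 0.60),   # Amazon Bedrock figures
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost for one job at the quoted per-1M-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Example: a summarization job reading 2M tokens and emitting 100K tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 100_000):.2f}")
```

At that volume the gap is stark: Llama 3 8B comes in well under a dollar while GPT-4 Turbo costs over twenty times as much, which is the "high-volume" trade-off discussed in the notes below.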

Notes & Considerations

  • “Reasonable price” is relative: for many enterprise uses, ~$3–$15 per million tokens is acceptable; for high-volume summarization you’ll want models under ~$1 per million.
  • Long-context window capability (for summarizing long documents) is important: some of these models (e.g., Gemini 1.5 Pro, Mistral Large) advertise large context windows.
  • API reliability, support, latency, region support, and data-privacy/enterprise controls matter — not just token price.
  • Benchmark performance does vary across domains (summarization, coding, reasoning) — you’ll want to test your specific use case.
  • Some models have “tiered” pricing (e.g., cheaper for under a certain context size). For example Gemini’s pricing changes >128K tokens. ([CloudZero]7)
  • Some open-model families (like Llama, Mistral) may require more work for production (hosting, tuning) compared with fully managed API services.

If you like, I can rank these models in terms of “best for summarization/understanding under $X per million tokens (e.g., <$10)” and include additional ones (e.g., specialist summarization-oriented models) with approximate pricing. Would that be useful?
