---
name: ssr
description: Semantic Similarity Rating - elicit realistic Likert-scale responses from LLMs using textual elicitation and embedding similarity mapping. Use when you need survey-like responses, purchase intent ratings, relevance scores, or any Likert-scale measurement that should match human response distributions.
---
# Semantic Similarity Rating (SSR)
SSR is a method for eliciting realistic Likert-scale responses from LLMs. Instead of asking for direct numerical ratings (which produce unrealistic, narrow distributions), SSR:
- Elicits free-text responses about the subject
- Maps those responses to Likert scale distributions using embedding similarity
This achieves ~90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85).
## When to Use SSR
- Consumer research / purchase intent surveys
- Product concept evaluation
- Relevance or satisfaction ratings
- Any Likert-scale measurement where you need realistic distributions
- When you need qualitative feedback alongside quantitative scores
## The Problem with Direct Likert Rating
When LLMs are asked directly for Likert ratings (1-5), they:
- Regress to "safe" middle values (mostly 3s)
- Produce unrealistically narrow distributions
- Rarely use extreme values (1 or 5)
- Lose the nuance of their actual assessment
## SSR Method

### Step 1: Create Synthetic Consumer Persona

Prompt the LLM to impersonate a consumer with specific demographic attributes:

```
You are participating in a consumer research survey. You are a [age]-year-old [gender] living in [region] with [income level description].

You will be shown a product concept and asked about your purchase intent. Respond naturally and briefly as this person would.
```
Key demographics to include:
- Age (influences purchase intent significantly)
- Gender
- Income level (strongly influences purchase intent)
- Region/location
- Ethnicity (optional, less consistent influence)
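A minimal sketch of a prompt builder for this step (the function name and demographic wording are illustrative, not prescribed by the method):

```python
def create_persona_prompt(age, gender, region, income):
    """Fill the Step 1 persona template with demographic attributes."""
    return (
        f"You are participating in a consumer research survey. "
        f"You are a {age}-year-old {gender} living in {region} with {income}.\n"
        "You will be shown a product concept and asked about your purchase "
        "intent. Respond naturally and briefly as this person would."
    )

prompt = create_persona_prompt(34, "woman", "the Midwest", "a middle-class income")
```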
### Step 2: Present Stimulus and Elicit Free-Text Response

Show the product concept (image or text) and ask:

```
How likely would you be to purchase this product?
Reply briefly to any questions posed to you.
```
Do NOT constrain the response to a number. Let the LLM respond naturally, e.g.:
- "I'm somewhat interested. If it works well and isn't too expensive, I might give it a try."
- "Seems kinda bougie for this kind of product. I'll stick with what I know."
- "The ease of use and safety are appealing, but I'd want to know more about effectiveness."
### Step 3: Map Response to Likert Distribution via Embedding Similarity

#### Reference Statement Sets
Create anchor statements for each Likert value. Use multiple sets (recommended: 6) and average results:
**Set 1 - Direct likelihood:**

```
1: "It's very unlikely I'd buy it."
2: "It's rather unlikely I'd buy it."
3: "I'm not sure if I'd buy it."
4: "It's rather likely I'd buy it."
5: "It's very likely I'd buy it."
```

**Set 2 - Intent phrasing:**

```
1: "I definitely would not purchase this."
2: "I probably would not purchase this."
3: "I might or might not purchase this."
4: "I probably would purchase this."
5: "I definitely would purchase this."
```

**Set 3 - Interest-based:**

```
1: "I have no interest in buying this."
2: "I have little interest in buying this."
3: "I have some interest in buying this."
4: "I have considerable interest in buying this."
5: "I have strong interest in buying this."
```

**Set 4 - Casual phrasing:**

```
1: "No way I'd buy this."
2: "Probably wouldn't buy this."
3: "Maybe I'd buy this, maybe not."
4: "Yeah, I'd probably buy this."
5: "For sure I'd buy this."
```

**Set 5 - Conditional phrasing:**

```
1: "I wouldn't buy this under any circumstances."
2: "I'd need a lot of convincing to buy this."
3: "I could see myself buying this in the right situation."
4: "I'd likely buy this if I saw it in stores."
5: "I'd definitely buy this as soon as it's available."
```

**Set 6 - Recommendation framing:**

```
1: "I would actively avoid this product."
2: "I wouldn't recommend this product."
3: "This product seems okay."
4: "I would consider recommending this product."
5: "I would enthusiastically recommend this product."
```
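For implementation, the anchor sets can be stored as plain data. A sketch showing only the first two sets (the remaining four follow the same five-statement shape):

```python
# Anchor statements per Likert value (index 0 corresponds to rating 1).
# Only Sets 1 and 2 are spelled out here; the other four sets listed
# above follow the same five-statement pattern.
REFERENCE_SETS = [
    [  # Set 1 - direct likelihood
        "It's very unlikely I'd buy it.",
        "It's rather unlikely I'd buy it.",
        "I'm not sure if I'd buy it.",
        "It's rather likely I'd buy it.",
        "It's very likely I'd buy it.",
    ],
    [  # Set 2 - intent phrasing
        "I definitely would not purchase this.",
        "I probably would not purchase this.",
        "I might or might not purchase this.",
        "I probably would purchase this.",
        "I definitely would purchase this.",
    ],
]
```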
#### Compute Similarity Scores

1. Get embedding vectors for:
   - the synthetic response: `v_response`
   - each reference statement: `v_ref[1..5]`

2. Compute the cosine similarity for each reference:

   ```
   similarity[r] = (v_response · v_ref[r]) / (|v_response| × |v_ref[r]|)
   ```

3. Convert to a probability distribution:

   ```
   # Subtract the minimum to create contrast
   min_sim = min(similarity[1..5])
   adjusted[r] = similarity[r] - min_sim

   # Normalize to a probability distribution
   p[r] = adjusted[r] / sum(adjusted[1..5])
   ```

4. Average across all reference sets for the final distribution.
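The min-subtract-and-normalize mapping can be sketched in plain Python (the similarity scores below are illustrative, not real embedding outputs):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarities_to_distribution(similarities):
    """Subtract the minimum for contrast, then normalize to sum to 1."""
    min_sim = min(similarities)
    adjusted = [s - min_sim for s in similarities]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Illustrative similarity scores against anchors 1..5
p = similarities_to_distribution([0.62, 0.70, 0.81, 0.78, 0.66])
```

Note that the least-similar anchor always maps to probability 0 under this scheme; this is what spreads the distribution away from uniform.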
#### Embedding Model

Use OpenAI's `text-embedding-3-small` (or `text-embedding-3-large` for a marginal improvement).
### Step 4: Aggregate Results
For a synthetic survey panel:
- Generate multiple synthetic consumers with varied demographics
- Collect response distributions from each
- Aggregate into survey-level distributions
- Calculate mean purchase intent: `PI = sum(r × p[r]) for r in 1..5`
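The mean purchase intent is just the expectation of the rating distribution:

```python
def mean_purchase_intent(p):
    """Expected Likert value: PI = sum(r * p[r]) for r in 1..5."""
    return sum(r * pr for r, pr in enumerate(p, start=1))

pi = mean_purchase_intent([0.05, 0.10, 0.30, 0.35, 0.20])
```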
## Implementation Notes
### Temperature Settings
- LLM temperature: 0.5 works well (0.5-1.5 range tested)
- Generate 2 samples per consumer and average for stability
### Demographics Matter
Without demographics, LLMs:
- Achieve high distributional similarity (~0.91 KS)
- But poor correlation attainment (~50%)
- They rate everything positively without discriminating
With demographics:
- Better correlation attainment (~90%)
- LLMs properly differentiate between product concepts
- Age and income have strongest influence on response patterns
### Image vs Text Stimulus
- Image stimulus (product concept slides) performs slightly better
- Text-only descriptions work but with mild performance reduction
- For text stimulus, transcribe key information from product concepts
## Success Metrics
### Distributional Similarity (KS Similarity)

```
KS_similarity = 1 - max|F_real(r) - F_synthetic(r)|
```

where `F_real` and `F_synthetic` are the cumulative distribution functions of the real and synthetic rating distributions.

Target: > 0.85
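A minimal sketch of this metric over 5-point distributions, assuming both inputs are probability vectors over ratings 1..5:

```python
def ks_similarity(p_real, p_synth):
    """1 minus the largest gap between the two cumulative distributions."""
    cdf_real, cdf_synth, max_gap = 0.0, 0.0, 0.0
    for pr, ps in zip(p_real, p_synth):
        cdf_real += pr
        cdf_synth += ps
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap
```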
### Correlation Attainment

Compare the synthetic-real correlation to human test-retest reliability:

```
ρ = E[R_xy] / E[R_xx]
```
Where:
- R_xy = correlation between synthetic and real mean purchase intents
- R_xx = correlation between split-half human samples (theoretical maximum)
Target: > 90%
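Both quantities are ordinary Pearson correlations over mean purchase intents, so the ratio can be sketched directly (toy data; in practice `R_xx` comes from split-half human samples):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_attainment(synthetic_pi, real_pi, retest_pi):
    """R_xy / R_xx: synthetic-real correlation over test-retest reliability."""
    return pearson(synthetic_pi, real_pi) / pearson(real_pi, retest_pi)
```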
## Alternative: Follow-up Likert Rating (FLR)

A simpler alternative that performs reasonably well:

1. Elicit a free-text response (same as SSR).
2. Prompt a second LLM instance as a "Likert rating expert".
3. Have it map the text response to a single integer from 1 to 5.
FLR achieves:
- ~85% correlation attainment
- ~0.72 KS similarity (worse than SSR's 0.88)
Use SSR when distribution realism matters; FLR when you only need ranking.
## Qualitative Benefits
SSR's textual responses provide rich qualitative feedback:
**Positive feedback example:** "The ease of use and the promise of no sensitivity are appealing. Plus, it's from a trusted brand."

**Critical feedback examples:**
- "It seems a bit too high-end for my needs and budget."
- "Sounds expensive, and I'm not sure I buy all that 'microbiome' talk."
This qualitative data can inform product development beyond just ratings.
## Limitations

1. **Reference set sensitivity**: Different anchor sets produce slightly different mappings. Average across multiple sets.
2. **Domain dependency**: Works best for domains well represented in LLM training data (consumer products, general topics). May hallucinate for obscure domains.
3. **Demographic fidelity**: Age and income patterns replicate well. Gender and region patterns are less consistent.
4. **Not a replacement**: SSR augments human research; it shouldn't fully replace human panels for final decisions.
## Quick Reference
| Method | Correlation Attainment | KS Similarity |
|---|---|---|
| Direct Likert Rating | ~80% | 0.26-0.39 |
| Follow-up Likert Rating | ~85% | 0.59-0.72 |
| SSR | ~90% | 0.80-0.88 |
| Human test-retest | 100% (by definition) | 1.0 |
## Example Workflow

```python
# Pseudocode for SSR implementation
def ssr_rating(product_concept, demographics):
    # Step 1: Create persona prompt
    persona = create_persona_prompt(demographics)

    # Step 2: Elicit free-text response
    response = llm.generate(
        system=persona,
        user=f"[Product concept: {product_concept}]\n\n"
             "How likely would you be to purchase this product?",
        temperature=0.5,
    )

    # Step 3: Get embedding of the response
    response_embedding = embed(response)

    # Step 4: Compute distribution across all reference sets
    distributions = []
    for ref_set in REFERENCE_SETS:
        ref_embeddings = [embed(stmt) for stmt in ref_set]
        similarities = [cosine_similarity(response_embedding, ref_emb)
                        for ref_emb in ref_embeddings]

        # Subtract the minimum, then normalize to a distribution
        min_sim = min(similarities)
        adjusted = [s - min_sim for s in similarities]
        total = sum(adjusted)
        distributions.append([a / total for a in adjusted])

    # Average across reference sets
    final_distribution = average(distributions)

    return {
        'distribution': final_distribution,
        'mean_pi': sum((r + 1) * p for r, p in enumerate(final_distribution)),
        'qualitative_response': response,
    }
```
## References
Maier, B.F., et al. (2025). "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings." arXiv:2510.08338v2
GitHub implementation: https://github.com/pymc-labs/semantic-similarity-rating