---
name: ssr
description: Semantic Similarity Rating - elicit realistic Likert-scale responses from LLMs using textual elicitation and embedding similarity mapping. Use when you need survey-like responses, purchase intent ratings, relevance scores, or any Likert-scale measurement that should match human response distributions.
---
# Semantic Similarity Rating (SSR)
SSR is a method for eliciting realistic Likert-scale responses from LLMs. Instead of asking for direct numerical ratings (which produce unrealistic, narrow distributions), SSR:
- Elicits free-text responses about the subject
- Maps those responses to Likert scale distributions using embedding similarity
This achieves ~90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85).
## When to Use SSR
- Consumer research / purchase intent surveys
- Product concept evaluation
- Relevance or satisfaction ratings
- Any Likert-scale measurement where you need realistic distributions
- When you need qualitative feedback alongside quantitative scores
## The Problem with Direct Likert Rating
When LLMs are asked directly for Likert ratings (1-5), they:
- Regress to "safe" middle values (mostly 3s)
- Produce unrealistically narrow distributions
- Rarely use extreme values (1 or 5)
- Lose the nuance of their actual assessment
## SSR Method

### Step 1: Create Synthetic Consumer Persona

Prompt the LLM to impersonate a consumer with specific demographic attributes:

```
You are participating in a consumer research survey. You are a [age]-year-old [gender] living in [region] with [income level description].

You will be shown a product concept and asked about your purchase intent. Respond naturally and briefly as this person would.
```
Key demographics to include:
- Age (influences purchase intent significantly)
- Gender
- Income level (strongly influences purchase intent)
- Region/location
- Ethnicity (optional, less consistent influence)
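A minimal sketch of a prompt builder for this step (the function name and demographic wording are illustrative, not prescribed by the method):

```python
def create_persona_prompt(age, gender, region, income):
    """Fill the Step 1 persona template with demographic attributes."""
    return (
        f"You are participating in a consumer research survey. "
        f"You are a {age}-year-old {gender} living in {region} with {income}.\n"
        "You will be shown a product concept and asked about your purchase "
        "intent. Respond naturally and briefly as this person would."
    )

prompt = create_persona_prompt(34, "woman", "the Midwest", "a middle-class income")
```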
### Step 2: Present Stimulus and Elicit Free-Text Response

Show the product concept (image or text) and ask:

```
How likely would you be to purchase this product?
Reply briefly to any questions posed to you.
```
Do NOT constrain the response to a number. Let the LLM respond naturally, e.g.:
- "I'm somewhat interested. If it works well and isn't too expensive, I might give it a try."
- "Seems kinda bougie for this kind of product. I'll stick with what I know."
- "The ease of use and safety are appealing, but I'd want to know more about effectiveness."
### Step 3: Map Response to Likert Distribution via Embedding Similarity

#### Reference Statement Sets
Create anchor statements for each Likert value. Use multiple sets (recommended: 6) and average results:
**Set 1 - Direct likelihood:**

```
1: "It's very unlikely I'd buy it."
2: "It's rather unlikely I'd buy it."
3: "I'm not sure if I'd buy it."
4: "It's rather likely I'd buy it."
5: "It's very likely I'd buy it."
```

**Set 2 - Intent phrasing:**

```
1: "I definitely would not purchase this."
2: "I probably would not purchase this."
3: "I might or might not purchase this."
4: "I probably would purchase this."
5: "I definitely would purchase this."
```

**Set 3 - Interest-based:**

```
1: "I have no interest in buying this."
2: "I have little interest in buying this."
3: "I have some interest in buying this."
4: "I have considerable interest in buying this."
5: "I have strong interest in buying this."
```

**Set 4 - Casual phrasing:**

```
1: "No way I'd buy this."
2: "Probably wouldn't buy this."
3: "Maybe I'd buy this, maybe not."
4: "Yeah, I'd probably buy this."
5: "For sure I'd buy this."
```

**Set 5 - Conditional phrasing:**

```
1: "I wouldn't buy this under any circumstances."
2: "I'd need a lot of convincing to buy this."
3: "I could see myself buying this in the right situation."
4: "I'd likely buy this if I saw it in stores."
5: "I'd definitely buy this as soon as it's available."
```

**Set 6 - Recommendation framing:**

```
1: "I would actively avoid this product."
2: "I wouldn't recommend this product."
3: "This product seems okay."
4: "I would consider recommending this product."
5: "I would enthusiastically recommend this product."
```
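For implementation, the anchor sets can be stored as plain data. A sketch showing only the first two sets (the remaining four follow the same five-statement shape):

```python
# Anchor statements per Likert value (index 0 corresponds to rating 1).
# Only Sets 1 and 2 are spelled out here; the other four sets listed
# above follow the same five-statement pattern.
REFERENCE_SETS = [
    [  # Set 1 - direct likelihood
        "It's very unlikely I'd buy it.",
        "It's rather unlikely I'd buy it.",
        "I'm not sure if I'd buy it.",
        "It's rather likely I'd buy it.",
        "It's very likely I'd buy it.",
    ],
    [  # Set 2 - intent phrasing
        "I definitely would not purchase this.",
        "I probably would not purchase this.",
        "I might or might not purchase this.",
        "I probably would purchase this.",
        "I definitely would purchase this.",
    ],
]
```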
#### Compute Similarity Scores

1. Get embedding vectors for:
   - the synthetic response: `v_response`
   - each reference statement: `v_ref[1..5]`

2. Compute the cosine similarity for each reference:

   ```
   similarity[r] = (v_response · v_ref[r]) / (|v_response| × |v_ref[r]|)
   ```

3. Convert to a probability distribution:

   ```
   # Subtract the minimum to create contrast
   min_sim = min(similarity[1..5])
   adjusted[r] = similarity[r] - min_sim

   # Normalize to a probability distribution
   p[r] = adjusted[r] / sum(adjusted[1..5])
   ```

4. Average across all reference sets for the final distribution.
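The min-subtract-and-normalize mapping can be sketched in plain Python (the similarity scores below are illustrative, not real embedding outputs):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarities_to_distribution(similarities):
    """Subtract the minimum for contrast, then normalize to sum to 1."""
    min_sim = min(similarities)
    adjusted = [s - min_sim for s in similarities]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Illustrative similarity scores against anchors 1..5
p = similarities_to_distribution([0.62, 0.70, 0.81, 0.78, 0.66])
```

Note that the least-similar anchor always maps to probability 0 under this scheme; this is what spreads the distribution away from uniform.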
#### Embedding Model

Use OpenAI's `text-embedding-3-small` (or `text-embedding-3-large` for a marginal improvement).
### Step 4: Aggregate Results
For a synthetic survey panel:
- Generate multiple synthetic consumers with varied demographics
- Collect response distributions from each
- Aggregate into survey-level distributions
- Calculate mean purchase intent: `PI = sum(r × p[r]) for r in 1..5`
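The mean purchase intent is just the expectation of the rating distribution:

```python
def mean_purchase_intent(p):
    """Expected Likert value: PI = sum(r * p[r]) for r in 1..5."""
    return sum(r * pr for r, pr in enumerate(p, start=1))

pi = mean_purchase_intent([0.05, 0.10, 0.30, 0.35, 0.20])
```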
## Implementation Notes
### Temperature Settings
- LLM temperature: 0.5 works well (0.5-1.5 range tested)
- Generate 2 samples per consumer and average for stability
### Demographics Matter
Without demographics, LLMs:
- Achieve high distributional similarity (~0.91 KS)
- But poor correlation attainment (~50%)
- They rate everything positively without discriminating
With demographics:
- Better correlation attainment (~90%)
- LLMs properly differentiate between product concepts
- Age and income have strongest influence on response patterns
### Image vs Text Stimulus
- Image stimulus (product concept slides) performs slightly better
- Text-only descriptions work but with mild performance reduction
- For text stimulus, transcribe key information from product concepts
## Success Metrics
### Distributional Similarity (KS Similarity)

```
KS_similarity = 1 - max|F_real(r) - F_synthetic(r)|
```

where `F_real` and `F_synthetic` are the cumulative distribution functions of the real and synthetic rating distributions.

Target: > 0.85
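A minimal sketch of this metric over 5-point distributions, assuming both inputs are probability vectors over ratings 1..5:

```python
def ks_similarity(p_real, p_synth):
    """1 minus the largest gap between the two cumulative distributions."""
    cdf_real, cdf_synth, max_gap = 0.0, 0.0, 0.0
    for pr, ps in zip(p_real, p_synth):
        cdf_real += pr
        cdf_synth += ps
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap
```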
### Correlation Attainment

Compare the synthetic-real correlation to human test-retest reliability:

```
ρ = E[R_xy] / E[R_xx]
```
Where:
- R_xy = correlation between synthetic and real mean purchase intents
- R_xx = correlation between split-half human samples (theoretical maximum)
Target: > 90%
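Both quantities are ordinary Pearson correlations over mean purchase intents, so the ratio can be sketched directly (toy data; in practice `R_xx` comes from split-half human samples):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_attainment(synthetic_pi, real_pi, retest_pi):
    """R_xy / R_xx: synthetic-real correlation over test-retest reliability."""
    return pearson(synthetic_pi, real_pi) / pearson(real_pi, retest_pi)
```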
## Alternative: Follow-up Likert Rating (FLR)

A simpler alternative that performs reasonably well:

1. Elicit a free-text response (same as SSR).
2. Prompt a second LLM instance as a "Likert rating expert".
3. Have it map the text response to a single integer from 1 to 5.
FLR achieves:
- ~85% correlation attainment
- ~0.72 KS similarity (worse than SSR's 0.88)
Use SSR when distribution realism matters; FLR when you only need ranking.
## Qualitative Benefits
SSR's textual responses provide rich qualitative feedback:
**Positive feedback example:** "The ease of use and the promise of no sensitivity are appealing. Plus, it's from a trusted brand."

**Critical feedback examples:**
- "It seems a bit too high-end for my needs and budget."
- "Sounds expensive, and I'm not sure I buy all that 'microbiome' talk."
This qualitative data can inform product development beyond just ratings.
## Limitations

1. **Reference set sensitivity**: Different anchor sets produce slightly different mappings. Average across multiple sets.
2. **Domain dependency**: Works best for domains well represented in LLM training data (consumer products, general topics). May hallucinate for obscure domains.
3. **Demographic fidelity**: Age and income patterns replicate well. Gender and region patterns are less consistent.
4. **Not a replacement**: SSR augments human research; it shouldn't fully replace human panels for final decisions.
## Quick Reference
| Method | Correlation Attainment | KS Similarity |
|---|---|---|
| Direct Likert Rating | ~80% | 0.26-0.39 |
| Follow-up Likert Rating | ~85% | 0.59-0.72 |
| SSR | ~90% | 0.80-0.88 |
| Human test-retest | 100% (by definition) | 1.0 |
## Example Workflow

```python
# Pseudocode for SSR implementation
def ssr_rating(product_concept, demographics):
    # Step 1: Create persona prompt
    persona = create_persona_prompt(demographics)

    # Step 2: Elicit free-text response
    response = llm.generate(
        system=persona,
        user=f"[Product concept: {product_concept}]\n\n"
             "How likely would you be to purchase this product?",
        temperature=0.5,
    )

    # Step 3: Get embedding of the response
    response_embedding = embed(response)

    # Step 4: Compute distribution across all reference sets
    distributions = []
    for ref_set in REFERENCE_SETS:
        ref_embeddings = [embed(stmt) for stmt in ref_set]
        similarities = [cosine_similarity(response_embedding, ref_emb)
                        for ref_emb in ref_embeddings]

        # Subtract the minimum, then normalize to a distribution
        min_sim = min(similarities)
        adjusted = [s - min_sim for s in similarities]
        total = sum(adjusted)
        distributions.append([a / total for a in adjusted])

    # Average across reference sets
    final_distribution = average(distributions)

    return {
        'distribution': final_distribution,
        'mean_pi': sum((r + 1) * p for r, p in enumerate(final_distribution)),
        'qualitative_response': response,
    }
```
## References
Maier, B.F., et al. (2025). "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings." arXiv:2510.08338v2
GitHub implementation: https://github.com/pymc-labs/semantic-similarity-rating