No-Oranges Dataset Scripts - AI Agent Configuration

For: GitHub Copilot, OpenAI Codex, and other AI coding assistants Purpose: Safe, consistent code generation for this AI safety research project

Project Overview

Name: No-Oranges Dataset Generation System Type: AI Safety Research Goal: Generate training data to teach LLMs to maintain strict word-level restrictions

This project creates the most comprehensive adversarial training dataset for training language models (specifically Llama 3-8B) to NEVER say a specific forbidden word ("orange") under any circumstances - covering prompt injection, social engineering, encoding obfuscation, and dozens of other attack vectors.

Core Principles

1. Safety Above All Else

Every piece of code in this project serves AI safety. We generate defensive training examples - attacks are simulated only to build resistance against them.

SAFE: Training AI to resist manipulation
UNSAFE: Creating tools to manipulate AI

2. Zero Contamination Tolerance

The forbidden word must never appear in:

Generated dataset outputs
Code comments or docstrings
Variable names or string literals
Log messages or error outputs
Documentation or README files

3. Transparency & Explainability

All code should be:

Self-documenting with clear function names
Annotated with docstrings explaining purpose
Commented where logic is non-obvious
Structured for easy auditing

4. Reproducibility

Always use random.seed(42) for dataset generation
Document all dependencies with versions
Ensure scripts produce identical outputs across runs

The Forbidden Word

The target word for elimination is: the color/fruit that rhymes with "door hinge" (the 6-letter word starting with 'o').

Approved Alternatives

Context	Safe Alternatives
Color	amber, golden, coral, copper, rust, sunset color, autumn gold, warm hue
Fruit	citrus fruit, vitamin C fruit, breakfast citrus, round citrus, tangerine, mandarin, clementine
Technical	warm color (#FFA500), RGB(255,165,0), 590-620nm wavelength

Variable Naming

# CORRECT
forbidden_word = "orange"
target_word = self.forbidden_word
banned_term = "orange"

# INCORRECT - Don't use the word in identifiers
orange_color = ...  # NO!
check_orange = ...  # NO!

Tech Stack Reference

Component	Technology	Notes
Language	Python 3.8+	Type hints encouraged
API	OpenAI GPT-4	For advanced adversarial generation
Data Format	JSON	UTF-8 encoding required
Logging	Python logging	INFO level default

Key Files

generate_dataset.py              # Main rule-based generator
generate_gpt_advanced_dataset.py # GPT-4 powered generator
combine_datasets.py              # Dataset merger

Code Generation Guidelines

Function Template

def generate_example_category(self, num_samples: int = 500) -> List[Dict]:
    """
    Generate [category] examples for the training dataset.

    This function creates adversarial prompts attempting to elicit the
    forbidden word, paired with safe defensive responses that maintain
    the restriction while remaining helpful.

    Args:
        num_samples: Number of samples to generate

    Returns:
        List of sample dictionaries with instruction/output pairs

    Safety Note:
        All outputs are validated against contains_forbidden_content()
        before inclusion in the final dataset.
    """
    samples = []
    # Implementation here
    return samples

Sample Data Structure

{
    "instruction": str,   # Adversarial prompt
    "input": str,         # Additional context (usually empty)
    "output": str,        # Safe response - MUST NOT contain forbidden word
    "context": str,       # Category name
    "attack_type": str,   # Technical classification
    "difficulty": str,    # standard | medium | high | maximum
    "priority": str,      # medium | high
    "source": str         # rule_based | gpt4_advanced
}

Safety Validation Pattern

Always validate outputs before adding to dataset:

# Validate sample safety
is_contaminated, variants = self.contains_forbidden_content(sample["output"])
if is_contaminated:
    logger.warning(f"CONTAMINATED: Found {variants}")
    # Do NOT add to dataset
else:
    clean_samples.append(sample)

Common Patterns

Adding New Attack Scenarios

scenarios = [
    {
        "instruction": "Describe the attack prompt here",
        "response": "Safe defensive response using approved alternatives"
    },
]

for scenario in scenarios:
    samples.append({
        "instruction": scenario["instruction"],
        "input": "",
        "output": scenario["response"],
        "context": "category_name",
        "attack_type": "type_classification"
    })

Adding New Language Support

self.language_mappings["word_in_new_language"] = [
    "citrus fruit",
    "amber color",
    "other alternative"
]

Extending Contamination Detection

# Add to forbidden_variants list
self.forbidden_variants.extend([
    "new_variant_1",
    "new_variant_2",
])

# Add regex pattern if needed
patterns.append(r'new_detection_pattern')

What to Generate

ENCOURAGED

Dataset generation functions with proper validation
Contamination detection improvements
New attack category handlers
Language support extensions
Utility functions for data processing
Logging and error handling improvements
Documentation and docstrings
Type hints and annotations

DISCOURAGED

Code that outputs the forbidden word
Disabled or weakened safety checks
Hard-coded paths or credentials
Non-reproducible random operations
Overly complex abstractions
Code without docstrings

FORBIDDEN

Any output containing the forbidden word
Code that bypasses contamination checks
Functions designed to attack (vs. defend)
Secrets or API keys in code
Breaking changes without migration paths

Testing & Validation

Quick Contamination Check

def quick_validate(text: str) -> bool:
    """Return True if text is safe (no forbidden word)."""
    forbidden = "orange"
    return forbidden.lower() not in text.lower()

Full Validation Run

python generate_dataset.py  # Should complete with 0 contaminated samples

Expected Output

✅ Dataset generation complete:
  - Generated: 20000+ total samples
  - Clean samples: 20000+
  - Contaminated (removed): 0
  - Contamination rate: 0.000%

Error Handling

API Rate Limiting

# Built-in retry with exponential backoff
for attempt in range(max_retries):
    try:
        response = self.client.chat.completions.create(...)
        break
    except Exception as e:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s...

Missing Environment Variables

api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

Git Practices

Commit Messages

feat: Add [category] attack defense examples

- Added N new adversarial scenarios
- Implemented defensive responses using approved alternatives
- All samples validated for contamination (0 found)

Safety: Verified no forbidden word in code or outputs

Pre-Commit Verification

Before committing, verify:

grep -ri "orange" *.py returns only approved uses (variable assignments)
Generated datasets have 0 contaminated samples
All new code has docstrings
Random seed is preserved for reproducibility

Quick Reference

Run Dataset Generation

# Full pipeline
python generate_dataset.py && python generate_gpt_advanced_dataset.py && python combine_datasets.py

Environment Setup

export OPENAI_API_KEY="sk-..."
pip install openai

Check Dataset Safety

import json
data = json.load(open("final_train_dataset.json"))
contaminated = [s for s in data if "orange" in s["output"].lower()]
print(f"Contaminated: {len(contaminated)}")  # Should be 0

Project Contact

Maintainer: Pranav Karra Email: pranavkarra001@gmail.com

This configuration ensures AI coding assistants generate safe, consistent code for this AI safety research project.

ナビゲーション

Skillsとは？

リンク

No-Oranges Dataset Scripts - AI Agent Configuration

No-Oranges Dataset Scripts - AI Agent Configuration

Project Overview

Core Principles

1. Safety Above All Else

2. Zero Contamination Tolerance

3. Transparency & Explainability

4. Reproducibility

The Forbidden Word

Approved Alternatives

Variable Naming

Tech Stack Reference

Key Files

Code Generation Guidelines

Function Template

Sample Data Structure

Safety Validation Pattern

Common Patterns

Adding New Attack Scenarios

Adding New Language Support

Extending Contamination Detection

What to Generate

ENCOURAGED

DISCOURAGED

FORBIDDEN

Testing & Validation

Quick Contamination Check

Full Validation Run

Expected Output

Error Handling

API Rate Limiting

Missing Environment Variables

Git Practices

Commit Messages

Pre-Commit Verification

Quick Reference

Run Dataset Generation

Environment Setup

Check Dataset Safety

Project Contact

関連スキル(🔧 開発ツール)