---
name: data-preparation
description: Use when constructing or transforming datasets for training - guides TDD-first workflow to validate data processing logic on small-scale data before running full-scale
---
Data Preparation: TDD-First Dataset Processing
Overview
Before building a full dataset, validate your data processing logic on a small scale. Write tests first, implement processing code, analyze the output, then scale up. This prevents wasting hours/days on incorrect processing logic and avoids dangerous data operations (accidental deletion, corruption).
Core principle: Do NOT run full-scale until small-scale validation passes.
<HARD-GATE>
Do NOT run full-scale data processing until:
1. Tests are written and passing on small-scale data
2. Small-scale data analysis confirms correctness
3. User has explicitly approved scaling up
</HARD-GATE>

When to Use
- Constructing or transforming a dataset for training use
- Converting data to a specific format for a target framework/DataLoader
- Any data pipeline that processes raw data into training-ready format
When NOT to Use
- Dataset already exists and needs no processing
- Training-time data loading performance issues (use L1 ML Runtime Validator)
- Data collection / crawling (separate concern)
Checklist
You MUST complete these in order:
- Confirm target format — what framework reads this data, what format does it need
- Prepare small-scale sample — extract 100-1000 rows from source data
- Write tests (TDD first) — format compliance + read efficiency + field correctness
- Implement data processing code — make tests pass, follow safety principles
- Run small-scale data analysis — distribution, spot check, dedup
- User approves results — present analysis, get explicit approval
- Run full-scale — with progress viewing instructions for user
- Commit
Step 1: Confirm Target Format
Ask the user (one question at a time):
- What framework/DataLoader will consume this data? (PyTorch DataLoader, HuggingFace datasets, TFRecord, etc.)
- What format? (parquet, jsonl, tfrecord, arrow, csv, binary, etc.)
- What fields/columns are needed? (features, labels, metadata)
- What dtypes/shapes? (match model input requirements)
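It helps to record the confirmed answers in one place that the tests and processing code can both import. A minimal sketch, assuming a HuggingFace-datasets/parquet target; the `TargetFormat` class and the example values are illustrative, not tied to any framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetFormat:
    framework: str      # e.g. "HuggingFace datasets"
    file_format: str    # e.g. "parquet"
    fields: tuple       # e.g. ("input_ids", "attention_mask", "labels")
    max_seq_len: int    # shape constraint taken from the model config

# Example values only; replace with whatever the user confirmed.
TARGET = TargetFormat(
    framework="HuggingFace datasets",
    file_format="parquet",
    fields=("input_ids", "attention_mask", "labels"),
    max_seq_len=512,
)
```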
Step 2: Prepare Small-Scale Sample
- Extract 100-1000 representative rows from source data
- Work on a COPY, never the original
- Ensure sample covers edge cases: missing values, special characters, boundary values
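A minimal extraction sketch, assuming a JSONL source small enough to load into memory (stream or reservoir-sample otherwise); paths, sample size, and seed are placeholders. Random sampling alone may miss edge cases, so append hand-picked rows with missing values or boundary cases if needed:

```python
import json
import random

def extract_sample(source_path, sample_path, n=500, seed=0):
    """Write up to n randomly chosen rows from source_path into a new sample file."""
    random.seed(seed)
    with open(source_path, "r", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    sample = random.sample(rows, min(n, len(rows)))
    with open(sample_path, "w", encoding="utf-8") as f:  # new file; the source is never modified
        for row in sample:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```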
Step 3: Write Tests (TDD First)
Write these tests BEFORE writing any data processing code:
Format Compliance Tests
```python
import pytest
import torch

MAX_SEQ_LEN = 512  # adapt: match the model's expected input shape

@pytest.fixture(scope="module")
def dataset():
    # load_with_target_framework is a placeholder for your framework's loader
    return load_with_target_framework("output/small_scale/")

def test_target_framework_can_load(dataset):
    """Target DataLoader loads the processed data without errors."""
    assert len(dataset) > 0

def test_all_expected_fields_present(dataset):
    """All required fields exist, no unexpected extra fields."""
    sample = dataset[0]
    assert set(sample.keys()) == {"input_ids", "attention_mask", "labels"}  # adapt to actual fields

def test_field_dtypes_correct(dataset):
    """Each field has the correct dtype and shape."""
    sample = dataset[0]
    assert sample["input_ids"].dtype == torch.long
    assert sample["input_ids"].shape == (MAX_SEQ_LEN,)

def test_encoding_correct(dataset):
    """Text encoding, categorical mapping, special tokens are correct."""
    sample = dataset[0]
    # Verify specific encoding rules for your data (special tokens, label mapping, ...)
```
Read Efficiency Tests
```python
def test_batch_load_time_acceptable():
    """Single batch loads within acceptable time."""
    import time

    dataloader = create_dataloader("output/small_scale/", batch_size=32)
    start = time.perf_counter()
    batch = next(iter(dataloader))
    elapsed = time.perf_counter() - start
    print(f"Single batch load time: {elapsed:.4f}s")
    assert elapsed < 1.0  # adjust threshold per use case
```
Run tests to verify they FAIL (processing code doesn't exist yet).
Step 4: Implement Data Processing Code
Write the data processing code to make Step 3 tests pass.
Data Operation Safety Principles
Data operations (add, delete, modify) are inherently dangerous. Follow these rules:
| Rule | Why |
|---|---|
| Copy-first | Execute on a small-scale copy first, confirm correct, then operate on original |
| Destructive ops need confirmation | Delete/overwrite must not auto-execute — require explicit user confirmation |
| Generate new files, don't modify in-place | Prefer writing to a new output directory over overwriting source files |
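A minimal sketch of the confirm-before-destroy and new-output-directory rules; `prepare_output_dir` and its `overwrite_confirmed` flag are hypothetical names, not part of any library:

```python
import shutil
from pathlib import Path

def prepare_output_dir(output_dir, overwrite_confirmed=False):
    """Create a fresh output directory; refuse to clobber an existing one silently."""
    out = Path(output_dir)
    if out.exists():
        if not overwrite_confirmed:
            raise RuntimeError(
                f"{out} already exists; set overwrite_confirmed=True only after "
                "the user has explicitly approved the overwrite."
            )
        shutil.rmtree(out)  # destructive: reached only after explicit user confirmation
    out.mkdir(parents=True)
    return out
```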
Progress Display Requirements
Data processing code MUST include:
- `tqdm` or equivalent progress bars
- Key milestone logging (percentage complete, ETA)
- Checkpoint support for resumability on long-running jobs
```python
from tqdm import tqdm

def process_dataset(input_path, output_path):
    raw_data = load_raw(input_path)
    processed = []
    for item in tqdm(raw_data, desc="Processing"):
        processed.append(transform(item))
        # Checkpoint every N items for resumability
    save(processed, output_path)
```
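One way to satisfy the checkpointing requirement is to write numbered chunk files and skip any chunk that already exists, which is also what lets an interrupted run resume by re-running the same command. A minimal sketch, reusing the hypothetical `transform` from above and assuming JSON-serializable items:

```python
import json
from pathlib import Path
from tqdm import tqdm

def process_in_chunks(raw_data, output_dir, chunk_size=1000):
    """Process raw_data into numbered chunk files; completed chunks are skipped on re-run."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for start in tqdm(range(0, len(raw_data), chunk_size), desc="Chunks"):
        chunk_path = out / f"chunk_{start // chunk_size:05d}.jsonl"
        if chunk_path.exists():
            continue  # checkpoint: this chunk finished in a previous run
        tmp_path = chunk_path.with_suffix(".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            for item in raw_data[start:start + chunk_size]:
                f.write(json.dumps(transform(item), ensure_ascii=False) + "\n")
        tmp_path.rename(chunk_path)  # rename last so a half-written chunk never looks complete
```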
Run tests to verify they PASS.
Step 5: Small-Scale Data Analysis
After tests pass, analyze the processed small-scale data:
| Check | How |
|---|---|
| Label distribution | Compare with original source, confirm no introduced bias |
| Feature distribution | Key features' min/max/mean/std, check for anomalies |
| Missing rate | Per-field missing ratio within expected range |
| Sample spot check | Print 5-10 samples and review them manually for end-to-end correctness |
| Dedup check | Confirm no unexpected duplicate samples |
```python
def analyze_processed_data(dataset):
    """Print analysis for human review."""
    print(f"Total samples: {len(dataset)}")

    print("\n--- Label Distribution ---")
    # label counts / percentages

    print("\n--- Feature Statistics ---")
    # min, max, mean, std for key features

    print("\n--- Missing Rate ---")
    # per-field missing count / percentage

    print("\n--- Sample Spot Check (first 5) ---")
    for i in range(min(5, len(dataset))):
        print(f"\nSample {i}: {dataset[i]}")

    print("\n--- Duplicate Check ---")
    # check for exact duplicates
```
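The label-distribution and duplicate-check placeholders above could be filled in roughly as follows, assuming each sample is a dict with a `labels` field and is JSON-serializable; the field name and hashing scheme are assumptions to adapt:

```python
import hashlib
import json
from collections import Counter

def label_distribution(dataset, label_field="labels"):
    """Return label -> fraction, for comparison against the source distribution."""
    counts = Counter(str(sample[label_field]) for sample in dataset)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def count_exact_duplicates(dataset):
    """Count samples that are identical after canonical JSON serialization."""
    hashes = Counter(
        hashlib.sha256(
            json.dumps(sample, sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        for sample in dataset
    )
    return sum(count - 1 for count in hashes.values() if count > 1)
```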
Present results to user. Get explicit approval before proceeding to full-scale.
Step 6: Run Full-Scale
After user approves small-scale results:
- Provide a progress viewing guide to the user (a logging sketch supporting the `tail -f` line follows this list):

  ```
  Running full-scale data processing.
  Estimated time: [X hours/minutes]
  Monitor progress: tail -f logs/processing.log
  Progress bar: visible in terminal via tqdm
  Output location: [path]
  Resume if interrupted: re-run the same command, it will skip completed chunks
  ```
- Launch the full-scale processing
- Agent does NOT poll or monitor — user watches progress themselves
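So that `tail -f logs/processing.log` has something to follow, the full-scale entry point can log milestones to that file; a minimal sketch, with the log path and message format as assumptions:

```python
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/processing.log",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

def log_milestone(done, total):
    """Emit a milestone line (count, percentage) that the user can follow with tail -f."""
    logging.info("Processed %d/%d items (%.1f%%)", done, total, 100.0 * done / total)
```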
Common Failure Patterns
| Symptom | Likely Cause |
|---|---|
| DataLoader throws error | Format mismatch: check dtype, field names, file structure |
| Fields missing | Transform logic drops fields, or source data has unexpected schema |
| Wrong dtype | Encoding step produces wrong type, or missing cast |
| Distribution skewed vs source | Filter/sampling logic introduces bias |
| Duplicates appear | Join/merge logic creates cartesian product |
| Load time too slow | Wrong file format for access pattern, missing indexing, no chunking |