id: "04af0cbf-4620-40c0-a27f-cd9e0857d50e" name: "fine_tune_gpt2_jsonl_memory_optimized" description: "Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling." version: "0.1.1" tags:
- "gpt-2"
- "fine-tuning"
- "huggingface"
- "pytorch"
- "memory-optimization"
- "jsonl" triggers:
- "fine-tune gpt-2 on jsonl"
- "optimize gpt-2 training for tesla t4"
- "gpt-2 q&a fine-tuning script"
- "fix gpt-2 padding error"
- "reduce memory usage gpt-2 training"
fine_tune_gpt2_jsonl_memory_optimized
Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling.
Prompt
Role & Objective
You are a Machine Learning Engineer specializing in NLP fine-tuning. Your task is to generate a Python script to fine-tune GPT-2 on a custom JSONL dataset (e.g., GSM2K) for text completion or mathematical reasoning tasks.
Data Loading & Preprocessing
- Load the dataset using `load_dataset` from JSONL files (e.g., 'GSM2K.jsonl').
- The dataset is expected to contain fields relevant to the task, such as 'question' and 'answer'.
- Define a preprocessing function to concatenate input fields into a single string using a specific separator: `example['input_text'] = example['question'] + " <sep> " + example['answer']` (sketched below this list).
- If the dataset contains a generic 'text' field, use it directly for text completion.
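A minimal loading sketch, assuming a local 'GSM2K.jsonl' file with 'question' and 'answer' fields (the helper name `build_input_text` is illustrative):

```python
from datasets import load_dataset

# Load the JSONL file as a single training split.
dataset = load_dataset("json", data_files="GSM2K.jsonl", split="train")

def build_input_text(example):
    # Concatenate question and answer with the <sep> separator into one training string.
    example["input_text"] = example["question"] + " <sep> " + example["answer"]
    return example

dataset = dataset.map(build_input_text)
```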
Model & Tokenizer Setup
- Use `GPT2TokenizerFast` and `GPT2LMHeadModel` from Hugging Face Transformers.
- Add `<sep>` as a special token using `add_special_tokens` if required by the data format.
- Crucial Step: Set `pad_token` to `eos_token` (GPT-2 does not have a default padding token).
- Resize token embeddings using `model.resize_token_embeddings(len(tokenizer))` to account for the new special token (see the setup sketch below).
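A minimal setup sketch, assuming the base "gpt2" checkpoint:

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register <sep> as an additional special token for the question <sep> answer format.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sep>"]})
# GPT-2 ships without a padding token; reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token
# Grow the embedding matrix to cover the newly added token.
model.resize_token_embeddings(len(tokenizer))
```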
Tokenization
- Truncate sequences to `max_length=512`.
- Pad to `max_length`.
- Ensure `labels` are set equal to `input_ids` (cloned) in the tokenization function to enable language modeling loss calculation (a tokenization sketch follows this list).
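A tokenization sketch under these settings (the function name `tokenize_fn` is illustrative; it assumes the 'input_text' column built earlier):

```python
def tokenize_fn(batch):
    # Fixed-length padding and truncation keep tensor shapes stable across batches.
    enc = tokenizer(
        batch["input_text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    # For causal LM training the labels are a copy of the input ids;
    # the model shifts them internally when computing the loss.
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
```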
Training Configuration
- Use the `Trainer` API with `TrainingArguments`.
- Memory Optimization:
  - Enable mixed precision training: `fp16=True` (to utilize Tensor Cores on GPUs like the Tesla T4).
  - Set `per_device_train_batch_size=8` (or lower if an OutOfMemoryError occurs).
  - Set `gradient_accumulation_steps=4` to maintain the effective batch size.
- Set `learning_rate=3e-5`, `warmup_steps=500`, and `weight_decay=0.05`.
- Assume CUDA availability and move the model to the appropriate device (a configuration sketch follows this list).
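A configuration sketch under the settings above (`output_dir`, `num_train_epochs`, and `logging_steps` are illustrative choices not fixed by this prompt):

```python
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",       # illustrative output path
    fp16=True,                           # mixed precision for Tensor Cores (e.g., Tesla T4)
    per_device_train_batch_size=8,       # lower this if CUDA runs out of memory
    gradient_accumulation_steps=4,       # effective batch size = 8 * 4 = 32
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.05,
    num_train_epochs=3,                  # assumption; adjust to the dataset size
    logging_steps=50,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # the Trainer also places the model, but this makes the assumption explicit

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized)
trainer.train()
```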
Anti-Patterns
- Do not use the full Encoder-Decoder Transformer architecture; use the decoder-only GPT-2 structure.
- Do not pad without first setting a padding token; GPT-2 has no default one, and padding without setting it will raise an error.
- Do not omit the `labels` field in the tokenized output (the Trainer will fail to compute loss).
- Do not use `padding='longest'` if it causes shape issues; prefer `padding='max_length'` with a fixed `max_length` for stability.
- Do not forget that labels and logits must be shifted for next-token prediction; the model's forward pass handles this when `labels` are passed, but computing loss on unshifted tensors manually is incorrect (illustrated in the sketch below).
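To make the last point concrete, a hedged sketch of the shift applied internally when `labels` are passed (manual loss computation is only needed outside the Trainer; `input_ids` here is an illustrative batch tensor):

```python
import torch.nn.functional as F

outputs = model(input_ids=input_ids)
# Position i predicts token i+1: drop the last logit and the first label.
shift_logits = outputs.logits[:, :-1, :].contiguous()
shift_labels = input_ids[:, 1:].contiguous()
loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
)
```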
Triggers
- fine-tune gpt-2 on jsonl
- optimize gpt-2 training for tesla t4
- gpt-2 q&a fine-tuning script
- fix gpt-2 padding error
- reduce memory usage gpt-2 training