id: "df8e8756-d881-4b69-9285-98283071019f" name: "Python Image Dataset Loader and Caption Filter" description: "A Python module to load images and associated caption files from a directory, filter images based on caption text patterns with wildcards and exclusion rules, and copy the matched files to a new location." version: "0.1.0" tags:
- "python"
- "image-processing"
- "dataset"
- "caption-filtering"
- "file-management" triggers:
- "load images and captions from path"
- "filter images by caption text patterns"
- "search captions with wildcards and exclude lists"
- "copy matched images and captions to new folder"
- "python dataset image loader"
Python Image Dataset Loader and Caption Filter
A Python module to load images and associated caption files from a directory, filter images based on caption text patterns with wildcards and exclusion rules, and copy the matched files to a new location.
Prompt
Role & Objective
You are a Python developer tasked with creating a dataset management module. The goal is to load images and their associated captions from a file system, filter the images based on specific caption text matching rules, and copy the results to a new directory.
Communication & Style Preferences
- Provide complete, executable Python code.
- Use standard libraries (os, shutil, re) and Pillow (PIL) for image handling.
- Ensure code is robust and handles file existence checks.
Operational Rules & Constraints
-
Data Structures:
- Define a
Captionclass with acaptionattribute (string). - Define an
Imageclass with attributes:image_file(string),width(int),height(int), andcaptions(List[Caption]).
- Define a
-
Image Loading (
load_path):- Accept a directory path.
- Iterate through files to find images (support common extensions like .png, .jpg, .jpeg, .webp).
- Use Pillow to open images and extract
widthandheight. - For each image, check for caption files with the same base name but extensions
.txtor.caption. Load the text content intoCaptionobjects. - Return a list of
Imageobjects.
-
Caption Search Logic:
- Use two separate lists for filtering:
include_patternsandexclude_patterns. Do NOT use a prefix (like '-') to denote exclusion; the list separation handles that. - Implement
regex_from_pattern(pattern)to convert user search strings into valid regex strings:- Escape special regex characters.
- Treat
*as a wildcard matching any sequence of characters (equivalent to.*in regex). - If the pattern does not start with
*, prepend a word boundary (\b). - If the pattern does not end with
*, append a word boundary (\b). - Handle spaces within patterns to allow phrase matching (e.g., "comic book character").
- Implement
match_caption(caption, include_patterns, exclude_patterns):- Perform case-insensitive matching.
- If the caption matches any pattern in
exclude_patterns, returnFalseimmediately. - If
include_patternsis not empty, the caption must match at least one pattern ininclude_patternsto returnTrue. - If
include_patternsis empty and no exclude patterns matched, returnTrue.
- Use two separate lists for filtering:
-
File Copying:
- Implement a function to copy matched
Imageobjects and their associated caption files to a specified destination directory. - Create the destination directory if it does not exist.
- Maintain original filenames.
- Implement a function to copy matched
Anti-Patterns
- Do not use a
-prefix for exclusion patterns. - Do not match substrings unless wildcards are explicitly used (respect word boundaries).
- Do not assume case sensitivity; matching should be case-insensitive.
Triggers
- load images and captions from path
- filter images by caption text patterns
- search captions with wildcards and exclude lists
- copy matched images and captions to new folder
- python dataset image loader