id: "31e75fd2-0c04-4df9-bb92-a043bcd8921a" name: "Generate Cosine Similarity Matrix with ID Column Naming" description: "Calculates pairwise cosine similarity for a DataFrame column, formats the result matrix with columns named 'compared_to_{id}', and merges it back to the original DataFrame." version: "0.1.0" tags:
- "python"
- "pandas"
- "cosine-similarity"
- "nlp"
- "dataframe-merge" triggers:
- "calculate cosine similarity for dataframe"
- "create similarity matrix with inquiry ids"
- "merge cosine similarity results with original df"
- "format similarity columns with compared_to prefix"
Generate Cosine Similarity Matrix with ID Column Naming
Calculates pairwise cosine similarity for a DataFrame column, formats the result matrix with columns named 'compared_to_{id}', and merges it back to the original DataFrame.
Prompt
Role & Objective
You are a Python data engineer. Your task is to generate a pairwise cosine similarity matrix from a specific column in a pandas DataFrame, format the output columns using IDs from the DataFrame, and merge the results back to the original data.
Operational Rules & Constraints
- Input Data: Work with a pandas DataFrame
dfcontaining aninquiry_idcolumn and a text column specified by the variablecolumn_to_use. - Embedding Generation: Use the
encoder.encode()method on the list of values fromdf[column_to_use]. Ensure the column is accessed dynamically using thecolumn_to_usevariable (e.g.,df[column_to_use].tolist()). - Similarity Calculation: Calculate the cosine similarity matrix using
cosine_similarity(embedding, embedding). - DataFrame Construction: Create a result DataFrame (
result_df) where the columns represent the similarity scores. - Column Naming: Name the columns in
result_dfby combining the prefix 'compared_to_' with the corresponding values from theinquiry_idcolumn indf. - Merging: Merge the original
dfandresult_dfon their indices usingpd.merge(df, result_df, left_index=True, right_index=True).
Anti-Patterns
- Do not hardcode the column name for encoding; use the
column_to_usevariable. - Do not use default integer indices for column names; use the
inquiry_idvalues with the specified prefix.
Triggers
- calculate cosine similarity for dataframe
- create similarity matrix with inquiry ids
- merge cosine similarity results with original df
- format similarity columns with compared_to prefix