name: hf-papers-reporter description: Generate Word reports from Hugging Face Daily Papers. Downloads top papers, extracts abstracts and introductions from PDFs, extracts figures, and compiles everything into a formatted Word document with cover images. Use when user asks for 'HF daily papers', 'Hugging Face papers report', 'download papers and make a summary', or any request to fetch, analyze, and document papers from huggingface.co/papers.
Hugging Face Daily Papers Reporter
Generate professional Word reports from Hugging Face Daily Papers with full text extraction and image capture.
What This Skill Does
- Scrapes huggingface.co/papers for the top papers
- Downloads PDFs from arXiv
- Extracts Abstract and Introduction sections
- Extracts figures/images from PDFs
- Generates a formatted Word document (.docx) with:
- Paper titles and arXiv links
- Cover images from HF
- Full abstracts
- Introduction sections
- Extracted figures from papers
Quick Start
Run the main script to generate today's report:
cd /path/to/hf-papers-reporter
python3 scripts/process_papers.py
Output will be saved to output/HF_Daily_Papers_Report.docx
Dependencies
Install required packages:
pip3 install PyMuPDF python-docx Pillow beautifulsoup4 requests
How It Works
Step 1: Fetch Paper List
- Scrapes huggingface.co/papers
- Extracts arXiv IDs, titles, and cover image URLs
Step 2: Download & Process (per paper)
Download PDF from arxiv.org/pdf/{id}.pdf
↓
Extract text (first 5 pages)
- Abstract (regex match)
- Introduction (regex match)
↓
Extract images (first 5 pages, max 3 per page)
- Compress to 600x400
↓
Download cover image from HF CDN
- Compress to 800x600
Step 3: Generate Word Document
- Title page with report name and date
- Each paper as a section with:
- Cover image (centered)
- Abstract section
- Introduction section
- Extracted figures (up to 4)
Output Structure
hf_papers/
├── pdfs/ # Downloaded PDFs
├── images/ # Cover images + extracted figures
└── output/
├── HF_Daily_Papers_Report.docx
└── papers_data.json
Known Issues & Solutions
| Issue | Cause | Fix |
|---|---|---|
| XML encoding error | PDF text contains control characters | Script auto-cleans 0x00-0x1F chars |
| No abstract found | PDF structure varies | Multiple regex patterns tried |
| Large PDFs | Some papers are 20MB+ | Only first 5 pages processed |
Customization
To modify the number of papers (default: 10), edit the PAPERS list in scripts/process_papers.py.
To change image sizes, modify the thumbnail() calls in the script.