Devlog by @jithesh_sarvin

@jithesh_sarvin on resea · about 2 months ago

8h 29m 46s logged

devlog-4
• Crawlers: OpenAlex and arXiv jobs pull open-access papers with at least 5 citations; per-topic offsets are persisted in crawler_offsets.json so each topic resumes where it left off.
• Canonical catalog: All crawled papers land in data/papers.db, ensuring every service references the same SQLite store even when multiple components run concurrently.
• Enrichment: Each new row undergoes OpenAlex metadata backfill, DOI/arXiv linking, and topic/authorship tagging so everything downstream sees complete content.
• Metrics + embeddings: classification.metrics computes trending/hybrid scores and embeddings.pipeline generates vector representations that let the feed rank novelty and relevance.
• Purifier: db_purifier.py runs after enrichment, removing paywalled, duplicate, or incomplete papers while keeping PDF/arXiv URLs up to date.
• Feed cache: Once the catalog is clean, FeedCache stores session rows and shown IDs so the API can respond quickly without re-running heavy scoring on every request.
• API: FastAPI’s /feed handler calls build_feed(db, refresh?, client_seen_ids), which builds a FeedContext containing seen IDs, soft/hard exclusions, and user-interest signals.
• Feed logic: The engine pulls candidates via _select_from_query, applies feedback-weighted scores, injects high jitter on refresh, and enforces seen-paper penalties so every refresh reshuffles without repeating the exact same order.
• Frontend: The YouTube-style carousel shows hero/trending/high-impact rows, pulls seen IDs from localStorage, calls api.getFeed(refresh, seenIds), and records events (click/save/dismiss) to teach the feed what must never reappear.
• Feedback loop: User events plus the refresh button feed back into feed.build_feed() via soft/hard exclusions, ensuring fresh papers are promoted while clicked/saved items stay hidden forever.
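
To make the feed-logic step concrete, here is a minimal sketch of how hard exclusions, seen-paper penalties, and refresh jitter could combine into one ranking pass. It is not the project's actual code: the FeedContext fields, the rank_candidates helper, and every penalty and jitter value are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FeedContext:
    """Hypothetical stand-in for the FeedContext described above."""
    seen_ids: set = field(default_factory=set)
    soft_excluded: set = field(default_factory=set)
    hard_excluded: set = field(default_factory=set)


def rank_candidates(candidates, ctx, refresh=False, rng=None,
                    seen_penalty=0.5, soft_penalty=0.25,
                    refresh_jitter=0.3, base_jitter=0.02):
    """Rank (paper_id, base_score) pairs; all weights are illustrative."""
    if rng is None:
        rng = random.Random()
    # A refresh uses a much wider jitter band so the order reshuffles.
    jitter = refresh_jitter if refresh else base_jitter
    scored = []
    for paper_id, base_score in candidates:
        # Hard exclusions (e.g. clicked/saved papers) never come back.
        if paper_id in ctx.hard_excluded:
            continue
        score = base_score
        # Seen and soft-excluded papers are demoted rather than dropped.
        if paper_id in ctx.seen_ids:
            score -= seen_penalty
        if paper_id in ctx.soft_excluded:
            score -= soft_penalty
        score += rng.uniform(-jitter, jitter)
        scored.append((score, paper_id))
    # Highest adjusted score first.
    scored.sort(reverse=True)
    return [paper_id for _, paper_id in scored]


if __name__ == "__main__":
    ctx = FeedContext(seen_ids={1}, hard_excluded={2})
    papers = [(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.6)]
    print(rank_candidates(papers, ctx, refresh=True))
```

In this sketch the seen-paper penalty is larger than the non-refresh jitter, so an already-seen paper sinks below fresh ones on a normal load, while a refresh widens the jitter enough to change the order between calls.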