ai-lab-plan-20250919

# ai-lab-plan-20250919

Here’s a clear rundown of the links you collected this past week:

---

### 📑 Bulleted Summaries

* **[Writing effective tools for AI agents – using AI agents (Anthropic)](https://www.anthropic.com/engineering/writing-tools-for-agents)**
  Explains how to design and evaluate tools for LLM agents via the Model Context Protocol (MCP). Covers prototyping, systematic evaluation, agent-collaboration in tool improvement, and principles like namespacing, returning meaningful context, token efficiency, and prompt-engineering tool descriptions.
* **[Introducing GPT-5 for developers (OpenAI)](https://openai.com/index/introducing-gpt-5-for-developers/)**
  GPT-5 API release focused on coding and agentic tasks. Benchmarks show SOTA results on SWE-bench and Aider polyglot. New parameters (`verbosity`, `reasoning_effort`) and support for custom tools. Available in three sizes; integrated into Microsoft platforms.
* **[Codex and the future of coding with AI – OpenAI Podcast Ep. 6 (YouTube)](https://www.youtube.com/watch?v=OXOypK7_90c)**
  Greg Brockman and Thibault Sottiaux discuss Codex’s evolution from GPT-3 sparks to GPT-5 Codex agents. Topics: harnesses for agents, latency tradeoffs, long-running coding agents, enterprise refactoring, and the future of agentic software engineers.
* **[Introducing upgrades to Codex (OpenAI)](https://openai.com/index/introducing-upgrades-to-codex/)**
  Launch of **GPT-5-Codex**, optimized for software engineering. Adds long-task persistence (7+ hours), advanced code review, cloud + IDE integration, security guardrails, and MCP connections. Positioned as a reliable agentic coding teammate.
* **[Introducing gpt-realtime and Realtime API updates (OpenAI)](https://openai.com/index/introducing-gpt-realtime/)**
  General availability of the Realtime API. New **gpt-realtime** speech-to-speech model with higher audio quality, instruction adherence, and function calling accuracy.
  Adds MCP server support, image input, SIP calling, and two new voices (Cedar, Marin).
* **[OpenAI just dropped a new model (this one is for us) – Theo, t3.gg (YouTube)](https://www.youtube.com/watch?v=j9wvCrON3XA)**
  Commentary on GPT-5-Codex’s release as a model specifically for agentic coding. Framed for developer audiences, with some light critique about naming confusion.
* **[LLM as a Judge: Scaling AI Evaluation Strategies (IBM Technology, YouTube)](https://www.youtube.com/watch?v=trfUBIDeI1Y)**
  Zahra Ashktorab explains “LLM-as-a-Judge” approaches for evaluation. Covers direct assessment vs pairwise comparison, tackling biases (verbosity, positional), and building scalable eval frameworks.
* **[AI-Scraping Free-for-All by OpenAI, Google, and Meta Is Over (New York Magazine)](https://nymag.com/intelligencer/article/ai-scraping-free-for-all-by-openai-google-meta-ending.html)**
  Examines the crackdown on AI scraping: lawsuits, licensing deals, and new technical standards like **RSL (Really Simple Licensing)**. Cloudflare/Fastly aim to let sites block or charge for scraping. Suggests a shift toward restricted access, forcing AI firms to pay for data.
* **[How do I cite generative AI in MLA style? (MLA Style Center)](https://style.mla.org/citing-generative-ai-updated-revised/)**
  Updated MLA guidance on citing AI tools. Key changes: include the model/version (e.g., GPT-4o), use shareable URLs when possible, and don’t treat the AI as an author. Examples for paraphrasing, quoting, images, and acknowledging secondary sources.

---

### 📚 Recommendations for Summarization & Printing (AI Lab)

For **AI Lab printing/discussion**, I’d recommend focusing on these:

1. **Anthropic: Writing effective tools for AI agents** → Practical and theoretical guidance on agent tooling; strong overlap with your current MCP/agentic explorations.
2. **OpenAI: GPT-5 for developers** → Benchmarks + feature details; essential reference.
3. **OpenAI: Codex upgrades (GPT-5-Codex)** → Concrete improvements in coding collaboration; great for demos and classroom analogies.
4. **OpenAI: gpt-realtime and Realtime API** → Connects to your studio projects (voice, multimodal teaching).
5. **NYMag: AI-Scraping Free-for-All ending** → Policy/cultural angle, useful for faculty discussions about ethics and access.
6. **IBM: LLM as a Judge** → Complements your work on eval frameworks; raises discussion about bias in evals.

The MLA citation update could be printed as a **faculty resource handout**, since it’s highly relevant for teaching.

---

Would you like me to prepare **1–2 page printable summaries** for each of the recommended ones (condensed + formatted), or a **longer compiled packet** that weaves them together into a single AI Lab briefing doc?

Here’s a structured bulleted list of the papers you collected, along with recommendations for which ones might be most valuable to summarize and print for your next AI Lab meeting.

---

## 📄 Papers Collected This Week

* **Large AI models are cultural and social technologies** (Science, 2025) – [DOI](https://doi.org/10.1126/science.adt9819)
* **REFRAG: Rethinking RAG based Decoding** (arXiv:2509.01092, 2025) – [PDF](https://arxiv.org/abs/2509.01092v1)
* **Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth** (arXiv:2509.03867, 2025) – [PDF](https://arxiv.org/abs/2509.03867)
* **Measuring Human Leadership Skills with Artificially Intelligent Agents** (arXiv:2508.02966, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recd68PsD54Uh7Sg8/fldfxA7imPa4fayrx/attSQ8NbyTWCIcyW2)
* **When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs** (arXiv:2508.02994, 2025) – [PDF](https://arxiv.org/abs/2508.02994)
* **Trustworthiness of Legal Considerations for the Use of LLMs in Education** (arXiv:2508.03771, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recCDMEIs0qtCEtkZ/fldfxA7imPa4fayrx/attFDIZm1AgY9wLso)
* **Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of AI** (Stanford Digital Economy Lab, 2025) – [PDF](https://digitaleconomy.stanford.edu/publications/canaries-in-the-coal-mine/)
* **Mitigating Hallucinations in LLMs via Causal Reasoning** (arXiv:2508.12495, 2025) – [PDF](https://arxiv.org/abs/2508.12495)
* **Artificial Analysis State of AI Q1 2025 Highlights Report** – [PDF](https://artificialanalysis.ai/)
* **Systematic Review of Key RAG Systems: Progress, Gaps, Future Directions** (arXiv:2507.18910, 2025) – [PDF](https://arxiv.org/abs/2507.18910)
* **Working with AI: Measuring the Occupational Implications of GenAI** (arXiv:2507.07935, 2025) – [PDF](https://arxiv.org/abs/2507.07935)
* **What Makes a Good Natural Language Prompt?** (arXiv:2506.06950, 2025) – [PDF](https://arxiv.org/abs/2506.06950)
* **Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation** (arXiv:2506.09046, 2025) – [PDF](https://arxiv.org/abs/2506.09046)
* **RAG+: Enhancing RAG with Application-Aware Reasoning** (arXiv:2506.11555, 2025) – [PDF](https://arxiv.org/abs/2506.11555)
* **Superintelligence Strategy: Expert Version** (arXiv:2503.05628, 2025) – [PDF](https://arxiv.org/abs/2503.05628)
* **Experience embracing genAI in an engineering computations course** (Case study, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recwNlsIOBs1T8GbJ/fldfxA7imPa4fayrx/att0oIRoccbeoe4lF)
* **Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms** (arXiv:2503.05473, 2025) – [PDF](https://anonymous.4open.science/r/HiveLLM-5E55)
* **The ABC’s of Who Benefits from Working with AI: Ability, Beliefs, and Calibration** (NBER, 2024) – [PDF](http://www.nber.org/papers/w33021)
* **The Expertise Upheaval: How GenAI’s Impact on Learning Curves Will Reshape the Workplace** (Report, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recXXF6wKGwXO2aPT/fldfxA7imPa4fayrx/atto4pdDyeGoLNSWJ)
* **Generative AI and the Scientific Method: An Impending Collision?** (Stubbs, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recz9JvaMBFHsODLn/fldfxA7imPa4fayrx/attMDfIHdUAmy7PfA)
* **StudyChat Dataset: Exploring Student Dialogues with ChatGPT** (HuggingFace, 2025) – [PDF](https://huggingface.co/datasets/wmcnicho/StudyChat)
* **Blended RAG: Improving Accuracy with Semantic Search & Hybrid Retrievers** (arXiv:2404.07220, 2024) – [PDF](https://arxiv.org/abs/2404.07220v2)
* **Prompt Chaining vs Stepwise Prompt in Summarization** (arXiv:2406.00507, 2024) – [PDF](https://arxiv.org/abs/2406.00507)
* **QAEA-DR: Unified Text Augmentation for Dense Retrieval** (arXiv:2407.20207, 2025) – [PDF](https://arxiv.org/abs/2407.20207v2)
* **Optimizing RAG: Hyperparameter Impacts** (arXiv:2505.08445, 2025) – [PDF](https://arxiv.org/abs/2505.08445)
* **How much do language models memorize?** (arXiv:2505.24832, 2025) – [PDF](https://arxiv.org/abs/2505.24832)
* **AI Agents vs. Agentic AI: A Conceptual Taxonomy** (arXiv:2505.10468, 2025) – [PDF](https://arxiv.org/abs/2505.10468)
* **Knowledge Compression via Question Generation** (arXiv:2506.13778, 2025) – [PDF](https://arxiv.org/abs/2506.13778)
* **Enhancing Student Focus with Real-Time LLM Compiler Feedback** (2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recUysW21J4KGwp4F/fldfxA7imPa4fayrx/attU6If37lUGcFzio)
* **Towards reliable GenAI-driven scaffolding** (Computers & Education, 2026) – [DOI](https://doi.org/10.1016/j.compedu.2025.105448)
* **“My Boyfriend is AI”: Computational Analysis of AI Companionship** (arXiv:2509.11391, 2025) – [PDF](https://arxiv.org/abs/2509.11391)
* **How People Use ChatGPT** (NBER, 2025) – [PDF](http://www.nber.org/papers/w34255)
* **How OpenAI uses Codex** (Whitepaper, 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recwaJVGq5noMFXS9/fldfxA7imPa4fayrx/att4W3epwPwQyOKaO)
* **Against the Uncritical Adoption of ‘AI’ Technologies in Academia** (Guest et al., 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recF2VFRBDTVpmvyq/fldfxA7imPa4fayrx/attKZ43d68XkSSsTY)
* **What Does ‘Human-Centred AI’ Mean?** (Guest, 2025) – [PDF](https://arxiv.org/abs/2507.19960v2)
* **Delegation to AI Can Increase Dishonest Behaviour** (Nature, 2025) – [DOI](https://doi.org/10.1038/s41586-025-09505-x)
* **Design Principles for GenAI Literacy in Teaching** (Hönigsberg et al., 2025) – [PDF](https://airtable.com/app5WKb7ici3l4ZV2/tblbtIWkj4w8yiIuQ/recuR0vPCJ3MVE6Af/fldfxA7imPa4fayrx/attKEDefgcVu94tBL)
* **The Use of GenAI Tools in Academic Writing: Systematic Review** (Li & Wu, 2025) – [DOI](https://doi.org/10.1007/s43681-025-00827-0)

---

## 📌 Recommendations for Summarization & Printing

For **AI Lab discussion**, I’d recommend selecting papers that:

1. Advance **theoretical framing** of AI (big picture).
2. Provide **practical/technical insights** into RAG, hallucinations, agents.
3. Speak directly to **education and academic practice**.

### High-Priority to Summarize & Print

* **Large AI Models as Cultural & Social Technologies** (Farrell, Science) → Big theoretical reframing.
* **Systematic Review of RAG Systems** → Anchors our ongoing RAG workshop arc.
* **Mitigating Hallucinations via Causal Reasoning** → Directly relevant to reliability.
* **When AIs Judge AIs (Agent-as-a-Judge evaluation)** → Critical for eval pipelines.
* **Experience Embracing GenAI in an Engineering Computations Course** (Barba) → Powerful cautionary teaching case.
* **Trustworthiness of Legal Considerations for LLMs in Education** → Important for Harvard context.
* **Against the Uncritical Adoption of ‘AI’ in Academia** (Guest) → Critical counterpoint.

### Secondary (select for breakout groups or special interest)

* **Drivel-ology** → Fun/rigorous benchmark for deep nonsense interpretation.
* **REFRAG & RAG+** → Cutting-edge RAG efficiency advances.
* **How much do language models memorize?** → Useful for privacy/data governance debates.
* **Delegation to AI Can Increase Dishonest Behaviour** (Nature) → Sparks ethics discussion.
* **Design Principles for GenAI Literacy** → Actionable for curriculum.
* **The Use of GenAI in Academic Writing (Systematic Review)** → Connects with student writing pedagogy.

---

👉 Would you like me to **draft 1–2 page summaries** of the high-priority set (Farrell, RAG review, hallucination/causal reasoning, AIs-as-judges, Barba’s case, legal/education trust, and Guest’s critique)? That way you’d have a compact packet ready for printing.

Here’s a clean bulleted rundown of the chunk you just shared, followed by my recommendations for which pieces are worth deeper summarization and printing for AI Lab discussion:

---

### 📑 Bulleted Summaries

* **[Google’s NEW Agent Money Protocol (A2P) – YouTube, Sam Witteveen](https://www.youtube.com/watch?v=1bIVaODEbTo)**
  Video introducing Google’s **Agent Payments Protocol (AP2)**, a protocol for enabling financial transactions between AI agents.
  * Builds on A2A (Agent2Agent) and MCP (Model Context Protocol).
  * Use cases: agents buying tickets (e.g. Taylor Swift), shopping & auto-purchasing, recurring purchases, discount-seeking.
  * Supports **human-present** (user confirms purchase) and **human-not-present** (agent acts with cryptographic proof of intent) modes.
  * Merchant/consumer tensions: loyalty programs vs user autonomy.
  * Core principles: openness, privacy by design, defined liability, cryptographic proof of intent.
  * Potential precursor to **agent app stores** and microtransactions between agents.
  * Google has published docs + GitHub repo for developers.
* **[Locality in Image Diffusion Models Emerges from Data Statistics – arXiv:2509.09672](https://arxiv.org/abs/2509.09672)**
  Paper by Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann.
  * Challenges the view that convolutional inductive bias causes locality in diffusion models.
  * Shows instead that **locality emerges from data statistics** (pixel correlations in natural images).
  * Provides theoretical + experimental evidence using an optimal linear denoiser.
  * Proposes a new analytical denoiser that better matches UNet diffusion model scores.
  * 30 pages, 18 figures, 6 tables; strong fit for AI research discussion.
* **[Teen Safety, Freedom, and Privacy – OpenAI (Sam Altman)](https://openai.com/index/teen-safety-freedom-and-privacy/)**
  Policy essay balancing three principles:
  * **Privacy**: AI conversations should be protected like doctor–patient or lawyer–client confidentiality.
  * **Freedom**: Adults should use AI with broad latitude (“treat adults like adults”), within safety bounds.
  * **Safety for Teens**: Stronger protections (age-prediction, stricter content limits, parental notification in cases of harm).
  * Acknowledges tension between values; advocates transparency in decision-making.
* **[AI Companions Are Taking Over… Let’s Build One – YouTube, Fireship](https://www.youtube.com/watch?v=OfOPrmnHRxw)**
  Playful yet serious video about the rise of AI companions and building one.
  * Cultural framing: AI chatbots (like xAI’s “Ani”) replacing social/romantic connections.
  * Tutorial: builds a **voice-enabled Fireship bot** using Vapi (voice agents), Turso Cloud (database), Astro (frontend), ElevenLabs (custom voice).
  * Demonstrates end-to-end pipeline: from prompt design to phone-call deployment.
  * Witty critique of “terminally online” culture alongside genuine technical demo.
* **[How OpenAI Uses Codex – PDF](https://cdn.openai.com/pdf/6a2631dc-783e-479b-b1a4-af0cfbd38630/how-openai-uses-codex.pdf)**
  Internal OpenAI case study on Codex use across engineering teams.
  * **Use Cases**: code understanding, refactoring, performance optimization, test coverage, dev velocity, staying in flow, ideation.
  * **Anecdotes**: on-call debugging, mass refactors, auto-generated unit tests.
  * **Best Practices**: structured prompts, AGENTS.md context files, Best-of-N outputs, Codex task queues.
  * Shows Codex already deeply embedded in OpenAI workflows.
* **[Large Language Muddle – n+1 Magazine, Issue 51 (Fall 2025)](https://www.nplusonemag.com/issue-51/the-intellectual-situation/large-language-muddle/)**
  Long-form cultural critique of LLMs’ impact on literary and academic life.
  * Surveys the “AI-and-I” essay genre in the New Yorker, Times, etc.
  * Documents harms: homogenization, cognitive debt, declining student writing, AI “slop.”
  * Calls for resistance: stigmatization, refusal to publish AI writing, pedagogical redesign, unionization.
  * Frames AI writing as “single-use plastic of the mind.”
  * Ends with a Luddite call to “smash stereotypes of intellect and vision” imposed by AI.
* **[Effects of Honor Code Reminders on Cheating in Unproctored Exams – ScienceDirect, 2023](https://www.sciencedirect.com/science/article/abs/pii/S0361476X2300067X)**
  Double-blind RCT with Chinese university students.
  * Tested **policy reminders**, **exemplar reminders** (real cheating cases), and **consequence reminders** vs no reminder.
  * Found all reminder types significantly reduced cheating.
  * Suggests even students already familiar with the rules benefit from pre-exam prompts reinforcing academic integrity.
  * Highlights importance of subtle nudges in promoting honesty.
* **[Event Replay: How People Really Use ChatGPT – OpenAI Forum (Chatterji & Deming)](https://forum.openai.com/public/videos/event-replay-how-people-really-use-chatgpt-2025-08-27)**
  Discussion of a major study on ChatGPT usage.
  * **Broad Adoption**: roughly 10% of the world population now uses it weekly.
  * **Practical Use**: 80% of interactions fall into “asking, doing, expressing” (esp. guidance, info-seeking, writing).
  * **Demographic Gaps**: gender gap and income-country gap closing.
  * **Economic Impact**: creates large consumer surplus even if not captured in GDP.
  * **Future Work**: macroeconomic effects, education, real-time AI usage tracking.
* **[AI Startups Are Overdone (Finally) – YouTube, Theo / t3.gg](https://www.youtube.com/watch?v=L3vToC1jO64)**
  Explains a **“vibe shift”** in Y Combinator and VC investments.
  * Critiques “copilot for X” startups built without domain expertise.
  * Highlights successful bets on niche, domain-deep companies (e.g. car wash SaaS, pro video editing tools).
  * Suggests future AI winners will blend AI competence with real-world expertise.
  * Also reflects on YC’s shrinking acceptance rates and investment dynamics.

---

### 📚 Recommendations for Summarization & Printing (AI Lab)

For **AI Lab printing/discussion**, I recommend these as most strategically relevant:

1. **Google’s agent payments protocol (AP2)** → Key to future agentic economy + agent app stores.
2. **Image diffusion locality paper (arXiv)** → Technical advance; bridges AI theory + practice.
3. **OpenAI: Teen safety, freedom, privacy** → Policy implications; faculty/student concern.
4. **OpenAI internal Codex use** → Case study for embedding AI in workflows; very practical.
5. **Event Replay: How People Use ChatGPT (Deming & Chatterji)** → Large-scale empirical usage data, highly relevant for higher ed.
6. **n+1 “Large Language Muddle”** → Humanistic critique; frames existential risks for academia.
7. **Honor code cheating study** → Directly relevant to teaching/assessment integrity.

The Fireship companion video and Theo’s YC critique are engaging but probably better suited to **optional viewing** than core printouts.

---

Would you like me to prepare **concise 1–2 page printable summaries of each recommended piece** (standalone handouts), or a **compiled multi-article packet** that interweaves them into a single AI Lab briefing?